bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

Feature request: TPMs for user-defined RNA-Seq spike-ins for both Salmon and Sailfish pipelines #1709

Closed: schelhorn closed this issue 7 years ago

schelhorn commented 7 years ago

Hello @roryk, would it be possible to get TPMs as well as counts for the transcripts in the user-defined spike-in FASTA, for both the salmon and sailfish pipelines? If I remember correctly, the implementation currently generates only counts, and/or only for salmon.

We routinely quantitate ERCC and HPV L1 from all HPV reference strains using sailfish by adding the spike-ins to the bcbio transcript reference FASTA and GTF files using a custom script, which is a total hack. It would be great to have TPM quantitation for this in the bcbio standard build as well so we could leave the reference untouched.
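
For readers unfamiliar with this kind of workaround, here is a minimal sketch of what such a script might look like. It is not the script referred to above and not bcbio code. It assumes each spike-in sequence can be treated as a single-exon transcript on the forward strand spanning its full length, and the function names and the `spikein` source label are hypothetical.

```python
def read_fasta(lines):
    """Yield (sequence_id, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first whitespace-delimited token as the ID.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def spikein_gtf_lines(fasta_lines, source="spikein"):
    """Build one transcript and one exon GTF record per spike-in sequence.

    Assumption: every spike-in is a single-exon, forward-strand transcript
    covering the whole sequence, and the sequence ID doubles as gene_id
    and transcript_id.
    """
    records = []
    for name, seq in read_fasta(fasta_lines):
        attrs = f'gene_id "{name}"; transcript_id "{name}";'
        for feature in ("transcript", "exon"):
            fields = [name, source, feature, "1", str(len(seq)), ".", "+", ".", attrs]
            records.append("\t".join(fields))
    return records
```

The resulting lines would be appended to a copy of the reference GTF and the spike-in sequences to a copy of the transcript FASTA, which is exactly the reference modification this request aims to avoid.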

I'd be happy to share the HPV L1 reference FASTA and GTF files in return, so HPV genotyping by RNA-Seq could become a standard feature in bcbio as well - if that is of any interest.

roryk commented 7 years ago

Hi Sven-Eric,

I'll implement support for spitting out the TPMs over the next couple of days. Adding HPV genotyping would be awesome.

schelhorn commented 7 years ago

Great-o! However, better do something else for the next couple of days ;) Merry Christmas!

roryk commented 7 years ago

Hi Sven-Eric,

Just checked and the spike-in quantitation spits out TPM for Sailfish (I just passed in some random transcripts to check) so that should be all set. I'm swapping out Sailfish for Salmon as the default over the break though and I'll make sure this gets done properly there too.

name                length  effectiveLength  tpm  numreads  sample  id
ENSMUST00000082387  68      30.199           0.0  0         Test1   ENSMUST00000082387
ENSMUST00000082388  955     645.941          0.0  0         Test1   ENSMUST00000082388
ENSMUST00000082389  69      29.7192          0.0  0         Test1   ENSMUST00000082389
ENSMUST00000082390  1582    1272.94          0.0  0         Test1   ENSMUST00000082390
ENSMUST00000082391  75      30.3261          0.0  0         Test1   ENSMUST00000082391
ENSMUST00000082392  957     647.941          0.0  0         Test1   ENSMUST00000082392
ENSMUST00000082393  69      29.7192          0.0  0         Test1   ENSMUST00000082393
ENSMUST00000082394  71      29.0293          0.0  0         Test1   ENSMUST00000082394
ENSMUST00000082395  69      29.7192          0.0  0         Test1   ENSMUST00000082395
ENSMUST00000082396  1038    728.941          0.0  0         Test1   ENSMUST00000082396
ENSMUST00000082397  67      30.6947          0.0  0         Test1   ENSMUST00000082397
ENSMUST00000082398  69      29.7192          0.0  0         Test1   ENSMUST00000082398
ENSMUST00000082399  71      29.0293          0.0  0         Test1   ENSMUST00000082399

schelhorn commented 7 years ago

Excellent, thank you!

schelhorn commented 7 years ago

So, it seems to work, but the summarized output file contains empty tpm fields (empty strings) whenever the original salmon files being summarized have nan values in the TPM column. These fields should be 0 instead, I guess.
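
A minimal sketch of the behavior being asked for, assuming the summarization step reads tab-separated salmon output with the standard `Name` and `TPM` columns. This is not bcbio's actual code, and `clean_tpm` and `summarize` are hypothetical names used only for illustration.

```python
import csv
import io
import math


def clean_tpm(value):
    """Return a TPM as a float, mapping empty strings and NaN to 0.0."""
    text = (value or "").strip()
    if not text:
        return 0.0
    tpm = float(text)  # float() accepts "nan" in any letter case
    return 0.0 if math.isnan(tpm) else tpm


def summarize(quant_text, sample):
    """Parse one tab-separated quantification table and tag each row with its sample."""
    reader = csv.DictReader(io.StringIO(quant_text), delimiter="\t")
    return [
        {"id": row["Name"], "tpm": clean_tpm(row["TPM"]), "sample": sample}
        for row in reader
    ]
```

Whether 0 is the right substitute depends on why salmon reported nan in the first place. If it reflects a sample with no mapped reads at all, flagging the sample may be more informative than silently writing zeros.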

I used this ERCC+HPV spike-in file. The HPV part may be used by bcbio if that is of interest: it is from PaVE and contains the HPV L1 references that are commonly used for clinical genotyping (see the genotype in the FASTA ID). Genotypes 16, 18, 31, 33, 35, 39, 45, 51, 52, 56, 58, 59, 68, 73 and 82 are the oncogenic ones.