Closed schelhorn closed 7 years ago
Hi Sven-Eric,
I'll implement support for spitting out the TPMs over the next couple of days. Adding HPV genotyping would be awesome.
Great-o! However, better do something else for the next couple of days ;) Merry Christmas!
Hi Sven-Eric,
Just checked and the spike-in quantitation spits out TPM for Sailfish (I just passed in some random transcripts to check) so that should be all set. I'm swapping out Sailfish for Salmon as the default over the break though and I'll make sure this gets done properly there too.
name                length  effectiveLength  tpm  numreads  sample  id
ENSMUST00000082387  68      30.199           0.0  0         Test1   ENSMUST00000082387
ENSMUST00000082388  955     645.941          0.0  0         Test1   ENSMUST00000082388
ENSMUST00000082389  69      29.7192          0.0  0         Test1   ENSMUST00000082389
ENSMUST00000082390  1582    1272.94          0.0  0         Test1   ENSMUST00000082390
ENSMUST00000082391  75      30.3261          0.0  0         Test1   ENSMUST00000082391
ENSMUST00000082392  957     647.941          0.0  0         Test1   ENSMUST00000082392
ENSMUST00000082393  69      29.7192          0.0  0         Test1   ENSMUST00000082393
ENSMUST00000082394  71      29.0293          0.0  0         Test1   ENSMUST00000082394
ENSMUST00000082395  69      29.7192          0.0  0         Test1   ENSMUST00000082395
ENSMUST00000082396  1038    728.941          0.0  0         Test1   ENSMUST00000082396
ENSMUST00000082397  67      30.6947          0.0  0         Test1   ENSMUST00000082397
ENSMUST00000082398  69      29.7192          0.0  0         Test1   ENSMUST00000082398
ENSMUST00000082399  71      29.0293          0.0  0         Test1   ENSMUST00000082399
Excellent, thank you!
So, it seems to work, but the summarized output file contains empty tpm fields (empty strings) if the original salmon files that are summarized have nan values in the tpm column. These fields should be 0 instead, I guess.
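For illustration, here is a minimal sketch of the behaviour being asked for: treat both empty strings and nan as 0 when reading a tpm column. This is not bcbio's actual summarization code. The function names `clean_tpm` and `summarize` are hypothetical, the column names follow the header shown earlier in this thread, and the transcript IDs in the example are made up.

```python
import csv
import io
import math


def clean_tpm(value):
    """Return a float TPM, mapping empty strings and NaN to 0.0."""
    text = (value or "").strip()
    if not text:
        return 0.0
    tpm = float(text)  # float() accepts "nan" and "NaN"
    return 0.0 if math.isnan(tpm) else tpm


def summarize(handle, sample):
    """Yield cleaned rows from one tab-separated quantification table.

    Assumes (hypothetically) that the table has 'name' and 'tpm' columns.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        yield {"name": row["name"], "tpm": clean_tpm(row["tpm"]), "sample": sample}


# Made-up input: one nan value, one real value, one empty field.
example = "name\ttpm\nspikeA\tnan\nspikeB\t12.5\nspikeC\t\n"
rows = list(summarize(io.StringIO(example), "Test1"))
```

Whether 0 is the right replacement depends on why the value was nan in the first place, so a real fix might also want to log the affected transcripts.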
I used this ERCC+HPV spike-in file. The HPV part may be used by bcbio if that is of interest; it's from PaVE and contains the HPV L1 references that are commonly used for clinical genotyping (see the genotype in the FASTA ID). Genotypes 16, 18, 31, 33, 35, 39, 45, 51, 52, 56, 58, 59, 68, 73 and 82 are the oncogenic ones.
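As a rough sketch of how the genotype list above could be used downstream, the snippet below flags spike-in records whose ID carries an oncogenic genotype. The ID pattern is an assumption: the thread only says the genotype is in the FASTA ID, so the `HPV<number>` form, the helper names, and the example IDs are all hypothetical and would need adjusting to the real file.

```python
import re

# Oncogenic genotypes as listed in this thread.
ONCOGENIC_GENOTYPES = frozenset(
    {16, 18, 31, 33, 35, 39, 45, 51, 52, 56, 58, 59, 68, 73, 82}
)

# Hypothetical ID pattern: assumes the genotype appears as "HPV<number>".
_GENOTYPE_RE = re.compile(r"HPV(\d+)", re.IGNORECASE)


def genotype_from_id(fasta_id):
    """Return the genotype number parsed from a FASTA ID, or None if absent."""
    match = _GENOTYPE_RE.search(fasta_id)
    return int(match.group(1)) if match else None


def is_oncogenic(fasta_id):
    """True if the ID carries one of the oncogenic genotypes listed above."""
    return genotype_from_id(fasta_id) in ONCOGENIC_GENOTYPES
```

With these assumptions, `is_oncogenic("HPV16_L1")` is true, while a low-risk type or a non-HPV spike-in ID is reported as false.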
Hello @roryk, would it be possible to get TPMs as well as counts from the transcripts in the user-defined spike-in FASTA for both the salmon and sailfish pipelines? If I remember correctly, the implementation currently only generates counts, and/or only for salmon.
We routinely quantitate ERCC and HPV L1 from all HPV reference strains using sailfish by adding the spike-ins to the bcbio transcript reference FASTA and GTF files using a custom script, which is a total hack. It would be great to have TPM quantitation for this in the bcbio standard build as well so we could leave the reference untouched.
I'd be happy to share the HPV L1 reference FASTA and GTF files in return, so HPV genotyping by RNA-Seq could become a standard feature in bcbio as well, if that is of any interest.
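For readers unfamiliar with the difference between counts and TPM, the standard definition can be sketched as follows: each transcript's read count is divided by its effective length, and the resulting rates are rescaled to sum to one million. This is only the textbook formula, not what salmon or sailfish do internally, and `tpm_from_counts` is a hypothetical helper name. Returning zeros when nothing is expressed is a choice made for this sketch.

```python
def tpm_from_counts(counts, effective_lengths):
    """Compute TPM values from read counts and effective lengths.

    TPM_i = (c_i / l_i) / sum_j (c_j / l_j) * 1e6
    """
    # Reads per base of effective length; guard against zero-length entries.
    rates = [c / l if l > 0 else 0.0 for c, l in zip(counts, effective_lengths)]
    total = sum(rates)
    if total == 0:
        # No reads anywhere: return zeros rather than dividing by zero.
        return [0.0] * len(rates)
    return [r / total * 1e6 for r in rates]


# Two transcripts of equal effective length; the second has twice the reads.
tpms = tpm_from_counts([10, 20], [100.0, 100.0])
```

Note that a TPM computed this way is relative to whatever set of transcripts is passed in, so values from a spike-in-only quantification are not directly comparable to TPMs computed against the full transcriptome.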