BD2KGenomics / toil-rnaseq

UC Santa Cruz Computational Genomics Lab's Toil-based RNA-seq pipeline
Apache License 2.0

quantification for stranded data #177

Open yuankunzhu opened 5 years ago

yuankunzhu commented 5 years ago

--forward-prob is hard-coded to 0.5, while the documentation describes that argument as:

Probability of generating a read from the forward strand of a transcript. Set to 1 for a strand-specific protocol where all (upstream) reads are derived from the forward strand, 0 for a strand-specific protocol where all (upstream) read are derived from the reverse strand, or 0.5 for a non-strand-specific protocol. (Default: 0.5)

This should be a variable tied to the library's stranded status.

actual code line: https://github.com/BD2KGenomics/toil-rnaseq/blob/master/src/toil_rnaseq/tools/quantifiers.py#L82

jvivian commented 5 years ago

@yuankunzhu — to clarify, you'd like to be able to modify this setting?

yuankunzhu commented 5 years ago

Ultimately, this parameter should be set according to the library's stranded status: if the input data is stranded, it should be 1 or 0; if it is non-stranded, it should be 0.5.
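
A minimal sketch of that mapping, assuming the workflow gains a user-supplied strandedness setting. The names `FORWARD_PROB` and `forward_prob_args` and the labels "forward", "reverse" and "unstranded" are hypothetical and not part of toil-rnaseq; only the --forward-prob argument and its 1 / 0 / 0.5 values come from the documentation quoted above.

```python
# Hypothetical helper: the names and strandedness labels are illustrative,
# not part of toil-rnaseq.
FORWARD_PROB = {
    "forward": 1.0,     # all upstream reads derived from the forward strand
    "reverse": 0.0,     # all upstream reads derived from the reverse strand
    "unstranded": 0.5,  # non-strand-specific protocol (the documented default)
}


def forward_prob_args(strandedness="unstranded"):
    """Return the --forward-prob arguments for a given library strandedness."""
    if strandedness not in FORWARD_PROB:
        raise ValueError("unknown strandedness: %r" % (strandedness,))
    return ["--forward-prob", str(FORWARD_PROB[strandedness])]
```

The returned list could be spliced into the quantifier's parameter list in place of the hard-coded 0.5. Defaulting to "unstranded" keeps the current behaviour when nothing is specified.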

jvivian commented 5 years ago

@yuankunzhu — I see, thank you for the explanation. I'll look into how easy / fast it is to ascertain stranded status and see if I can add it to the workflow. If you have a fast tool you can recommend that'd be appreciated.

jvivian commented 5 years ago

This tool has a strand checker, but it only works on BAM input files: https://hartleys.github.io/QoRTs/

yuankunzhu commented 5 years ago

Thanks for looking into this, @jvivian. Salmon can do this kind of check too: https://salmon.readthedocs.io/en/latest/salmon.html#what-s-this-libtype

As of version 0.7.0, Salmon also has the ability to automatically infer (i.e. guess) the library type based on how the first few thousand reads map to the transcriptome. To allow Salmon to automatically infer the library type, simply provide -l A or --libType A to Salmon.
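
A rough sketch of how an inferred library type could be turned into a forward probability. It rests on two assumptions: that a Salmon run with -l A leaves a lib_format_counts.json file in its output directory with an "expected_format" field, and that Salmon's ...SF / ...SR / ...U library types correspond to forward-prob values of 1 / 0 / 0.5. Both function names are hypothetical, and the result only reflects how the reads mapped, so it is a guess rather than ground truth about the library prep.

```python
import json
import os


def forward_prob_from_libtype(libtype):
    """Map a Salmon library-type string (e.g. 'ISF', 'SR', 'IU') to a forward probability.

    Assumed correspondence: stranded-forward -> 1.0, stranded-reverse -> 0.0,
    unstranded -> 0.5.
    """
    if libtype.endswith("SF"):
        return 1.0
    if libtype.endswith("SR"):
        return 0.0
    if libtype.endswith("U"):
        return 0.5
    raise ValueError("unrecognised library type: %r" % (libtype,))


def read_salmon_libtype(quant_dir):
    """Read the library type from a Salmon output directory.

    Assumes the directory contains lib_format_counts.json with an
    'expected_format' key.
    """
    path = os.path.join(quant_dir, "lib_format_counts.json")
    with open(path) as handle:
        return json.load(handle)["expected_format"]
```

For example, `forward_prob_from_libtype(read_salmon_libtype("salmon_out"))` would give the value to pass on, where "salmon_out" is a placeholder directory name.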

hbeale commented 5 years ago

@yuankunzhu, I looked at the Salmon note too, but it can only detect what the aligner was told the data was, not whether the sequence data itself came from a stranded or unstranded library. I'm pretty sure this will have to be a parameter based on a human's knowledge of the library prep.

"Thus, for example, if the upstream aligner has been told to perform strand-aware mapping (i.e. to ignore potential alignments that don’t map in the expected manner), but the actual library is unstranded, automatic library type detection cannot detect this. It will attempt to detect the library type that is most consistent with the alignment that are provided."