COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0

Lexogen QuantSeq-Analysis -- Removing Transcript length normalisation #108

jangehlen opened this issue 7 years ago (status: Open)

jangehlen commented 7 years ago

Hello, I'm currently dealing with RNA-seq data generated with Lexogen QuantSeq, as an alternative method to microarrays (https://www.lexogen.com/quantseq-3mrna-sequencing/). Briefly, it's a kit to generate libraries of sequences close to the 3'-UTR with poly-dT primers in the first PCR. Thus, only one specific fragment is amplified per transcript, and the number of generated reads is therefore said to be independent of transcript length.

This leads to my question: is it possible to quantify QuantSeq data with Salmon without transcript-length-based normalisation, or to remove the length normalisation afterwards by some "de-normalisation" factor?

Thank you very much for your help!

Best, Jan

rob-p commented 7 years ago

Hi @jan-g1,

The length of a feature is used during inference to determine the likelihood that multimapping reads should be allocated to different targets. You're describing what is essentially a simplified model where P(f | t) (i.e., the probability of a fragment given a transcript) is independent of length(t). There's currently no option to disable length normalization completely in Salmon, and you can't "de-normalize" by simply multiplying by a factor because those weights are considered during each and every round of the EM (or VBEM) algorithm.
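Schematically, ignoring the bias and fragment-length terms that Salmon also models, the per-round EM responsibility and update are

$$
r(t \mid f) = \frac{\theta_t / \tilde{\ell}_t}{\sum_{t' \text{ compatible with } f} \theta_{t'} / \tilde{\ell}_{t'}},
\qquad
\theta_t \leftarrow \frac{1}{N} \sum_{f} r(t \mid f),
$$

where $\theta_t$ is the fraction of fragments assigned to transcript $t$ and $\tilde{\ell}_t$ is its effective length. The $\tilde{\ell}_t$ re-enter the allocation of every multimapping fragment on every round, which is why no single post-hoc factor can undo them; note that if all $\tilde{\ell}_t$ are set equal, they cancel out of $r(t \mid f)$ entirely.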

However, supporting this should actually be very straightforward. We simply assign a uniform and identical length to all transcripts for the purpose of inference. I can add such a flag in the next release, though it will initially have to be incompatible with bias correction (since it's not clear right now how the biases for which we account interact with this type of sequencing).

Also, it would be possible to run salmon with --dumpEq, and then to have a little script / tool that simply re-runs the EM, but without different length factors, using the equivalence class file. I might be able to hack something like that together on short notice if you'd be interested in testing it out.
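In the meantime, a minimal sketch of what such a length-free re-run of the EM could look like (the names here are hypothetical and the --dumpEq parsing is left out; the input is assumed to be already-parsed (transcript_ids, count) pairs):

```python
def em_no_length(eq_classes, num_transcripts, n_iter=1000):
    """Plain EM over equivalence classes with no effective-length
    weighting: every transcript is treated as having the same length,
    so multimapping reads are split purely by current abundance.

    eq_classes: list of (transcript_ids, count) pairs, e.g. parsed
    from the --dumpEq output (parsing not shown).
    """
    theta = [1.0 / num_transcripts] * num_transcripts
    total = float(sum(count for _, count in eq_classes))
    for _ in range(n_iter):
        new_theta = [0.0] * num_transcripts
        for tids, count in eq_classes:
            # responsibility uses abundance only -- no 1/eff_len factor
            denom = sum(theta[t] for t in tids)
            if denom == 0.0:
                continue
            for t in tids:
                new_theta[t] += count * theta[t] / denom
        theta = [x / total for x in new_theta]
    return theta  # estimated fraction of fragments per transcript
```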

--Rob

jangehlen commented 7 years ago

Hi @rob-p,

Thank you very much for your quick and detailed answer. I really appreciate that you will include this feature in your next release. Indeed, I'm interested in testing out your suggested script/tool!

Best, Jan

maximilianh commented 7 years ago

This may also be applicable to 10x Genomics reads...

vals commented 7 years ago

@maximilianh If you ignore the UMIs, yes.

roryk commented 7 years ago

RapMap -> dedupe UMIs -> feed the deduped alignments back into Salmon would be pretty awesome.

vals commented 7 years ago

I can see the point of deduping reads based on UMIs. But then you'd need to do it at the read level rather than at the mapping level (i.e., CIGAR-free alignment). Maybe then doing it with STAR rather than RapMap? I think I even saw someone who changed Picard Dedup at some point to do this based on UMIs.

rob-p commented 7 years ago

Hi @vals, could you expand a bit more on what distinction is important between the read level and the mapping (CIGAR-free alignment) level? Is it just that one read could yield many mappings, not to be considered as duplicates, or something else?

vals commented 7 years ago

Hi @rob-p ,

Yes, that's one aspect. But also, Salmon uses the CIGAR to evaluate alignment probability in alignment-based quantification mode, no?

And with just RapMap output you would lose other information that Salmon uses to determine likely fragment assignment?

With UMIs you can deduplicate fragments before inferring where they were likely to come from.

Ideally you would deduplicate the reads directly based on UMI; then you wouldn't have to think about PCR duplication in the quantification at all. But of course, keeping a hash of all reads in a FASTQ while accounting for sequencing errors wouldn't really be tractable...

rob-p commented 7 years ago

> Yes, that's one aspect. But also, Salmon uses the CIGAR to evaluate alignment probability in alignment-based quantification mode, no?

Indeed.

> And with just RapMap output you would lose other information that Salmon uses to determine likely fragment assignment?

You would lose information (in the form of a CIGAR string) that Salmon uses in alignment mode, but not, I think, any information that Salmon uses in quasi-mapping-based mode (though one would incur a non-trivial performance hit from filtering the quasi-mappings through a file on disk rather than dealing with them directly in memory, as Salmon normally does).

> With UMIs you can deduplicate fragments before inferring where they were likely to come from. Ideally you would deduplicate the reads directly based on UMI; then you wouldn't have to think about PCR duplication in the quantification. But of course, keeping a hash of all reads in a FASTQ while accounting for sequencing errors wouldn't really be tractable...

I guess this is the real question I have. Specifically, what is the true computational burden of detecting and eliminating duplicates using UMIs? In theory, duplicate reads must (1) map to the same location and (2) have the same UMI tag. How often would one expect the UMI tag to be modified / corrupted / etc.? Would you have to search all Hamming-distance-1 or -2 neighbors to detect duplicates reliably? Is an equivalence class a sufficient proxy for "mapping to the same location", or do we also care that, e.g., the position of the fragment within each transcript is the same? These are the main questions that are preventing me from implementing the "obvious solution".
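For scale: a k-base UMI has only 3k Hamming-distance-1 neighbors, so even a naive neighbor search is cheap per read. A hedged sketch of that check (names are made up; `seen` would be the UMI set for one equivalence class):

```python
def hamming1_neighbors(umi, alphabet="ACGT"):
    """Yield every sequence within Hamming distance 1 of `umi`
    (3 * len(umi) neighbors over a 4-letter alphabet)."""
    for i, base in enumerate(umi):
        for b in alphabet:
            if b != base:
                yield umi[:i] + b + umi[i + 1:]

def is_duplicate(umi, seen):
    """A read counts as a PCR duplicate if its UMI, or any 1-error
    neighbor of it, was already observed at the same 'location'."""
    return umi in seen or any(n in seen for n in hamming1_neighbors(umi))
```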

vals commented 7 years ago

I don't think sequencing errors in the UMIs are a big issue. I was thinking more of the reads.

I've been thinking in terms of (read, UMI) pairs and (transcript, UMI) pairs, and what the relation between them is.

Now, during mapping, an equivalence class gets a count every time a read is mapped to it, right? How feasible would it be to keep track of all UMIs which have been assigned to an equivalence class?

maximilianh commented 7 years ago

Counting (transcript, UMI) pairs is what "kallisto pseudo" with the --umi option does, right?

Yes, there are a few errors in the UMIs. The kallisto wrapper tries to correct them, but this is really very rare; not sure if it's worth the time. Something like ~100 out of 200,000 reads? I would have to check again.

vals commented 7 years ago

Does it actually do counting? I thought it stopped at (Eq.Class, UMI) counts.

maximilianh commented 7 years ago

Oh, sorry, you're right: it only does (equivalence class, UMI) counts.


rob-p commented 7 years ago

> Now, during mapping, an equivalence class gets a count every time a read is mapped to it, right? How feasible would it be to keep track of all UMIs which have been assigned to an equivalence class?

It would be almost trivial.

vals commented 7 years ago

All UMI/tag data is single-end, so only the SE model of quantification would apply (meaning no fragment-length inference). Also, positional bias and effective length should be kept constant for this data. And the GC bias correction doesn't work with SE data anyway. So I think a reasonable strategy would be to only count unique UMIs within equivalence classes and pass that to the SE Salmon model. That is, if a read has a UMI which is already present in the EC, just throw it away without counting.

Or is there something else in the single-end Salmon model I'm forgetting that would be affected?
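The bookkeeping for that strategy could be as small as the following sketch (hypothetical names; assumes the mapper emits (eq_class_id, umi) pairs with the UMI already extracted from the read):

```python
from collections import defaultdict

def dedup_counts(mapped_reads):
    """mapped_reads: iterable of (eq_class_id, umi) pairs.
    Returns the number of distinct UMIs per equivalence class; a read
    whose UMI is already present in its EC is discarded, uncounted."""
    umis_per_ec = defaultdict(set)
    for ec, umi in mapped_reads:
        umis_per_ec[ec].add(umi)
    return {ec: len(umis) for ec, umis in umis_per_ec.items()}
```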

roryk commented 7 years ago

+1 for the deduping on EC plan.

maximilianh commented 7 years ago

+1, especially for throwing away reads with a UMI that has been seen before.

rob-p commented 7 years ago

OK, so it seems like adding the UMI <-> eq-class counting should be pretty straightforward. @vals, when you say the effective length should be kept constant, do you mean we shouldn't adjust for bias, or that we don't expect a transcript-length effect at all? Also, is there a basic primer I can read on the details of the data format (e.g., where in each read I should look for the tag, how long the tag is expected to be, etc.)? I'll be relying on you guys to test things out and point me at relevant datasets as I go forward with implementing this.

roryk commented 7 years ago

Hi Rob,

All of the different single-cell library-prep methods put the cellular barcodes and the UMI in different places; what we've been doing is sticking that information in the read name so we have access to it post-alignment. Valentine has this awesome repository, https://github.com/vals/umis, that does that part via fastqtransform, and it has example regexes for a bunch of the commonly used chemistries:

https://github.com/vals/umis/tree/master/examples

It might be easiest, as a first go, to support the format that Vals's tool spits out and assume the user has stuck that information in the read name, rather than re-implement that stuff.

vals commented 7 years ago

Hey Rob,

I mean there shouldn't be an effect from transcript length. I'm sure there will be other biases, but which other biases can you account for in single-end data?

Yeah, unfortunately I think my list of examples might be the best resource for the common file formats. I realize the regular expressions are not extremely parsable. I might make a cleaner visualization of the read layouts (with references) if I feel like playing with Illustrator for a while, but there's a lot going on right now.

rob-p commented 7 years ago

Hi Valentine,

Thanks, I'm taking a look through your umis repo that Rory pointed me at. The fact that there is no length effect (i.e., that fragments are drawn from transcripts independently of transcript length) makes the model a little different from even the single-end abundance model. There, you still expect longer transcripts to generate more fragments a priori; to ignore transcript length completely, you essentially want to treat every transcript as if it has the same effective length. That's not hard to do in the existing model; it just requires setting the length differently in a few places.

Edit: It also looks like, in some of those examples, the transformations are being done on _1, _2 file pairs. Does this have a different meaning in UMI data, or is something else going on here?

vals commented 7 years ago

Hey Rob,

Regarding file pairs: in some cases, e.g., the FW read contains the mRNA tag and a cell barcode, while the RV read contains the UMI and some non-informative sequence. In particular, in the 10X Chromium (V1) protocol there are 4 files! FW and RV, as well as the I5 and I7 index reads. I don't remember exactly which read contains what information right now, though.

roryk commented 7 years ago

The Klein group has a new inDrop protocol with the 4-file format as well, which should be cropping up in publications sometime soon. Not sure if it is nicked from the 10x folks, but it is:

{
    "read1": "(?P<name>[^\\s]+).*\\n(?P<seq>.*)\\n\\+(.*)\\n(?P<qual>.*)\\n",
    "read2": "(.*)\\n(?P<CB1>.*)\\n(.*)\\n(.*)\\n",
    "read3": "(.*)\\n(?P<SB>.*)\\n(.*)\\n(.*)\\n",
    "read4": "(.*)\\n(?P<CB2>.{8})(?P<MB>.{6})\\n(.*)\\n(.*)\\n"
}

roryk commented 7 years ago

SB = sample barcode, CB1 = first part of the cellular barcode, CB2 = second part of the cellular barcode, MB = molecular barcode, seq = biological sequence
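Read in lock-step across the four files, a transform along these lines (a rough sketch modeled on what umis' fastqtransform does; the exact output name layout here is invented) would fold everything into the read name:

```python
import re

# e.g. patterns = [re.compile(p) for p in (read1, read2, read3, read4)]
# with the four regexes from the JSON config above
def transform(records, patterns):
    """records: the current 4-line FASTQ record from each of the four
    files, each as one string with trailing newline; patterns: the four
    compiled regexes from the JSON config above, in file order.
    Returns one output record whose name carries cell barcode + UMI."""
    groups = {}
    for rec, pat in zip(records, patterns):
        m = pat.match(rec)
        assert m is not None, "unexpected record layout"
        groups.update(m.groupdict())
    cell = groups["CB1"] + groups["CB2"]
    name = "%s:CELL_%s:UMI_%s" % (groups["name"].lstrip("@"), cell, groups["MB"])
    return "@%s\n%s\n+\n%s\n" % (name, groups["seq"], groups["qual"])
```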

rob-p commented 7 years ago

Cool, thanks for clarifying that; it makes sense. So now my question becomes: when we have a stream of reads with all of this information encoded, what should the output look like? That is, if I have a stream of reads where each read has a cell barcode and a UMI (within that cell barcode), then, presumably, the output estimates we want are per-cell (i.e., the estimated number of UMI-deduplicated reads coming from each transcript within each cell). Is this correct?

roryk commented 7 years ago

Yup, exactly right.
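In sketch form, the earlier per-EC UMI sets would just gain a cell dimension (again hypothetical names, assuming the barcodes are parsed out upstream):

```python
from collections import defaultdict

def per_cell_dedup_counts(mapped_reads):
    """mapped_reads: iterable of (cell_barcode, eq_class_id, umi).
    Returns {cell: {eq_class: distinct-UMI count}}: one UMI-deduplicated
    count vector per cell, ready for per-cell abundance estimation."""
    seen = defaultdict(lambda: defaultdict(set))
    for cell, ec, umi in mapped_reads:
        seen[cell][ec].add(umi)
    return {cell: {ec: len(umis) for ec, umis in ecs.items()}
            for cell, ecs in seen.items()}
```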

rob-p commented 7 years ago

OK, so 0.8.0 contains the option to turn off all length correction, but I've pushed back the milestone for fully supporting the barcoding. This is because there's already a lot of new stuff in 0.8.0 and I needed to cut a release to coincide with the paper. Just dropping in here to explain the modified milestone and to re-iterate my interest in and commitment to this feature ;P.

roryk commented 7 years ago

Thanks Rob, being able to quantitate UMI-tagged data in a non-crappy way would be awesome. Right now we are doing something that is featureCounts-like for the most part.

mjafin commented 7 years ago

Just chiming in: we have bulk RNA-seq data using a NuGEN protocol that has the UMI (for paired-end data) in the index read. This is very common in DNA-seq circles, so it's not a one-off protocol. It would be nice if UMIs could be used for distinguishing duplicates in PE reads.

jsialar commented 6 years ago

Thanks, Rob, for including the option to turn off length correction. Can I say that, with this option turned on, a multi-mapping read will get divided equally among all the targets?