Open jangehlen opened 7 years ago
Hi @jan-g1,
The length of a feature is used during inference to determine the likelihood that multimapping reads should be allocated to different targets. You're describing what is essentially a simplified model where P(f | t) (i.e., the probability of a fragment given a transcript) is independent of length(t). There's currently no option to disable length normalization completely in Salmon, and you can't "de-normalize" by simply multiplying by a factor because those weights are considered during each and every round of the EM (or VBEM) algorithm.
However, supporting this should actually be very straight-forward. We simply assign a uniform and identical length to all transcripts for the purpose of inference. I can add such a flag in the next release, though it will initially have to be incompatible with bias correction (since it's not clear right now how the biases for which we account interact with this type of sequencing).
Also, it would be possible to run salmon with `--dumpEq`, and then to have a little script / tool that simply re-runs the EM, but without different length factors, using the equivalence class file. I might be able to hack something like that together on short notice if you'd be interested in testing it out.
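To make the length-factor point concrete, here is a toy sketch of one such EM re-run over equivalence classes. The data layout and function names are made up for illustration (this is not Salmon's eq-class file format), but it shows where effective length enters each round, and why assigning every transcript the same length removes the effect:

```python
# Toy EM over equivalence classes, illustrating where effective length
# enters the model. Data layout is hypothetical, not Salmon's format.

def em(eq_classes, eff_lens, n_iters=100):
    """eq_classes: list of (transcript_ids, read_count); eff_lens: tid -> length."""
    tids = sorted(eff_lens)
    abund = {t: 1.0 / len(tids) for t in tids}  # uniform starting abundances
    for _ in range(n_iters):
        counts = {t: 0.0 for t in tids}
        for members, n in eq_classes:
            # Each transcript is weighted by abundance / effective length:
            # this is the per-round length factor discussed above.
            w = {t: abund[t] / eff_lens[t] for t in members}
            z = sum(w.values())
            for t in members:
                counts[t] += n * w[t] / z
        total = sum(counts.values())
        abund = {t: c / total for t, c in counts.items()}
    return abund

# 50 reads unique to t1, 50 unique to t2, 100 shared between them.
eqs = [(("t1",), 50), (("t2",), 50), (("t1", "t2"), 100)]
# With unequal lengths the shorter transcript takes a larger share of the
# shared reads; with uniform lengths the split depends on abundance alone.
print(em(eqs, {"t1": 1000.0, "t2": 500.0}))
print(em(eqs, {"t1": 1.0, "t2": 1.0}))
```

The second call is effectively the "no length correction" mode: the length terms cancel in the per-class normalization, so only abundances drive the allocation.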
--Rob
Hi @rob-p ,
thank you very much for your quick and detailed answer. I really appreciate that you will include this feature in your next release. Indeed, I'm interested in testing out your suggested script/tool !
Best, Jan
This may also be applicable to 10x Genomics reads...
@maximilianh If you ignore the UMIs, yes.
rapmap -> dedupe umis -> feed deduped alignments back into Salmon would be pretty awesome.
I can see the point of deduping reads based on UMIs. But then you'd need to do it at the read level rather than the mapping level (i.e., CIGAR-free alignment). Maybe then doing it with STAR rather than RapMap? I think I even saw someone who changed Picard Dedup at some point to do this based on UMIs.
Hi @vals, could you expand a bit more on what distinction is important between the read level and the mapping (CIGAR-free alignment) level? Is it just that one read could yield many mappings, not to be considered as dups, or something else?
Hi @rob-p ,
Yes that's one aspect. But also, Salmon uses CIGAR to evaluate alignment probability in alignment quantification mode no?
And with just RapMap output you would lose other information that Salmon uses to determine likely fragment assignment?
With UMI's you can deduplicate fragments before inferring where they were likely to come from.
Ideally you would deduplicate the reads directly based on UMI, then you wouldn't have to think about PCR duplication in the quantification. But of course keeping a hash of all reads in a FASTQ and accounting for sequencing errors wouldn't really be tractable...
> Yes that's one aspect. But also, Salmon uses CIGAR to evaluate alignment probability in alignment quantification mode, no?

Indeed.

> And with just RapMap output you would lose other information that Salmon uses to determine likely fragment assignment?

You would lose information (in the form of a CIGAR string) that Salmon uses in alignment mode, but not any information, I think, that Salmon uses in quasi-mapping-based mode (though one would incur a non-trivial performance hit from filtering the quasi-mappings through file / disk rather than dealing with them directly in memory as Salmon normally does).

> With UMIs you can deduplicate fragments before inferring where they were likely to come from. Ideally you would deduplicate the reads directly based on UMI, then you wouldn't have to think about PCR duplication in the quantification. But of course keeping a hash of all reads in a FASTQ and accounting for sequencing errors wouldn't really be tractable...
I guess this is the real question I have. Specifically, what is the true computational burden to detect and eliminate duplicates using UMIs? In theory, the reads must (1) map to the same location and (2) have the same UMI tag. How often would one expect the UMI tag to be modified / corrupted / etc.? Would you have to search all Hamming distance 1 or 2 neighbors to detect duplicates reliably? Is an equivalence class a sufficient proxy for "mapping to the same location", or do we also care that e.g. the position of the fragment within each transcript is a duplicate as well? These are the main questions that are preventing me from implementing the "obvious solution".
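For what it's worth, the neighbor-search part could be sketched as a greedy collapse within one equivalence class. The merging rule below (absorb any UMI within Hamming distance 1 of a more abundant, already-kept UMI) is a simplified illustration, not a proposal for the actual implementation:

```python
# Sketch: collapse UMIs within one equivalence class, merging any UMI
# that is one substitution away from a more abundant, already-kept UMI.
# Greedy heuristic for illustration only; names/thresholds are made up.
from collections import Counter

def hamming1_neighbors(umi, alphabet="ACGT"):
    """Yield every sequence exactly one substitution away from `umi`."""
    for i, base in enumerate(umi):
        for b in alphabet:
            if b != base:
                yield umi[:i] + b + umi[i + 1:]

def collapse_umis(umi_counts):
    """umi_counts: Counter of UMI -> read count for one equivalence class.
    Returns the set of UMIs kept as distinct molecules."""
    kept = set()
    # Visit UMIs from most to least abundant; low-count near-duplicates
    # of an already-kept UMI are treated as sequencing errors and dropped.
    for umi in sorted(umi_counts, key=umi_counts.get, reverse=True):
        if any(n in kept for n in hamming1_neighbors(umi)):
            continue
        kept.add(umi)
    return kept

counts = Counter({"ACGT": 120, "ACGA": 2, "TTTT": 40})
print(sorted(collapse_umis(counts)))  # ACGA is absorbed into ACGT
```

For a UMI of length k this searches 3k neighbors per UMI, so the cost is linear in the number of distinct UMIs per class, not quadratic.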
I don't think sequencing errors in the UMIs are a big issue. I was thinking more of the reads.
I've been thinking in terms of (read, UMI) pairs and (transcript, UMI) pairs, and what the relation between them is.
Now, during mapping, an equivalence class gets a count every time a read is mapped to it, right? How feasible would it be to keep track of all UMIs which have been assigned to an equivalence class?
Counting (transcript, UMIs) is what "kallisto pseudo" with the --umi option does, right?
Yes, there are a few errors in the UMIs. The Kallisto wrapper tries to correct them. But this is really very rare. Not sure if it's worth the time. Like ~100 out of 200,000 reads? I would have to check again.
Does it actually do counting? I thought it stopped at (Eq.Class, UMI) counts.
Oh sorry, you're right: it only does (equivalence class, UMI) counts.
> Now, during mapping, an equivalence class gets a count every time a read is mapped to it, right? How feasible would it be to keep track of all UMIs which have been assigned to an equivalence class?

It would be almost trivial.
All UMI/tag data is single-ended, so only the SE model of quantification would apply (meaning no fragment length inference). Also, positional bias and effective length should be kept constant in this data. And the GC bias doesn't work with SE anyway. So I think a reasonable strategy would be to only count unique UMIs in equivalence classes and pass that to the SE Salmon model. That is, if a read has a UMI which is already present in the EC, just throw it away without counting.
Or is there something else in the single-end Salmon model I'm forgetting that would be affected?
+1 for the deduping on EC plan.
+1, especially for throwing away reads with a UMI that has been seen before.
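The "throw it away if the UMI is already in the EC" strategy above amounts to keeping a set per equivalence class. A minimal sketch, with illustrative data shapes (not Salmon internals):

```python
# Keep one read per (equivalence class, UMI) pair by streaming reads and
# discarding any whose UMI was already seen in that class.
# Data shapes are illustrative only.

def dedup_counts(assignments):
    """assignments: iterable of (eq_class_id, umi), one per mapped read.
    Returns eq_class_id -> number of distinct UMIs (the deduped count)."""
    seen = {}  # eq_class_id -> set of UMIs already counted
    for ec, umi in assignments:
        seen.setdefault(ec, set()).add(umi)  # duplicates vanish in the set
    return {ec: len(umis) for ec, umis in seen.items()}

reads = [(0, "AAAA"), (0, "AAAA"), (0, "CCCC"), (1, "AAAA")]
print(dedup_counts(reads))  # {0: 2, 1: 1}
```

The resulting per-class distinct-UMI counts would then feed the SE inference in place of raw read counts.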
OK, so it seems like adding the UMI <-> eq class counting should be pretty straightforward. @vals, so when you say effective length should be kept constant, you mean we shouldn't adjust for bias, or that we don't expect a transcript length effect at all? Also, is there a basic primer I can read up on regarding the details of the data format (e.g. where in each read I should look for the tag, how long the tag is expected to be, etc.)? I'll be relying on you guys to test stuff out and point me at relevant data sets as I go forward with implementing this.
Hi Rob,
All of the different single-cell library prep methods put the cellular barcodes and UMI in different places; what we've been doing is sticking that information in the read name so we have access to it post alignment. Valentine has this awesome repository: https://github.com/vals/umis that does that part via fastqtransform, and has example regexes for a bunch of the different commonly used chemistries:
https://github.com/vals/umis/tree/master/examples
It might be easiest to support the format that Vals is spitting out and assume the user has stuck that information in the read name as a first go rather than re-implement that stuff.
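A sketch of what reading that information back out could look like, assuming a hypothetical read-name layout. The suffix format below is made up for illustration (it is not the actual umis/fastqtransform convention); the regex would need to match whatever the transform actually emits:

```python
# Pull a cell barcode and UMI back out of a FASTQ header, assuming they
# were stashed in the read name at transform time. The ":CELL_xx:UMI_yy"
# suffix used here is a hypothetical layout, not the umis convention.
import re

HEADER_RE = re.compile(
    r"^@(?P<name>\S+):CELL_(?P<cell>[ACGT]+):UMI_(?P<umi>[ACGT]+)"
)

def parse_header(header):
    m = HEADER_RE.match(header)
    if m is None:
        raise ValueError("header does not carry barcode/UMI: " + header)
    return m.group("name"), m.group("cell"), m.group("umi")

print(parse_header("@read42:CELL_ACGTAC:UMI_TTTGGA"))
```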
Hey Rob,
I mean there shouldn't be an effect from transcript length. I'm sure there will be other biases, but which other biases can you account for in single-end data?
Yeah unfortunately I think my list of examples might be the best resource for the common file formats. I realize the regular expressions are not extremely parsable. I might make a cleaner visualization of the read layouts (with references) if I feel like playing with Illustrator for a while, but it's a lot going on now.
Hi Valentine,
Thanks, I'm taking a look through your umis repo that Rory pointed me at. The fact that there is no length effect (i.e., that fragments are drawn from transcripts independent of transcript length) makes the model a little bit different from even the single-end abundance model. There, you still expect longer transcripts to generate more fragments _a priori_; but to ignore transcript length completely, you essentially want to treat every transcript as if it has the same effective length. That's not hard to do in the existing model; it just requires setting the length differently in a few places.
Edit: It also looks like, in some of those examples, the transformations are being done on `_1`, `_2` file pairs. Does this have a different meaning in UMI data, or is something else going on here?
Hey Rob,
Regarding file pairs, in some cases e.g. the FW read contains the mRNA Tag and a Cell Barcode, while the RV read contains the UMI and some non-informative sequence. In particular in the 10X Chromium (V1) protocol there are 4 files! FW, RV as well as I5 and I7 index reads. I don't remember exactly which read contains what information right now though.
The Klein group has a new inDrop protocol that has the 4 file format as well that should be cropping up in publications sometime soon. Not sure if it is nicked from the 10x folks but it is:
{
"read1": "(?P<name>[^\\s]+).*\\n(?P<seq>.*)\\n\\+(.*)\\n(?P<qual>.*)\\n",
"read2": "(.*)\\n(?P<CB1>.*)\\n(.*)\\n(.*)\\n",
"read3": "(.*)\\n(?P<SB>.*)\\n(.*)\\n(.*)\\n",
"read4": "(.*)\\n(?P<CB2>.{8})(?P<MB>.{6})\\n(.*)\\n(.*)\\n"
}
SB = sample barcode, CB1 = first part of the cellular barcode, CB2 = second part of the cellular barcode, MB = molecular barcode, seq = biological sequence
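For anyone following along, here is the `read4` pattern above applied to a single 4-line FASTQ record with Python's `re` module (the record itself is a made-up example):

```python
# Apply the inDrop "read4" pattern from the JSON above to one 4-line
# FASTQ record, extracting the second half of the cellular barcode (CB2)
# and the molecular barcode (MB). The record is a made-up example.
import re

read4_re = re.compile(r"(.*)\n(?P<CB2>.{8})(?P<MB>.{6})\n(.*)\n(.*)\n")

record = "@r1\nAACCGGTTACGTAC\n+\nFFFFFFFFFFFFFF\n"
m = read4_re.match(record)
print(m.group("CB2"), m.group("MB"))  # AACCGGTT ACGTAC
```

Note the pattern is positional: it requires the sequence line to be exactly 8 + 6 bases, which is why each chemistry needs its own regex.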
Cool, thanks for clarifying that; it makes sense. So now my question becomes, when we have a stream of reads with all of this information encoded, what should the output look like? That is, if I have a stream of reads where each read has a cell barcode and a umi (with that cell barcode), then, presumably, the output estimates we want are per-cell (i.e. the estimate of the number of UMI de-duplicated reads coming from each transcript within each cell). Is this correct?
Yup, exactly right.
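So the end product would be something like a (cell, transcript) table of distinct-UMI counts. A minimal sketch, with illustrative names and shapes:

```python
# Per-cell output: cell -> transcript -> UMI-deduplicated count.
# All names and data shapes are illustrative.
from collections import defaultdict

def per_cell_counts(reads):
    """reads: iterable of (cell_barcode, transcript_id, umi) per read.
    Returns cell -> {transcript_id: number of distinct UMIs}."""
    seen = defaultdict(set)  # (cell, transcript) -> UMIs observed
    for cell, tid, umi in reads:
        seen[(cell, tid)].add(umi)
    table = defaultdict(dict)
    for (cell, tid), umis in seen.items():
        table[cell][tid] = len(umis)
    return table

demo = [("c1", "t1", "AAAA"), ("c1", "t1", "AAAA"),
        ("c1", "t1", "CCCC"), ("c2", "t1", "AAAA")]
print({cell: dict(row.items()) for cell, row in per_cell_counts(demo).items()})
```

In practice the dedup would happen per (cell, equivalence class) before inference, and the per-cell abundance estimates would form the columns of the output matrix.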
Ok so 0.8.0 contains the option to turn off all length correction, but I've pushed back the milestone for fully supporting the barcoding. This is because there's already a lot of new stuff in 0.8.0 and I needed to cut a release to coincide with the paper. Just dropping in here to explain the modified milestone and re-iterate my interest in and commitment to this feature ;P.
Thanks Rob, being able to quantitate UMI tagged data in a non-crappy way would be awesome. Right now we are doing something that is FeatureCounts-like for the most part.
Just chiming in, we have bulk RNA-seq data using a NuGen protocol that has the UMI (for paired end data) in the index read. This is very common in the DNA-seq circles so not a one-off protocol. Would be nice if UMIs could be used for PE reads in distinguishing duplicates.
Thanks Rob, for including the option to turn off length correction. Can I say that with this option turned on, a multi-mapping read will get divided equally among all the targets?
Hello, I'm currently dealing with RNA-seq data generated with Lexogen QuantSeq, as an alternative method to microarrays (https://www.lexogen.com/quantseq-3mrna-sequencing/). Briefly, it's a kit to generate libraries of sequences close to the 3'-UTR with poly-dT primers in the first PCR. Thus, only one specific fragment is amplified per transcript. Therefore, the number of generated reads is said to be independent of the transcript length.
This leads to my question: is it possible to quantify QuantSeq data with Salmon without a transcript-length-based normalisation, or to remove the length normalisation afterwards by some "de-normalisation" factor?
Thank you very much for your help !
Best, Jan