Open pcm32 opened 5 years ago
Thanks @pcm32 and @pinin4fjords for checking. To give more context on why this is the case: in droplet-based single-cell protocols, cellular barcoding happens before Illumina sequencing. The library is pooled and then separated in silico using the assigned barcodes. However, during sequencing, at least as I understand it, single-cell protocols don't keep each cellular barcode on its own lane; in theory, the reads from a single cellular barcode can be spread uniformly (in expectation) across all the lanes used.
I think this is a single-cell-protocol issue rather than an alevin-specific one, since UMI deduplication is a critical step and it needs to see all the reads. Imagine a UMI (A) with 8 PCR duplicates of 1 molecule, sequenced across 2 lanes (L1 and L2). In expectation, each lane gets 4 PCR duplicates. If you deduplicate the two lanes separately and then add up the results, you end up predicting 2 molecules when in truth there was only one; you would only deduplicate correctly by processing the full data (both lanes) together.
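The example above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the read tuples, the names `CB1` and `geneX`, and the exact-match deduplication are made up for the sketch, and it is not how alevin resolves UMIs internally.

```python
from collections import defaultdict

# Hypothetical reads as (lane, cell_barcode, umi, gene).
# All 8 reads are PCR duplicates of ONE molecule, split evenly over 2 lanes.
reads = [("L1", "CB1", "A", "geneX")] * 4 + [("L2", "CB1", "A", "geneX")] * 4


def count_molecules(reads):
    """Naive dedup: one molecule per distinct (cell barcode, gene, UMI)."""
    return len({(cb, gene, umi) for _lane, cb, umi, gene in reads})


def count_per_lane_then_sum(reads):
    """Deduplicate each lane on its own, then add up the per-lane counts."""
    by_lane = defaultdict(list)
    for read in reads:
        by_lane[read[0]].append(read)
    return sum(count_molecules(lane_reads) for lane_reads in by_lane.values())


joint = count_molecules(reads)             # both lanes processed together
per_lane = count_per_lane_then_sum(reads)  # lanes processed separately, then merged

print("joint:", joint)
print("per-lane then summed:", per_lane)
```

Processing the lanes together gives 1 molecule, while deduplicating each lane and summing gives 2, which is the overcount described above.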
I assume that's what led to @pinin4fjords's quant merge search. In theory, yes, one should be able to merge the data from two independently processed cells, but you would need the UMI-network-level information rather than gene counts, since the UMI-level network is already lost by the time the gene counts have been generated. Unfortunately, we have no command for that yet.
Hope this helps, and feel free to ask any other questions you may have.
According to https://github.com/COMBINE-lab/salmon/issues/434#issuecomment-540533728, the current behaviour of the Alevin wrapper, in which each fastq file provided in a collection is analysed separately and the results are then merged, is incorrect; all lanes should be processed jointly and quantmerge shouldn't be used.
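As a rough sketch of what joint processing could look like, the helper below assembles a single `salmon alevin` call that receives the files from every lane at once. The function name `build_alevin_command`, the file names, and the choice of `--chromium` with `-l ISR` are assumptions for illustration only; the wrapper's real parameters and the protocol flag would need to match the actual data.

```python
def build_alevin_command(r1_files, r2_files, index_dir, tgmap, out_dir, threads=4):
    """Build one salmon alevin command covering all lanes (hypothetical helper).

    r1_files hold the barcode + UMI reads and r2_files the cDNA reads,
    listed in the same lane order.
    """
    if len(r1_files) != len(r2_files):
        raise ValueError("each lane needs both an R1 and an R2 file")
    return [
        "salmon", "alevin",
        "-l", "ISR",        # library type; assumed here
        "-1", *r1_files,    # all lanes' barcode/UMI files in one call
        "-2", *r2_files,    # all lanes' cDNA files, same order
        "--chromium",       # protocol flag; assumed, depends on the chemistry
        "-i", index_dir,
        "-p", str(threads),
        "-o", out_dir,
        "--tgMap", tgmap,
    ]


# Hypothetical file names for a two-lane run.
cmd = build_alevin_command(
    ["s_L001_R1.fastq.gz", "s_L002_R1.fastq.gz"],
    ["s_L001_R2.fastq.gz", "s_L002_R2.fastq.gz"],
    index_dir="salmon_index",
    tgmap="tx2gene.tsv",
    out_dir="alevin_out",
)
print(" ".join(cmd))
```

The point of the design is that there is one alevin run per sample rather than one per fastq pair, so no downstream merge of gene counts is needed.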