Alevin: Adding support for CEL-Seq

PeteHaitch commented 6 years ago

I've just started working in a single-cell genomics core facility, and alevin looks really useful for our 10X runs! But I also have a substantial number of samples processed with the CEL-Seq and CEL-Seq2 protocols (also 3'-tag protocols). I'm interested in adding support for these protocols to Alevin.

Is that something you'd be interested in incorporating?
Any hints on where to start messing around in the code to implement this would be much appreciated!

Cheers, Pete

PeteHaitch commented 6 years ago

Ah, I just saw https://github.com/COMBINE-lab/salmon/issues/247. I'll experiment with this first.

k3yavi commented 6 years ago

Hi @PeteHaitch, thanks for your interest in Alevin. So far we have concentrated mainly on droplet-based 3'-tagged single-cell protocols, especially 10x, but we are very much interested in extending Alevin to other protocols like CEL-seq. However, there are a couple of challenges and differences to consider before incorporating it into the Alevin pipeline. First, Alevin currently relies on the fact that droplet-based protocols use PCR amplification of the library, and its UMI deduplication phase assumes an exponential model; I am not sure how well this holds for CEL-seq. Second, CEL-seq is a Fluidigm-based system, while Alevin is currently applied to microfluidics-based systems. In general, we have observed that the 10x cell isolation step is pretty robust in reporting the cellular barcodes (CBs); although we have a probabilistic model to handle CB uncertainty, such ambiguous cases are rare (this is not true for Drop-Seq). Having said that, we might have to do some analysis to figure out the right barcode-correction model for Fluidigm-based systems.

Also, please do let us know about your experience with the solution proposed in #247. Looking forward to hearing back from you.

PeteHaitch commented 6 years ago

Toying around with the solution in #247, I think I've found why it's not yet working for me. The CEL-Seq2 read1 is UMI + CB, whereas 10X is CB + UMI (to my understanding). Is there an easy way to tell alevin the order is reversed, or will I need to hack around some?
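To make the ordering difference concrete, here is a minimal Python sketch of splitting read 1 under either layout. It is only an illustration, not Alevin's actual C++ code: the function name `split_read1`, its parameters, and the 6 bp lengths in the usage example are all assumptions chosen for the example.

```python
from typing import Tuple


def split_read1(seq: str, cb_len: int, umi_len: int, umi_first: bool) -> Tuple[str, str]:
    """Return (cell_barcode, umi) taken from the start of read 1.

    Hypothetical helper for illustration only.
    umi_first=True  models a UMI + CB layout (as described for CEL-Seq2).
    umi_first=False models a CB + UMI layout (as described for 10X).
    """
    needed = cb_len + umi_len
    if len(seq) < needed:
        raise ValueError(f"read 1 has {len(seq)} bases, need at least {needed}")
    if umi_first:
        umi = seq[:umi_len]
        cb = seq[umi_len:needed]
    else:
        cb = seq[:cb_len]
        umi = seq[cb_len:needed]
    return cb, umi


# The same read yields swapped fields depending on the assumed layout.
read1 = "AAAAAACCCCCCTTTT"
print(split_read1(read1, cb_len=6, umi_len=6, umi_first=True))   # UMI + CB layout
print(split_read1(read1, cb_len=6, umi_len=6, umi_first=False))  # CB + UMI layout
```

Parsing a UMI + CB read with the CB + UMI layout silently swaps the two fields, which would be consistent with the #247 workaround not working out of the box.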

k3yavi commented 6 years ago

Hey @PeteHaitch, I think there is currently no direct way to tell Alevin that the CB and UMI are in reverse order, so you might have to hack a bit, although it should not be too hard. Specifically, the extractBarcodes and extractUMI functions here have to be updated with a new generic type (celseq, maybe). Let us know if it works out for you; otherwise I can take a look at this sometime next week.

k3yavi commented 6 years ago

Hi @PeteHaitch, I have just pushed a potentially testable version of Alevin for cel-seq2 (activated by the --celseq command line flag), although to use it the develop branch has to be compiled from source. A couple of points to note:

I assumed the length of both the CB and the UMI to be 6, as in the original cel-seq2 paper.
The deduplication algorithm is still the same as the default; nothing has been changed in that part.

Please let us know how it works out for you, and whether it's at all useful / comparable to the output generated by the traditional cel-seq2 pipeline.

PeteHaitch commented 6 years ago

Oh, I should've pushed my PR sooner! Thanks! I'll take a look at how it compares to what I did. One thing to note is that it'd be useful to be able to specify the length of the CB: we use 8 bp in our slightly-adapted CEL-Seq2 protocol.

rob-p commented 6 years ago

Hi @PeteHaitch! I agree with @PeteHaitch here --- I think we should provide an easy way to specify custom cb & umi parameters paired with a particular protocol. For 10x v2, since it's a very standard commercial protocol, I think simply having a --chromium flag is probably OK. But we should make it easy for ppl to tweak their CB & UMI lengths.

k3yavi commented 6 years ago

@PeteHaitch Thanks for making the pull request and correcting the barcode length for the celseq2 protocol. We'll review it soon and merge it into develop (which will be merged into master in the next release).

@rob-p I think we already have the capability of specifying the CB and UMI lengths; it's just that CelSeq2 differs a little in their order. Basically, flags like --chromium (or those for any other protocol) are wrappers around the standard CB and UMI lengths. If one wants customization, one can always use --umiLength and --barcodeLength. I am thinking of tweaking the --end part of the struct to select the order of the CB and UMI, which in the case of CelSeq2 is reversed.
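The "protocol flag as a wrapper around default lengths" idea can be sketched in a few lines of Python. This is an illustrative model only, not how Alevin stores its protocol settings: the `Geometry` class, the `PRESETS` table, and the `resolve` function are hypothetical names. The 16 bp CB / 10 bp UMI values are the usual 10x v2 layout, and the 6/6 CEL-Seq2 values are the initial assumption mentioned earlier in this thread.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Geometry:
    """Hypothetical description of where the CB and UMI sit in read 1."""
    barcode_length: int
    umi_length: int
    umi_first: bool  # True when the UMI precedes the CB


# Protocol presets: each named protocol is just a bundle of defaults.
PRESETS = {
    "chromium": Geometry(barcode_length=16, umi_length=10, umi_first=False),
    "celseq2": Geometry(barcode_length=6, umi_length=6, umi_first=True),
}


def resolve(protocol: str,
            barcode_length: Optional[int] = None,
            umi_length: Optional[int] = None) -> Geometry:
    """Start from the protocol preset, then apply any explicit overrides."""
    geom = PRESETS[protocol]
    if barcode_length is not None:
        geom = replace(geom, barcode_length=barcode_length)
    if umi_length is not None:
        geom = replace(geom, umi_length=umi_length)
    return geom


# A lab using an 8 bp CB with an otherwise standard CEL-Seq2 layout:
print(resolve("celseq2", barcode_length=8))
```

Keeping the order (`umi_first`) in the preset and the lengths overridable mirrors the split discussed above: the order is a property of the protocol, while the lengths are what labs tend to adapt.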

k3yavi commented 6 years ago

The latest commit https://github.com/COMBINE-lab/salmon/commit/093b5a98e16cab7c3934c0a7c222549644c39728 will generalize write_fastq for all the protocols. @PeteHaitch Thanks again for making the pull request. Do let us know how the quantified matrix looks in the end for the Cel-Seq2 protocol, or what more we can do in Alevin to help improve the results.

Closing this issue for now, but feel free to reopen it if you have any other problems.

PeteHaitch commented 6 years ago

Thanks, @k3yavi! I'll be sure to share my experience and any comparisons I perform.

COMBINE-lab / salmon

Alevin: Adding support for CEL-Seq #269