Bustools with sci-rna-seq 3

davemcg commented 3 years ago

https://teichlab.github.io/scg_lib_structs/methods_html/sci-RNA-seq3.html

I must admit I'm fairly obtuse about using kallisto bustools' bc:umi:seq notation for "off label" uses.

Could someone give me some pointers on how I could use kallisto to process data from this tech? It was used most notably here: https://www.nature.com/articles/s41586-019-0969-x

agalvezm commented 3 years ago

Hello,

Yes, it is possible to use kallisto to process sci-RNAseq data. What makes this technology a bit harder to implement is the variable barcode length (with some cells having a 19bp barcode, and others a 20bp one). We have an internal unreleased kallisto version that deals with this problem, but in the meantime, there is a solution which works exactly the same:

1) Pre-process the reads: The UMI and barcodes are in the R1 fastq file. By using the sequence of the constant linker between the two parts of the barcode, we can detect whether we are dealing with a 19bp or a 20bp barcode. We can then modify the 19bp barcodes by adding a nucleotide, so that every barcode is 20bp long. You can achieve this with the following command:

zcat R1.fastq.gz | sed -e 's/\(^[A-Z]\{9\}\)\(CAGAGC\)\([A-Z]\+\).$/\1C\2\3/g' > R1_mod.fastq

I made sure that adding the nucleotide did not lead to any overlap (that is, no existing 10bp sub-barcode can be obtained by adding a "C" to a 9bp one).
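For anyone who prefers to do this step in Python, here is a minimal sketch of the same substitution. It assumes the sed pattern reads as: a 9bp sub-barcode at the start of the line, the CAGAGC linker, the rest of the read, and one trailing base that is dropped so the read keeps its length. The function names are hypothetical, and unlike the sed command it only touches the sequence line of each FASTQ record.

```python
import re

# Assumed reading of the sed pattern from this thread:
# 9bp sub-barcode, CAGAGC linker, rest of the read, one trailing base.
SHORT_BARCODE = re.compile(r"^([A-Z]{9})(CAGAGC)([A-Z]+).$")


def pad_short_barcode(seq):
    """Insert a C after a 9bp sub-barcode and drop the last base.

    Reads that do not match (for example, 10bp sub-barcodes) are returned unchanged.
    """
    return SHORT_BARCODE.sub(r"\g<1>C\g<2>\g<3>", seq)


def pad_fastq_lines(lines):
    """Yield FASTQ lines, rewriting only the sequence line (2nd of every 4)."""
    for i, line in enumerate(lines):
        text = line.rstrip("\n")
        yield pad_short_barcode(text) if i % 4 == 1 else text
```

To use it on a real file, you could open R1.fastq.gz with `gzip.open(path, "rt")`, pass the file handle to `pad_fastq_lines`, and write each yielded line followed by a newline. Because the read length is unchanged, the quality line still matches the sequence line.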

2) Modify your whitelist accordingly: Because you are adding a "C" to every 9bp sub-barcode, you need to reflect that in the whitelist. You can modify the list of the ligation barcodes as follows:

# Add a C at the end of every 9bp sub-barcode to make it 10bp long
ligation_bcs_lengthened = [bc + "C" if len(bc) == 9 else bc for bc in ligation_bcs[1]]

And then build your whitelist with this modified list and the unmodified second barcode list.
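A rough sketch of how the whitelist could be built from the two lists is below. It assumes the whitelist holds one combined 20bp barcode per line, formed as ligation sub-barcode followed by RT barcode. That order is an assumption and should be checked against what the SciRnaSeq technology expects. `build_whitelist` and `has_collision` are hypothetical helper names.

```python
from itertools import product


def lengthen(ligation_bcs):
    """Add a C to every 9bp ligation sub-barcode so all are 10bp long."""
    return [bc + "C" if len(bc) == 9 else bc for bc in ligation_bcs]


def has_collision(ligation_bcs):
    """Return True if lengthening makes two ligation sub-barcodes identical."""
    lengthened = lengthen(ligation_bcs)
    return len(set(lengthened)) != len(lengthened)


def build_whitelist(ligation_bcs, rt_bcs):
    """Combine every lengthened ligation sub-barcode with every RT barcode.

    Assumption: combined barcode = ligation sub-barcode + RT barcode.
    """
    return [lig + rt for lig, rt in product(lengthen(ligation_bcs), rt_bcs)]
```

The resulting list can be written to a text file, one barcode per line, and used as the whitelist.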

3) Run kallisto with the modified sci-RNA-seq technology: You can find that version of kallisto here: https://github.com/agalvezm/kallisto/tree/sci-RNA-seq

Another option is to modify the reads as you give them to kallisto, as follows:

kallisto bus -i index.idx -o output/ -x SciRnaSeq -t 2 <(zcat R1.fastq.gz | sed -e 's/\(^[A-Z]\{9\}\)\(CAGAGC\)\([A-Z]\+\).$/\1C\2\3/g') R2.fastq.gz

I have prepared a google Colab with this method that you can use to process your data: https://colab.research.google.com/drive/1ZAAdccLmqOU1guYxm2saEh3c7fs7M6ye?usp=sharing

You just need to add the links to your fastq files, as well as the links to the ligation and RT barcode sequences. If you don't have that information, you can generate a whitelist using bustools whitelist and therefore skip step 2. I am happy to help you do this if you have any problems.

Please let me know if you have any issues or questions! Hope this was helpful.

davemcg commented 2 years ago

Thank you so much for your help! I was unable to implement your suggestion as ... other stuff ... happened. I was about to pick this side project back up today and noticed that kallisto was updated to directly address this issue. Thanks everyone!

https://github.com/pachterlab/kallisto/releases/tag/v0.48.0

davemcg commented 2 years ago

Oops, just noticed that SmartSeq3 != sci-rna-seq3. I expect that the workaround above should work! I just wanted to comment to ensure other people stumbling here did not get confused.

BUStools / bustools

Bustools with sci-rna-seq 3 #75