alexdobin / STAR

RNA-seq aligner
MIT License

STARsolo for processing the Microwell-seq #647

Open zhou-ran opened 5 years ago

zhou-ran commented 5 years ago

Hi, Alex

Apologies if I have missed this somewhere. I am processing my 10X data with STARsolo, which is very convenient. But what if the scRNA-seq data has more than one barcode segment plus a UMI, as in Microwell-seq, where the barcode read contains two linker sequences and the barcode segments are not contiguous? Can STARsolo process this kind of data?

Thank you for any reply

Ran

alexdobin commented 5 years ago

Hi Ran,

at the moment, you would need to preprocess the reads (say, using the Python regex package, as UMI-tools does) to create two reads similar to Drop-seq/10X, with one read containing all parts of the cell barcode concatenated together, followed by the UMI. Only constant barcode lengths are supported.

I am gearing up to implement complex barcode configurations, so if you can tell me exactly what the Microwell-seq barcodes look like, I will try to include them.

Cheers Alex

zhou-ran commented 5 years ago

Hi Alex, it's great that you plan to support configurations for complex barcodes.

In Microwell-seq, the barcode segments are joined by two linker sequences. The actual barcode segments are located at positions 1-6, 22-27 and 43-48, and the UMI at 49-54. You could test on the SRR6954503 file.
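
A minimal sketch of the fixed-position preprocessing discussed above, assuming the quoted coordinates are 1-based and inclusive (as in Drop-seq tools) and that each barcode read is at least 54 nt long. The function names are hypothetical and are not part of STAR or UMI-tools; reading and writing the FASTQ files is left out.

```python
# 1-based inclusive coordinates quoted in the thread (assumed Drop-seq-tools style).
CB_RANGES = [(1, 6), (22, 27), (43, 48)]
UMI_RANGE = (49, 54)


def slice_1based(seq, start, end):
    """Return seq[start..end] using 1-based inclusive coordinates."""
    return seq[start - 1:end]


def extract_cb_umi(barcode_read):
    """Split a Microwell-seq barcode read into (cell barcode, UMI).

    Returns None if the read is too short to contain the UMI.
    """
    if len(barcode_read) < UMI_RANGE[1]:
        return None
    cb = "".join(slice_1based(barcode_read, s, e) for s, e in CB_RANGES)
    umi = slice_1based(barcode_read, *UMI_RANGE)
    return cb, umi


def rewrite_record(header, seq, plus, qual):
    """Build a Drop-seq/10X-like barcode record: 18-nt CB followed by 6-nt UMI.

    Quality values are sliced with the same coordinates as the bases.
    Returns None for reads that are too short.
    """
    parts = extract_cb_umi(seq)
    qual_parts = extract_cb_umi(qual)
    if parts is None or qual_parts is None:
        return None
    return header, parts[0] + parts[1], plus, qual_parts[0] + qual_parts[1]
```

The rewritten barcode read then has a constant layout (18-nt cell barcode, 6-nt UMI), which matches the constant-length requirement mentioned earlier in the thread.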

Thanks Ran

alexdobin commented 5 years ago

Hi Ran,

thanks! So the CB and UMI can be extracted from the fixed positions, no need to search for linker sequences? This makes it a bit easier.

Cheers Alex

zhou-ran commented 5 years ago

Hi Alex,

Yep! Actually, these are the parameters used in Drop-seq tools.

Thanks Ran

zhou-ran commented 5 years ago

Hi Alex, in STARsolo mode, why do we need to set --soloBarcodeReadLength? For example, STARsolo does not work well with 10X FASTQ files after QC, because not all R1 reads are still 150 bp long. How about only requiring that the CB and UMI fit within the read length? Thanks Ran

alexdobin commented 5 years ago

Hi Ran,

in --readFilesIn, the cDNA FASTQ needs to be supplied first and the barcode read second. For example, for 10X FASTQs, where the cDNA read is R2 and the barcode read is R1, the files need to be supplied as --readFilesIn R2.fq R1.fq. STAR will map them as single-end reads. The cDNA read can be trimmed and have variable length; its length is not checked.

The barcode read should not be trimmed at all. However, you can specify --soloBarcodeReadLength 0 to prevent STAR from checking its length.
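
A small sketch of how the argument order and the length-check switch described above could be assembled into a command line. The helper name, the paths and the choice of `CB_UMI_Simple` with a single whitelist are assumptions for illustration; only the flags themselves come from STAR.

```python
def build_star_solo_command(genome_dir, cdna_fastq, barcode_fastq, whitelist,
                            check_barcode_length=True):
    """Return a STAR argument list with the cDNA read first, barcode read second.

    Hypothetical helper: it only builds the list and does not run STAR.
    """
    cmd = [
        "STAR",
        "--genomeDir", genome_dir,
        # cDNA read (R2 for 10X) first, barcode read (R1 for 10X) second.
        "--readFilesIn", cdna_fastq, barcode_fastq,
        "--soloType", "CB_UMI_Simple",
        "--soloCBwhitelist", whitelist,
    ]
    if not check_barcode_length:
        # 0 tells STAR not to check the barcode read length.
        cmd += ["--soloBarcodeReadLength", "0"]
    return cmd


# Placeholder paths, for illustration only.
command = build_star_solo_command(
    "genome_index", "R2.fq", "R1.fq", "whitelist.txt",
    check_barcode_length=False,
)
print(" ".join(command))
```

The list can be passed to `subprocess.run(command, check=True)` once the placeholder paths point at real files.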

Cheers Alex

lin-zhongbao commented 3 years ago

Hi Alex: Sorry, I do not understand the "Barcode geometry" for complex barcodes. With --soloAdapterSequence, it seems that we can only supply one adapter sequence to anchor the barcodes. What if the barcode layout is BC1+Adaptor1+BC2+Adaptor2+BC3+Adaptor3+BC4+UMI+PolyT? Thanks Tao

alexdobin commented 3 years ago

Hi Tao,

what are the lengths of the BC1-3 and Adaptor1-3?

Cheers Alex

lin-zhongbao commented 3 years ago

Hi Alex:

BC1=8bp BC2=8bp BC3=8bp BC4=8bp UMI=12bp Adaptor1=30bp Adaptor2=25bp Adaptor3=20bp

Thanks Tao

alexdobin commented 3 years ago

Hi @lin-zhongbao

since the adapters and barcodes all have constant lengths, you should be able to process it with:

--soloType CB_UMI_Complex   
--soloCBposition   0_0_0_7   0_38_0_45   0_71_0_78   0_99_0_106   (barcode coordinates)
--soloUMIposition  0_107_0_118 (umi position)
--soloCBwhitelist WL1.txt WL2.txt WL3.txt WL4.txt   (these whitelists for the 4 barcodes).

In each x1_x2_x3_x4 tuple, x1=x3=0 indicates that you are measuring distances from the read start; x2=barcode start, x4=barcode end, both zero-based - please check my calculations.
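
The coordinates can be cross-checked with a short script. This is a sketch that assumes the segment lengths given earlier in the thread, no indels, and zero-based inclusive positions measured from the read start; the helper names are hypothetical.

```python
# Segment order and lengths as reported in the thread.
LAYOUT = [
    ("BC1", 8), ("Adaptor1", 30),
    ("BC2", 8), ("Adaptor2", 25),
    ("BC3", 8), ("Adaptor3", 20),
    ("BC4", 8), ("UMI", 12),
]


def segment_spans(layout):
    """Map each segment name to its zero-based inclusive (start, end)."""
    spans = {}
    pos = 0
    for name, length in layout:
        spans[name] = (pos, pos + length - 1)
        pos += length
    return spans


def read_start_tuple(span):
    """Format a span as x1_x2_x3_x4 with both anchors at the read start (0)."""
    start, end = span
    return f"0_{start}_0_{end}"


spans = segment_spans(LAYOUT)
cb_positions = [read_start_tuple(spans[n]) for n in ("BC1", "BC2", "BC3", "BC4")]
umi_position = read_start_tuple(spans["UMI"])

print("--soloCBposition", *cb_positions)
print("--soloUMIposition", umi_position)
```

Under these assumptions the script reproduces the values quoted above: `0_0_0_7 0_38_0_45 0_71_0_78 0_99_0_106` for the barcodes and `0_107_0_118` for the UMI.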

This approach should recover most of the barcodes. However, indels may screw up the distances of the farther barcodes. In this case, we can try to use the 2nd adaptor sequence to align the 3rd CB, 4th CB and UMI, something like

--soloType CB_UMI_Complex   
--soloCBposition   0_0_0_7   0_38_0_45   3_1_3_8   3_29_3_36   (barcode coordinates)
--soloUMIposition   3_37_3_48
--soloAdapterSequence Adaptor2sequence
--soloCBwhitelist WL1.txt WL2.txt WL3.txt WL4.txt   (these whitelists for the 4 barcodes).

x1=x3=3 indicates that we measure the distance from the adapter end.
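
The adapter-anchored offsets can be derived the same way. The sketch below assumes the segment lengths from the thread and takes the anchor position to be the last base of Adaptor2, so the base immediately after the adapter has offset 1; the helper names are hypothetical.

```python
# Segment order and lengths as reported in the thread.
LAYOUT = [
    ("BC1", 8), ("Adaptor1", 30),
    ("BC2", 8), ("Adaptor2", 25),
    ("BC3", 8), ("Adaptor3", 20),
    ("BC4", 8), ("UMI", 12),
]


def segment_spans(layout):
    """Map each segment name to its zero-based inclusive (start, end)."""
    spans = {}
    pos = 0
    for name, length in layout:
        spans[name] = (pos, pos + length - 1)
        pos += length
    return spans


def anchored_tuple(span, anchor_code, anchor_pos):
    """Format a span as x1_x2_x3_x4 relative to one anchor position."""
    start, end = span
    return f"{anchor_code}_{start - anchor_pos}_{anchor_code}_{end - anchor_pos}"


spans = segment_spans(LAYOUT)
adapter_end = spans["Adaptor2"][1]  # last base of the anchor adapter

# BC1 and BC2 precede the anchor and stay anchored to the read start (code 0).
cb_positions = [
    anchored_tuple(spans["BC1"], 0, 0),
    anchored_tuple(spans["BC2"], 0, 0),
    # BC3, BC4 and the UMI are measured from the adapter end (code 3).
    anchored_tuple(spans["BC3"], 3, adapter_end),
    anchored_tuple(spans["BC4"], 3, adapter_end),
]
umi_position = anchored_tuple(spans["UMI"], 3, adapter_end)

print("--soloCBposition", *cb_positions)
print("--soloUMIposition", umi_position)
```

With these inputs the output matches the second set of parameters: `0_0_0_7 0_38_0_45 3_1_3_8 3_29_3_36` and `3_37_3_48`.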

Please let me know how it goes. If it does not work, please send me a few hundred barcode reads (good ones, from the middle of the fastq), and also the whitelist files.

Cheers Alex

lin-zhongbao commented 3 years ago

Hi Alex:

Yes, it works very well. Thank you very much.

So now I understand that only one adaptor's start (or end) can be used as an anchor in the parameter "--soloCBposition".

In the multi-adaptor situation, indels may exist in any adaptor or barcode.

Could you please extend the parameter "--soloCBposition" if you have time?

Like this: start(end)Anchor defines the anchor base for the CB: 0: read start; 1: read end; 2: adapter1 start; 3: adapter1 end; 4: adapter2 start; 5: adapter2 end; 6: adapter3 start; 7: adapter3 end.

so for the complex barcode mode BC1+Adaptor1+BC2+Adaptor2+BC3+Adaptor3+BC4+UMI+PolyT,

then we can try to use all the three adaptor's sequence to align, something like

--soloType CB_UMI_Complex   
--soloCBposition   0_0_2_-1   3_1_4_-1   5_1_6_-1   7_1_7_8   (barcode coordinates)
--soloUMIposition   7_9_7_20
--soloAdapterSequence Adaptor1sequence Adaptor2sequence Adaptor3sequence
--soloCBwhitelist WL1.txt WL2.txt WL3.txt WL4.txt   (these whitelists for the 4 barcodes).

alexdobin commented 3 years ago

Hi Tao,

at the moment only one anchor adapter can be used, so you need to choose which of the three adapters you will use as the anchor.

Note that no indels are allowed in the barcodes or the anchor adapter, so by using one of the adapters as an anchor you will only recover the reads with indels in the other adapters. I am not sure if it's going to give many additional reads compared to the simple scheme where you anchor barcodes to the beginning of the read.

Cheers Alex