alexdobin / STAR

RNA-seq aligner
MIT License

STARsolo CB UMI Adapter trimming from reads with cDNA sequences #1631

Open ttn1883 opened 2 years ago

ttn1883 commented 2 years ago

My libraries were built using the 10X Chromium Next GEM Single Cell 3ʹ Reagent Kits v3.1.

My Read 1 sequence looks like this:

@A00738:420:HY77NDSX3:3:1101:1217:1000 1:N:0:CGGAACCCAA+TCCTCGAATC
NGCTCCATCCATCTCGGTCTTCAATTGGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGAGCTGTAAAAACCCCGGATGCAGGCACCCAAAATACCCACCAATAAAAACAAATTAAATAATTTAAAATTGACACATAAAAATTAAAGAA

When I used --soloBarcodeReadLength 150, I got 68.4% alignment, but that basically ignores the cDNA data in Read 1.

When I tried: --soloType CB_UMI_Simple --soloBarcodeMate 2 --clip5pNbases 58 2 --soloCBstart 1 --soloCBlen 16 --soloUMIstart 17 --soloUMIlen 12, alignment went down from 68% to 4%.

I input my Read 2 first and Read 1 second. Read 1 includes the CB/UMI/Adapter/cDNA sequences.

  1. Which of my STARsolo parameters are incorrect?
  2. Is my --clip5pNbases 58 2 correct? Do I ignore CGGAACCCAA+TCCTCGAATC ? Do I ignore TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT?

Attached is the log.out output.

Thank you for your input.

alexdobin commented 2 years ago

Hi @ttn1883

For the standard 10X 3' libraries, please use parameters as described here: https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md

Cheers Alex

ttn1883 commented 2 years ago

Hi @alexdobin

I followed the exact same guide, but my alignment was 4%. For the 3' protocol, the CB is 16 bases, the UMI is 12 bases, and the poly(dT) is 30 bases.
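The arithmetic behind the 58-base figure can be sketched as follows. This is a minimal illustration, assuming the Read 1 layout described in this thread (16-base CB, 12-base UMI, 30-base poly(dT), then cDNA). `split_read1` and the segment names are hypothetical helpers, not part of STAR, and the example read is assembled from the first 28 bases of the posted Read 1 plus a shortened cDNA tail.

```python
# Assumed 10x 3' v3.1 Read 1 layout (lengths taken from the thread).
CB_LEN, UMI_LEN, POLY_T_LEN = 16, 12, 30


def split_read1(seq):
    """Split a Read 1 sequence into CB, UMI, poly(dT) and cDNA segments."""
    cb_end = CB_LEN
    umi_end = cb_end + UMI_LEN
    poly_t_end = umi_end + POLY_T_LEN
    return {
        "cb": seq[:cb_end],
        "umi": seq[cb_end:umi_end],
        "poly_t": seq[umi_end:poly_t_end],
        "cdna": seq[poly_t_end:],
    }


# First 28 bases of the posted read, an assumed 30-base poly(dT),
# and a shortened cDNA tail for brevity.
read1 = "NGCTCCATCCATCTCG" + "GTCTTCAATTGG" + "T" * 30 + "GGAGCTGTAAAAACCCCGGATG"
parts = split_read1(read1)
bases_before_cdna = CB_LEN + UMI_LEN + POLY_T_LEN  # 16 + 12 + 30
print(parts["cb"], parts["umi"], len(parts["poly_t"]), bases_before_cdna)
```

Under this layout, everything before position 59 of Read 1 is barcode or poly(dT), which is where the value 58 comes from.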

Therefore, I used --clip5pNbases 58 2 and --soloBarcodeMate 2 to specify that mate2 is the one with the CB/UMI.
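How the two values of --clip5pNbases pair up with the input files can be sketched like this. The sketch assumes, following the STAR manual's description of --clip5pNbases as per-mate values, that the first value applies to the first file in --readFilesIn and the second value to the second file. `clip_per_file` is a hypothetical helper for illustration only, and the file names are placeholders.

```python
def clip_per_file(read_files, clip_values):
    """Pair each --clip5pNbases value with the file it would apply to.

    Assumption: values are given in mate order, which is the order of
    the files in --readFilesIn; a single value applies to every mate.
    """
    if len(clip_values) == 1:
        clip_values = clip_values * len(read_files)
    if len(clip_values) != len(read_files):
        raise ValueError("need one clip value, or one per input file")
    return dict(zip(read_files, clip_values))


# Read 2 (cDNA only) is listed first, Read 1 (CB/UMI/poly(dT)/cDNA) second.
files = ["R2.fastq.gz", "R1.fastq.gz"]  # placeholder file names

as_posted = clip_per_file(files, [58, 2])  # --clip5pNbases 58 2
swapped = clip_per_file(files, [0, 58])    # --clip5pNbases 0 58

print(as_posted)
print(swapped)
```

Under that assumption, `58 2` would remove 58 bases from the cDNA-only Read 2 and only 2 bases from Read 1, whereas `0 58` would leave Read 2 intact and clip the CB, UMI, and poly(dT) from Read 1. Whether the swapped order restores the alignment rate has not been confirmed in this thread.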

Thank you

alexdobin commented 2 years ago

Hi @ttn1883

with standard parameters, you need to use --readFilesIn Read2 Read1, i.e. the first file should be cDNA read and the 2nd file should be the barcode read.

ttn1883 commented 2 years ago

Hi @alexdobin

I did that as well. Thank you for taking the time to troubleshoot. Please see my complete command below:

STAR --genomeLoad NoSharedMemory --outSAMattributes All --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts --readFilesCommand zcat --runThreadN 16 --sjdbGTFfile /GCF_016700215.2_bGalGal1.pat.whiteleghornlayer.GRCg7w_genomic.gtf --outReadsUnmapped Fastx --outMultimapperOrder Random --genomeDir /STAR_indices/ --readFilesIn /R2_001.fastq.gz /R1_001.fastq.gz --outFileNamePrefix /Taylor_012 --soloCBwhitelist /3M-february-2018.txt --soloType CB_UMI_Simple --soloBarcodeMate 2 --clip5pNbases 28 2 --soloCBstart 1 --soloCBlen 16 --soloUMIstart 17 --soloUMIlen 12

alexdobin commented 2 years ago

Hi @ttn1883

in the latest parameters set, please remove --soloBarcodeMate 2. Also, I am not sure why you are using --clip5pNbases 28 2

ttn1883 commented 1 year ago

I removed --soloBarcodeMate 2 and --clip5pNbases 28 2 and received the following error. Will try adding --soloBarcodeReadLength 0 next.

STAR --genomeLoad NoSharedMemory --outSAMattributes All --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts --readFilesCommand zcat --runThreadN 16 --sjdbGTFfile /GCF_016700215.2_bGalGal1.pat.whiteleghornlayer.GRCg7w_genomic.gtf --outReadsUnmapped Fastx --outMultimapperOrder Random --genomeDir /STAR_indices/ --readFilesIn /R2_001.fastq.gz /R1_001.fastq.gz --outFileNamePrefix /Taylor_012 --soloCBwhitelist /3M-february-2018.txt --soloType CB_UMI_Simple --soloCBstart 1 --soloCBlen 16 --soloUMIstart 17 --soloUMIlen 12

STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source

Aug 31 12:09:02 ..... started STAR run
Aug 31 12:09:03 ..... loading genome
Aug 31 12:09:06 ..... processing annotations GTF
Aug 31 12:09:13 ..... inserting junctions into the genome indices
Aug 31 12:09:31 ..... started mapping

EXITING because of FATAL ERROR in input read file: the total length of barcode sequence is 150 not equal to expected 28
Read ID=@A00738:420:HY77NDSX3:3:1101:1217:1000 ; Sequence=NGCTCCATCCATCTCGGTCTTCAATTGGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGAGCTGTAAAAACCCCGGATGCAGGCACCCAAAATACCCACCAATAAAAACAAATTAAATAATTTAAAATTGACACATAAAAATTAAAGAA
SOLUTION: check the formatting of input read files.
If UMI+CB length is not equal to the barcode read length, specify barcode read length with --soloBarcodeReadLength
To avoid checking of barcode read length, specify --soloBarcodeReadLength 0
Aug 31 12:09:31 ...... FATAL ERROR, exiting
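The check described in this error message can be sketched as follows. This illustrates the logic as the message states it and is not STAR's actual source code. `check_barcode_read_length` is a hypothetical function, and treating the value 1 as "CB length + UMI length" is an assumption based on the STAR manual's description of --soloBarcodeReadLength.

```python
def check_barcode_read_length(read_len, cb_len, umi_len, solo_barcode_read_length=1):
    """Return True if a barcode read of read_len passes the length check.

    Assumed semantics of --soloBarcodeReadLength:
      0 -> skip the check entirely
      1 -> expect exactly cb_len + umi_len (assumed default)
      N -> expect exactly N bases
    """
    if solo_barcode_read_length == 0:
        return True
    if solo_barcode_read_length == 1:
        expected = cb_len + umi_len
    else:
        expected = solo_barcode_read_length
    return read_len == expected


# Values from the thread: 150-base Read 1, 16-base CB, 12-base UMI.
default_ok = check_barcode_read_length(150, 16, 12)        # 150 vs. 16 + 12
explicit_ok = check_barcode_read_length(150, 16, 12, 150)  # 150 vs. 150
skipped_ok = check_barcode_read_length(150, 16, 12, 0)     # check disabled
print(default_ok, explicit_ok, skipped_ok)
```

With the default setting, the 150-base Read 1 fails the check because 16 + 12 = 28, which matches the "expected 28" in the message.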

ttn1883 commented 1 year ago

Removing --soloBarcodeMate 2 and --clip5pNbases 28 2 and then adding --soloBarcodeReadLength 0 works. However, it is essentially the same as --soloBarcodeReadLength 150, meaning that we are completely ignoring the cDNA in Read1. So far, no solution allows using the cDNA from both Read1 and Read2 for mapping.