alexdobin / STAR

RNA-seq aligner
MIT License
1.87k stars 506 forks source link

SmartSeq2 with STARsolo gives only 1 cell #1477

Closed abhisheksinghnl closed 2 years ago

abhisheksinghnl commented 2 years ago

Dear Community,

I am trying to run STARsolo on my SmartSeq2 samples. I am using following command

STAR --runThreadN 16 --genomeDir /home/jupyter/HumanGenomes/ --readFilesCommand zcat --readFilesIn HRR204645_f1.fq.gz HRR204645_r2.fq.gz --soloType SmartSeq --outSAMtype BAM Unsorted --outBAMcompression -1 --soloUMIdedup Exact --outSAMattrRGline ID:sample1

The process doesn't show up any error. However, summary.csv file highlights that I have only 1 cell.

"Estimated Number of cell 1"

I might be missing something but I am not sure what I am missing.

Could anyone suggest something or help.

Thank you

alexdobin commented 2 years ago

Hi @abhisheksinghnl

in SmartSeq-like protocols, each pair of Read1/2 FASTQ files corresponds to one cell. Since you supplied only one pair, you get one cell in the output.

Cheers Alex

abhisheksinghnl commented 2 years ago

Hi @alexdobin,

Thank you for your reply. These are pretty huge files in the range of 22 to 25Gbs.

Here is a snapshot of one of the file in question.

R1

@A00877:307:H5H77DSXY:2:1101:1325:1000 1:N:0:TGACCAAT
GNAGGGAGACGTCTACATCTGCCAAGTGGAGCACACCAGCCTGGACAGTCCTGTCACCGTGGAGTGGAAGGCACAGTCTGATTCTGCCCGGAGTAAGACATTGACGGGAGCTGGGGGCTTCATGCTGGGGCTCATCATCTGTGGAGTGGG
+
F#FFFF:FFFFFFFFFFFFFFFF:FFFFFFFFFF,FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF,:FFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFF:F:F
@A00877:307:H5H77DSXY:2:1101:1380:1000 1:N:0:TGACCAAT
TNCTAGCAGCATTGGCCTTGGCAAGTCACTGGTAACTGTTTTCTGTAAAGCAGAGGTTGCCCACTTCATTAGACTGTAAGAACTGAATGAGAAAAGAGTAGGAGAGTACTCTGTAAACACAAGTGATAGGGAAGTTACCATCACCACTCC
+
F#FF:FFFFFFFFFFFFFFFFFFFFFFFFFF::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF:F:FFFFFFFF
@A00877:307:H5H77DSXY:2:1101:1524:1000 1:N:0:TGACCAAT
GNGATGGGTCTTGCTATGTTGCACAGGCTGGTCTTGAACTCCTGGATTTAAGTGATCTTTTTACTCTAAAATGTAATCTAAATAATAACAAAATAAATATTGAGCTGAAGAAAAGAAAAAGGAAACAGTGATTTTCATGTCTGCTATGTG

R2

@A00877:307:H5H77DSXY:2:1101:1325:1000 2:N:0:TGACCAAT
TGGCTTCATGCAGGAGTTTTTTTTTTTTTTTTTTTTTTTTTTTGGTGTTTTTTATTTTTTGTGGTAAAAATAAGGGAAAAAGGTTGTAGTCAAAGTGTTAGTTAAAGTGGATTGATAAAAAAAGCAAAAATTTATAAAATAAGATAATAG
+
FFFFFFFFFFFFFFF:F,F:,FF::FFF,F:FFF:FF:F:FF,F:F,FFF::F:FF::F,FFF,,,,,F,,FF,:F,F,,FF,:::,FF,F,,F,:,,,:,,,F,,,F,,,FF,,,FF:F:,F,FF,,,,,,:F,:,FF,:,,,,,,,:,
@A00877:307:H5H77DSXY:2:1101:1380:1000 2:N:0:TGACCAAT
GATAGACAGTGACGCCTTTTTTTTTTTTTTTTTTTTTTTTTTTTAAAGTTTTGGGGTTGGTAACTAACATAAAAATTGTTTTTAAATTGTAATAAACAAAAATTTTAAAATAAAATAATTACATTAAAATTAATGTGCCAACCAATGGTT
+
FFFFFFFFFFFFFFFFF:FFFFF,FFFF:,FFFFFFFFF:FFF,,F,,,,,,:F,,,,,,,F:,:,,,,,F,,,,:F:,FFF:,,,,:,,,,,,F,F,,,,,F,FF,,F,:,,,:,,,::,,,:,FF,F,,,,,,F:,:F,,,,,,,,,F
@A00877:307:H5H77DSXY:2:1101:1524:1000 2:N:0:TGACCAAT
GGACTATATTTACATTGTGGCTTTGCCATTTTCTAGATTTTTTTTACTTTGGACAAATTATTTAAACTCTTTGAACCTCATTGTTCTCATCTGTGCAGATGATGCTCACTTCAGAGAAGATGACGCACATACAACACATTAAACCTAGTG

If you could confirm, whether I am missing something or not.

Best regards Abhishek

alexdobin commented 2 years ago

Hi Abhishek,

if the files are so large, it's possible that they have not been demultiplexed by cell. There is the 8b index sequence at the end of the read name, which may correspond to different cells, then you could split the files by that index. But you would need to confirm it with people who sequenced this library.

Cheers Alex

abhisheksinghnl commented 2 years ago

Hi @alexdobin ,

thank you again for your reply it does need to de multiplexed. I am new to this SmartSeq platform and I have no idea how can I do it. Could you please point me to the direction of a thread or a small chunk of code that could do this for me?

Thank you

alexdobin commented 2 years ago

Hi Abhishek,

usually, the standard Illumina pipeline does the demultiplexing. If you have to do it yourself, you would need to have index-to-cell correspondence table. Note that some index sequences contain errors, and so should be discarded. So at the very least, you need to know the list of indexes used in the library generation.

If each index corresponds to just one cell, it's a relatively simple parsing exercise: each read (4 lines) is output in the file named according to the index sequence. With awk, it would be something like this:

awk '{ file1=substr($2,7) "_R1.fq"; print > file1; for (ii=2;ii<=4;ii++) {getline; print > file1} }'    Read1.fq

and similar for Read2.

Cheers Alex

abhisheksinghnl commented 2 years ago

Thank you soo much