STAR, Strandedness - GitHub Issues

laudycherry commented 6 months ago

Hello,

I am analyzing some RNA samples and have noticed an issue with the quantification step in STAR.

The data were generated by an external company using the TruSeq3-PE-2 library, which, as far as I know, is a stranded library kit. For alignment and quantification with STAR, I used the following command:

STAR --genomeDir ~/genome/STAR \
  --runThreadN 8 \
  --readFilesCommand zcat \
  --readFilesIn /home/laudy/data/raw/Sample_1.fq.gz /home/laudy/data/raw/Sample_2.fq.gz \
  --quantMode GeneCounts \
  --sjdbGTFfile ~/genome/Mus_musculus.GRCm39.110.gtf \
  --outFileNamePrefix /home/laudy/data/ReadCounts/S18_Musmusculus.GRCm39.110

After running the samples, I noticed that the ReadsPerGene.out file generated by STAR (attached as ReadsPerGene.out.txt) has similar counts in columns 3 and 4 (strand 1 and strand 2). This is not consistent with the strandedness of the library: as far as I know, a stranded library should give much higher counts on one strand than on the other. I have tried other scripts to solve this, but I still encounter the same problem.
I also need to trim adapters with Trimmomatic. Is it possible to do that and then use STAR for quantification? I tried passing the Trimmomatic output files to STAR, but STAR seems to accept only .fq files. In this situation, what is the best way to trim adapters and then run STAR?
What is the best tool to determine the strand of each gene after quantification? Thank you in advance for your response.

alexdobin commented 6 months ago

Hi @laudycherry

If columns 3 and 4 have similar counts, it likely means that the library is not stranded. In that case, the only way to infer strand is by using spliced intron motifs. STAR should work fine with pre-trimmed reads, provided that the trimmer keeps the reads in the same order in Read1 and Read2.
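The column comparison described above can be automated. The following is a minimal sketch, not part of STAR itself: it sums columns 2 to 4 of a ReadsPerGene file (skipping the four leading N_* summary rows) and compares the totals of columns 3 and 4. The function names, the 0.8 cutoff, and the toy counts are assumptions made for illustration only.

```python
import io

def strand_totals(handle):
    """Sum columns 2-4 over gene rows, skipping the N_* summary rows."""
    totals = [0, 0, 0]
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4 or fields[0].startswith("N_"):
            continue
        for i in range(3):
            totals[i] += int(fields[i + 1])
    return tuple(totals)

def guess_strandedness(col3, col4, ratio=0.8):
    """Classify the library from column 3 vs column 4 totals.

    The 0.8 cutoff is an arbitrary assumption, not a STAR default.
    """
    total = col3 + col4
    if total == 0:
        return "undetermined"
    frac = col3 / total
    if frac >= ratio:
        return "forward (use column 3)"
    if frac <= 1 - ratio:
        return "reverse (use column 4)"
    return "unstranded (use column 2)"

# Made-up counts in the ReadsPerGene layout, for illustration only.
EXAMPLE = (
    "N_unmapped\t100\t100\t100\n"
    "N_multimapping\t50\t50\t50\n"
    "N_noFeature\t30\t400\t35\n"
    "N_ambiguous\t20\t5\t6\n"
    "GeneA\t100\t4\t96\n"
    "GeneB\t50\t1\t49\n"
)

totals = strand_totals(io.StringIO(EXAMPLE))
print(totals, guess_strandedness(totals[1], totals[2]))
```

To run it on real output, replace the io.StringIO(EXAMPLE) handle with an open file handle for the ReadsPerGene file. Similar totals in columns 3 and 4, as reported in this thread, would fall into the "unstranded" branch.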

laudycherry commented 6 months ago

Dear @alexdobin, I appreciate your answer. I have checked the library used: it is TruSeq stranded, yet I am still getting similar counts in columns 3 and 4, and I cannot understand why. What should I do in this case? Is it acceptable to use the data from column 2 in DESeq2? Separately, I ran Cufflinks after STAR to determine the strands, but Cufflinks does not provide the Ensembl gene ID. Is there any tool that can convert Cufflinks IDs to Ensembl gene IDs? Thank you in advance for your response.

alexdobin commented 6 months ago

Hi @laudycherry, it is possible that the library is unstranded - a mistake made by the sequencing provider. It is acceptable to use unstranded counts in this case, but you will miss the genes that overlap on opposite strands.
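As a rough illustration of using the unstranded counts, the sketch below collects column 2 from several ReadsPerGene files into one gene-by-sample table, which is the general shape a count matrix for DESeq2 takes. The function names, sample names, and counts are hypothetical. Column 1 carries the gene IDs from the GTF passed to STAR, so with an Ensembl GTF the rows are already keyed by Ensembl gene ID.

```python
import io

# 0-based field index for each counting mode in a ReadsPerGene row.
COLUMN_FOR = {"unstranded": 1, "forward": 2, "reverse": 3}

def read_gene_counts(handle, strandedness="unstranded"):
    """Return {gene_id: count} from one ReadsPerGene file, skipping N_* rows."""
    col = COLUMN_FOR[strandedness]
    counts = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4 or fields[0].startswith("N_"):
            continue
        counts[fields[0]] = int(fields[col])
    return counts

def build_matrix(samples):
    """Combine {sample: {gene_id: count}} into a header and sorted gene rows."""
    names = list(samples)
    genes = sorted(set().union(*(s.keys() for s in samples.values())))
    header = ["gene_id"] + names
    rows = [[g] + [samples[n].get(g, 0) for n in names] for g in genes]
    return header, rows

# Made-up inputs for two hypothetical samples, S1 and S2.
S1 = "N_unmapped\t9\t9\t9\nGeneA\t10\t5\t5\nGeneB\t5\t3\t2\n"
S2 = "N_unmapped\t7\t7\t7\nGeneA\t20\t11\t9\nGeneB\t0\t0\t0\n"

samples = {
    "S1": read_gene_counts(io.StringIO(S1)),
    "S2": read_gene_counts(io.StringIO(S2)),
}
header, rows = build_matrix(samples)
for row in [header] + rows:
    print("\t".join(str(x) for x in row))
```

As noted above, reads falling on genes that overlap on opposite strands cannot be assigned in unstranded mode, so those genes will be undercounted in such a table.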

alexdobin / STAR

Star, Strandedness #2093