alexdobin / STAR

RNA-seq aligner
MIT License

single cell RNA-seq STAR mapped % of reads unmapped: too short is huge #1332

Open 10KGenomics opened 2 years ago

10KGenomics commented 2 years ago

Dear sir,

I am analyzing single-cell transcriptome data with the Drop-seq pipeline and using STAR for alignment. In Drop-seq, R1.fastq contains only the barcode and UMI sequences, so it does not need to be aligned; only the 150 bp R2.fastq needs to be mapped. My command is:

STAR --genomeDir $STAR_INDEX --runThreadN ${Thread} --readFilesIn sample_R2.fastq --outFileNamePrefix sample-

This is one of the log files: (image)

Later, I read other GitHub issues suggesting that I modify the parameters --outFilterScoreMinOverLread and --outFilterMatchNminOverLread. I set both to 0.3, so my command became:

STAR --genomeDir $STAR_INDEX --runThreadN ${Thread} --readFilesIn sample_R2.fastq --outFileNamePrefix sample- --outFilterScoreMinOverLread 0.3 --outFilterMatchNminOverLread 0.3

This is one of the log files: (image)

In addition, I randomly extracted 10,000 reads from R2.fastq and ran blastn on them to check for contamination. Nearly 9,000 reads matched mouse, so the species is correct. How reasonable is it to set --outFilterScoreMinOverLread and --outFilterMatchNminOverLread to 0.3? Should I change other parameters instead, or use additional ones? Thank you very much for your reply!
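As a rough illustration of what the two thresholds in the question mean: --outFilterScoreMinOverLread and --outFilterMatchNminOverLread are both expressed as a fraction of the read length (one applies to the alignment score, the other to the number of matched bases), and the STAR manual lists 0.66 as the default for each. The sketch below only does that arithmetic for a 150-base read. The helper name is hypothetical and not part of STAR, and STAR's internal rounding may differ slightly.

```python
def approx_threshold(read_length: int, fraction: float) -> int:
    """Approximate minimum score / matched bases implied by a
    length-normalized filter (hypothetical helper, not STAR code)."""
    return round(read_length * fraction)


# Compare the documented default (0.66) with the relaxed value (0.3)
# for the 150-base R2 reads described in the question.
for fraction in (0.66, 0.3):
    print(fraction, approx_threshold(150, fraction))
```

With 0.3, an alignment covering only about 45 of 150 bases can pass the filter, so the mapped percentage rises, but most of each such read remains unaligned and still needs an explanation.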

10KGenomics commented 2 years ago

This is my cleaned R2.fastq file, subsampled to 100,000 randomly extracted reads: sample_R2.fastq.gz

alexdobin commented 2 years ago

Hi @caijingtao1993

I think the low mappability in your case may be caused by two issues:

  1. We see that the mapped length is significantly shorter than the read length. One reason for that could be that adapters are present in the read sequences (i.e. insert size was smaller than the read length). Another reason is poor sequencing quality in the tails of the reads.
  2. Mismatch rate is quite high (~0.88%) in the 2nd run. This could be due to poor sequencing quality, or the divergence of the genotype from the reference.

I would recommend checking sequencing quality (i.e. distribution of quality scores along the read length) and trimming the adapters and/or poor quality tails.

Cheers Alex
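The quality check recommended above (the distribution of quality scores along the read length) can be sketched as follows. This is a minimal sketch, assuming standard 4-line FASTQ records with Phred+33 encoding; the function names are illustrative, and dedicated tools such as FastQC report the same per-position information in more detail.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) from 4-line FASTQ records."""
    record: List[str] = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            yield record[0], record[1], record[3]
            record = []


def mean_quality_per_position(lines: Iterable[str],
                              phred_offset: int = 33) -> List[float]:
    """Return the mean Phred score at each read position (0-based)."""
    totals: List[int] = []
    counts: List[int] = []
    for _, _, quality in read_fastq(lines):
        for i, ch in enumerate(quality):
            if i == len(totals):  # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - phred_offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


# Tiny in-memory example in which quality drops toward the 3' end.
example = ["@r1", "ACG", "+", "II#", "@r2", "ACG", "+", "I##"]
print(mean_quality_per_position(example))
```

For a real file, the same function accepts an open text handle, for example `gzip.open("sample_R2.fastq.gz", "rt")` from the standard `gzip` module. A mean score that falls steadily toward the end of the read would support trimming the tails.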

10KGenomics commented 2 years ago

Dear sir: Thank you very much for your reply. I ran quality control on sample_R2.fastq, and the results are as follows: (image) (image)

There is indeed low-quality data at the tails of the reads. I filtered the data for quality and mapped it with STAR again. Command: STAR --genomeDir $STAR_INDEX --runThreadN ${Thread} --readFilesIn sample_R2.fastq --outFileNamePrefix sample- The results are as follows: (image) (image) (image)

Although trimming with Trim Galore increased the uniquely mapped rate from 58.33% to 63.89%, the improvement is small. In addition, the reference genome should not be the problem: other samples in the same batch have uniquely mapped rates as high as 80%. Dear sir, is there any way to improve the mapping of this data?

alexdobin commented 2 years ago

Hi @caijingtao1993

After trimming, you get only ~23% unmapped reads, which is not bad. You could try trimming the reads more aggressively, e.g. mapping just the first 100 bases. If this does not help, I would try BLASTing the unmapped reads to check for contamination.

Cheers Alex
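The two suggestions above can be sketched in a few lines. This is a minimal sketch, assuming 4-line FASTQ records; the function names are illustrative. STAR itself has a --clip3pNbases option that removes a fixed number of bases from the 3' end, which matches "keep the first 100 bases" only when every read has the same length, and --outReadsUnmapped Fastx writes the unmapped reads to a separate file that could then be converted for BLAST.

```python
from typing import Iterable, Iterator


def truncate_fastq(lines: Iterable[str], max_len: int = 100) -> Iterator[str]:
    """Keep only the first `max_len` bases (and qualities) of every read."""
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i % 4 in (1, 3):  # sequence and quality lines
            line = line[:max_len]
        yield line


def fastq_to_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Convert 4-line FASTQ records to FASTA lines, e.g. as BLAST input."""
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i % 4 == 0:
            yield ">" + line[1:]  # "@name" becomes ">name"
        elif i % 4 == 1:
            yield line


# Small in-memory example; real input would be an open FASTQ file handle.
reads = ["@r1", "ACGTACGT", "+", "IIIIIIII"]
print(list(truncate_fastq(reads, max_len=5)))
print(list(fastq_to_fasta(reads)))
```

Truncating in a separate step keeps reads that are already shorter than 100 bases after adapter trimming unchanged, whereas a fixed 3' clip would shorten them further.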