alexdobin / STAR

RNA-seq aligner
MIT License
1.85k stars 505 forks source link

SmartSeq v4 libraries - 2% mapping rate #971

Open Yogita121984 opened 4 years ago

Yogita121984 commented 4 years ago

Hi I am working with Takara Smartseq4 "reversely stranded" libraries. Surprisingly most of the samples have 2-10% mapping rate and everything remaining is "too short to map". I checked read length and that seems fine. Any help/input would be of great help. These are human cells, sequenced using a nextseq 500 machine (2*75 bases). Thanks

alexdobin commented 4 years ago

Hi Yogita,

I am not sure what "reversely stranded" means in this case: is it read1 reverse to RNA, read2 same as RNA? This is actually the same as standard Illumina TruSeq, so should not be a problem.

The first thing I would recommend is to map one mate at a time, to check if there is a problem with pairing of the reads.

Cheers Alex

Yogita121984 commented 4 years ago

Hi Alex Thanks a lot for coming back. Yes, it's standard Illumina truseq. I tried fastq_screen as well, just to check for any contamination and surprisingly a huge fraction (~50-55%) is mapping (uniquely) to mouse. We could not find any possible source of contamination during library prep or sample collection. These should be human cells. This makes me wonder, since mouse and human genome shares high similarity, could it be result of that? But then at least those reads should have mapped to human as well. I ran STAR with default parameters, I will now try the following: 1) Trim first 3 bases of Read2 - recommended by Takara kits company. 2) Try STAR with few number of mismatches allowed.

Please let me know if you have any other suggestions/recommendations?. This is very puzzling 😳.

Yogita121984 commented 4 years ago

I will also follow the suggestion of mapping Read1 and read2 separately as single end data.

alexdobin commented 4 years ago

Hi Yogita,

I think mapping Read1/2 separately is the best first step. If Read1 and Read2 do not map well separately, you can BLAST the unmapped reads to check for contamination. If mouse hits appear "stronger" (i.e. fewer mismatches/indels) than human, it would definitely point to contamination. Another way is to remap the unmapped reads with STAR to the mouse genome.

Cheers Alex

Yogita121984 commented 4 years ago

Hi Alex, First of all Thanks a lot for helping me out. I have now tried the following: Step 1) Map read 1 and read 2 separately (as single ends). Observation: Overall %age of Read1 mapping is slightly higher (~3-4%) in comparison to Read2. If I compare these percentages with mapping as paired end, it is 4-5% higher in general. I have total of 14 cells (these are single cell in each well prepared with SmartSeq2 protocol).

Step2) I took unmapped reads generated as "Unmapped.out.mate1" and mapped to mouse genome (separately for READ1 and READ2). It varies for each samples but around 39 to 82% of unmapped reads are mapping to mouse genome.

The only strange thing here is, that we could not find any possible contamination reasons i(n library prep and sample collection) to explain this observation.

It would be very helpful if you could share your thoughts on the following:

1) Is it normal to have 4-5% of difference in paired-end vs single end mapping of datasets? 2) I am using standard/default STAR mapping settings (only looking at unique alignments -- OutFilterMultimapNmax 1). I also tried changing --sjdbOverhang 74 (my data is 2*75 base pair). But I did not see any improvement in mapping by changing sjdbOverhang to 74. 3) I can try allowing fewer mismatches than default - but honestly, I do not think it is going to make any big difference. In my experience, most of the times STAR works really well with default settings. 4) I have observed that overall mapping rate of Smart-seq2 libraries is not great (usually ~55-69% in my other datasets). 5) Lastly, I am using STAR version 2.5.0a, now checking with 2.7.0 - will update you in case it makes any change. 6) Any reason for me to switch to STAR SOLO? I see that you have some Smart-Seq2 libraries specific options, but my data does not have UMI's, so I am assuming it will not work in this case.

7) Any other thing that comes to your mind, that might help explaining this observation computationally?

PS: I also ran fastq_screen on these samples and mapped reads (using default aligner bowtie2) to mouse, human, adapter, vectors etc. and observation is same - There are reads in each sample that uniquely aligns to human and mouse.

Thanks a lot. Looking forward to hearing from you. Best, Yogita

alexdobin commented 4 years ago

Hi Yogita,

if a sample maps better to the mouse genome than to a human genome, there is no way to reverse that computationally. It's most likely contamination at some point before the FASTQ files are produces. If a samples maps really well to a mouse genome (i.e. low mismatch rate <1% and large mapped length, >95% of the read length), it could be a sample swap issue, i.e. a human sample got mixed up with a mouse sample.

Overall low rate of mapping is a warning sign. Again, I would BLAST the unmapped reads to check for contamination.

STARsolo counts reads per cell per gene, but it does not change the alignments.

Cheers Alex

Yogita121984 commented 4 years ago

Thanks a lot Alex. I guess in this case, since I do see a good mapping rate to mouse, I think human sample somehow got mixed with mouse. Since these were single cells, we do have strong biological knowledge about the cells we are looking into, strangely I do see the expected gene expression BUT to mouse instead of human.

Additionally, even in mouse mapping only ~5% of total reads that maps to mouse are contributing to genes (Feature counts output). These were rRNA depleted libraries, I am now looking more into this.

I will do blast as well. Thanks once again. I will post my updates. Best, Yogita

alexdobin commented 4 years ago

Hi Yogita,

there is another test that you can run - map each cell simultaneously to mouse and human genome, and count unique alignments to each genome. This is a typical test that is often done for an artificial mixture of mouse and human cells, to check if the technology creates doublets. I guess in your case you may see that either each cell has both human and mouse unique reads, or you have a mixture of mouse and human cells, so it might point to different sources of contamination.

5% "genic" reads are too small, again it strongly points to issues with the libraries. I would check the duplication rate of the sequences. If you see a very low diversity of the sequences, it might point to overamplification of some background RNA instead of the cells of interest.

Good luck Alex

Yogita121984 commented 4 years ago

Hi Alex Mapping simultaneously to human and mouse is a great idea. I have done it for some of my 10X samples to separate host vs grafts single cells from Rat and Human but have no experience of doing it directly with STAR.

Does STAR accepts two .fasta and GTF files? Or am I suppose to combine them before generating the index?

I agree, 5% genic reads are too small, but even if we have some background RNA, it should have mapped to human not mouse.

Best, Yogita

alexdobin commented 4 years ago

Hi Yogita,

to combine the genomes, you can specify separate FASTA files - but the chromosome names have to be made distinct between species. The GTF files (with fixed chromosome names) need to be concatenated into one file. Also, please make sure that the transcript_id and gene_id are distinct between species - though it's usually the case in most databases (e.g. Gencode, Ensembl).

Cheers Alex