FelixKrueger / SNPsplit

Allele-specific alignment sorting
http://felixkrueger.github.io/SNPsplit/
GNU General Public License v3.0
52 stars 20 forks source link

High level of unassigned reads #24

Closed cap76 closed 5 years ago

cap76 commented 5 years ago

Possibly not an issue, but wanted to double check. I'm getting very low rates mapping to either genome when looking at a number of different mouse lines.

I'm using SNPsplit to identify reads coming from specific genomes (GRCm39 vs C3H_HeJ).

Basic pipeline is:

./SNPsplit_genome_preparation --vcf_file file.vcf --strain C3H_HeJ --reference_genome

STAR --runMode genomeGenerate --genomeDir pathtodir --genomeFastaFiles pathtogenome/*.fa

STAR --genomeDir pathtogenome --sjdbGTFfile pathtoannotation/genes.gtf --readFilesIn file1.fastq file2.fastq --quantMode GeneCounts --alignEndsType EndToEnd --outSAMattributes NH HI NM MD --outSAMtype BAM Unsorted

./SNPsplit --SNP_file pathtoSNPsfile/file.txt Aligned.out.bam

I'm getting mapping to GRCm38 of around 0.6% and to C3H_HeJ around 0.6%. Has anyone done similar and is this in line with the sort of numbers you would expect for mouse strains? I guess there's not that much difference between the mouse lines, and the data is from very early development, so may not be that much expressed.

FelixKrueger commented 5 years ago

Hi Cristopher,

When I generated a C3H genome over here, I am getting the following stats:

SNP position summary for strain C3H_HeH (based on mouse genome build GRCm38)
===========================================================================

Positions read in total:        78772544

5084521 SNP were homozygous. Of these:
4813792 SNP were homozygous and passed high confidence filters and were thus included into the C3H_HeH genome

Not included into C3H_HeH genome:
72218381        had the same sequence as the reference
4808            had no clearly defined alternative base
1464834         Calls were neither 0/0 (same as reference) or 1/1, 2/2, 3/3 (homozygous SNP)
270729          were homozygous but the filtering call was low confidence

Printed a single list of all SNPs to >all_SNPs_C3H_HeH_GRCm38.txt.gz<...

So the total number of SNPs incorporated is ~ 4.8 million, I am assuming that the C3H_HeJ strain will probably quite similar to this. I trust that your STAR pipeline worked well overall, and in conjunction with SNPsplit (we personally don't run STAR over here)?

In my experience, the majority of SNPs fall into intergenic, or intronic regions (which kind of makes sense as you do can't afford to change too many coding sequences), meaning that RNA-seq is often a little underwhelming when it comes to allele-specific reads.

I just checked the SNPs of C3H quickly, and did indeed find that ~2.2M SNPs fall into 'gene' annotations, and only ~165,000 SNPs overlap with exons. If you combine this fairly low overall number of SNPs that can be interrogated with RNA-seq with relatively short sequencing reads, such as 50bp single-end, I wouldn't be too surprised to get a quite low number of reads that can be classified as allele-specific.

cap76 commented 5 years ago

Thanks @FelixKrueger - the numbers look similar to mine. The STAR pipeline seems to work fine, so I think the results are the results - there's just a low number of reads that can be classified as allele-specific in this instance. Thanks for your help as always :-)

FelixKrueger commented 5 years ago

I'm glad this agrees with what you are seeing. Best, Felix