FelixKrueger / SNPsplit

Allele-specific alignment sorting
http://felixkrueger.github.io/SNPsplit/
GNU General Public License v3.0
51 stars 19 forks source link

SNPsplit for mouse RNA-seq data #59

Closed Lu-103 closed 2 years ago

Lu-103 commented 2 years ago

Dear Felix,

Hi, thanks for this tool. I have been trying to use it to analyze mouse RNA-seq data, however, this is what I always get for the report:

SNP coverage report

SNP annotation file: /DBA_2J.mgp.v5.snps.dbSNP142.vcf.gz SNPs stored in total: 5872463 N-containing reads: 1497126 non-N: 29627111 total: 53183660 Reads had a deletion of the N-masked position (and were thus called Conflicting): 2227 (0.00%) Of which had multiple deletions of N-masked positions within the same read: 1 (0.00%)

Of valid N containing reads, N was present in the list of known SNPs: 0 (0.00%) N was not present in the list of SNPs: 1741728 (100.00%)

Allele-specific paired-end sorting report

Read pairs/singletons processed in total: 15563232 thereof were read pairs: 15563232 thereof were singletons: 0 Reads were unassignable (not overlapping SNPs): 15561496 (99.99%) thereof were read pairs: 15561496 thereof were singletons: 0 Reads were specific for genome 1: 0 (0.00%) thereof were read pairs: 0 thereof were singletons: 0 Reads were specific for genome 2: 0 (0.00%) thereof were read pairs: 0 thereof were singletons: 0 Reads contained conflicting SNP information: 1736 (0.01%) thereof were read pairs: 1736 thereof were singletons: 0

I manually checked some of the RNA-seq data and found there are quite some SNP positions overlapping. I am wondering if you have any clues that might cause this problem since I am a beginner in genome data analysis and it's very likely that I may make some obvious mistakes. Some additional info that might be helpful: I used genome preparation to generate the N-masked genome, and have tried both mm9 and mm10. (SNP annotation file used: DBA_2J.mgp.v5.snps.dbSNP142.vcf.gz). I have tried both bowtie2 and hisat2 for mapping RNA-seq data to the N-masked ref genome.

Thank you so much and look forward to your replies! Lu

FelixKrueger commented 2 years ago

Hi Lu,

The issue seems to be the following:

N was not present in the list of SNPs: 1741728 (100.00%)

So it seems that the SNP file you are using doesn't match with the aligned bam files. A relatively simple explanation would be if you were using a SNP file from based on the VCF file that is based on the Sanger/Ensembl nomenclature, and a genome based on the NBCI/UCSC nomenclature, is that a possibility? The issue here might be that chromosomes are called chr1, chr2, chrX, chrMT for Ensembl, and 1, 2, X and M for NCBI.

Can you try downloading the genome again from Ensembl here:

ftp://ftp.ensembl.org/pub/release-102/fasta/mus_musculus/dna/

and try again (please make sure it is the GRCm38 genome build (mm10)). Let me know how you get on.

Lu-103 commented 2 years ago

Hi Felix,

Thank you for pointing it out, I actually noticed the mismatch between chromosome names while doing genome preparation. At first, it was showing no Ns were introduced. The way I solved it was that I changed the names in my reference genome file, and the report was showing: Summary 1327448 Ns were newly introduced into the N-masked genome for strain DBA_2J in total So I assumed that this problem was solved and I went on to do the mapping and SNPsplit.

Would you think this way solved the problem you mentioned above? If not, I will try to download the genome from Ensembl.

Thank you so much for the immediate reply! Lu

FelixKrueger commented 2 years ago

Hmm, something seems to have gone awry along the way...

The name change you introduced lead to 1327448 positions being N-masked, yet above you showed:

SNP annotation file: /DBA_2J.mgp.v5.snps.dbSNP142.vcf.gz
SNPs stored in total: 5,872,463

As a side note, the SNP annotation file produced by SNPsplit_genome_preparation should not end in .vcf.gz but should be something like: all_SNPs_DBA_2J_GRCm38.txt.gz

If you take the SNP file mgp.v5.merged.snps_all.dbSNP142.vcf.gz as well as the GRCm38 genome from the previous note, you should get this:

SNP position summary for strain DBA_2J (based on mouse genome build GRCm38)
===========================================================================

Positions read in total:    78772544

5391236 SNP were homozygous. Of these:
5184367 SNP were homozygous and passed high confidence filters and were thus included into the DBA_2J genome
...

Also note that it is now also possible to use the latest version of the SNP annations (from May2020) by using the file mgp_REL2005_snps_indels.vcf.gz and --v7. More info on that here.

FelixKrueger commented 2 years ago

Maybe to clarify, if you consistently modify the genome sequence, the VCF file, the SNP annotation file and the Bowtie2/HISAT2 N-masked indexes, it should also work with the NCBI genome. Then again, you might just use the Ensembl version (which is the same genome), and it should all work straight out of the box.

Lu-103 commented 2 years ago

My report from genome preparation seems fine:

SNP position summary for strain DBA_2J (based on genome build GRCm38)

Positions read in total: 5872394

5391236 SNP were homozygous. Of these: 5184367 SNP were homozygous and passed high confidence filters and were thus included into the DBA_2J genome

To make sure I understand it correctly, the SNP annotation file used for SNPsplit should be a txt.gz file generated by SNPsplit_genome_preparation? If so, I might use the wrong file.

Thanks for the clarification, I think I make the situation unnecessarily complicated, I will try again by using the Ensembl version if correcting the txt.gz file mistake could not solve the problem. Sorry for making this so confusing... Lu

FelixKrueger commented 2 years ago

:P yes indeed. I believe by default it is just a .txt file (not gzip compressed), but yes, it is NOT the VCF file:

--snp_file <file>      Mandatory file specifying SNP positions to be considered; may be a plain text file or
                           gzip compressed. Currently, the SNP file is expected to be in the following format:

                                       SNP-ID     Chromosome  Position    Strand   Ref/SNP
                           example:   33941939        9       68878541       1       T/G

                           Only the information contained in fields 'Chromosome', 'Position' and 'Reference/SNP base'
                           are being used for analysis. The genome referred to as 'Ref' will be used as genome 1,
                           the genome containing the 'SNP' base as genome 2.
Lu-103 commented 2 years ago

Thank you so much for your help! I used the txt file, and I think it is working now! Lu

FelixKrueger commented 2 years ago

That's excellent, let me know if you got some sensible results!