Closed Lu-103 closed 2 years ago
Hi Lu,
The issue seems to be the following:
N was not present in the list of SNPs: 1741728 (100.00%)
So it seems that the SNP file you are using doesn't match with the aligned bam files. A relatively simple explanation would be if you were using a SNP file from based on the VCF file that is based on the Sanger/Ensembl nomenclature, and a genome based on the NBCI/UCSC nomenclature, is that a possibility? The issue here might be that chromosomes are called chr1, chr2, chrX, chrMT
for Ensembl, and 1, 2, X and M
for NCBI.
Can you try downloading the genome again from Ensembl here:
ftp://ftp.ensembl.org/pub/release-102/fasta/mus_musculus/dna/
and try again (please make sure it is the GRCm38 genome build (mm10)). Let me know how you get on.
Hi Felix,
Thank you for pointing it out, I actually noticed the mismatch between chromosome names while doing genome preparation. At first, it was showing no Ns were introduced. The way I solved it was that I changed the names in my reference genome file, and the report was showing: Summary 1327448 Ns were newly introduced into the N-masked genome for strain DBA_2J in total So I assumed that this problem was solved and I went on to do the mapping and SNPsplit.
Would you think this way solved the problem you mentioned above? If not, I will try to download the genome from Ensembl.
Thank you so much for the immediate reply! Lu
Hmm, something seems to have gone awry along the way...
The name change you introduced lead to 1327448 positions being N-masked, yet above you showed:
SNP annotation file: /DBA_2J.mgp.v5.snps.dbSNP142.vcf.gz
SNPs stored in total: 5,872,463
As a side note, the SNP annotation file produced by SNPsplit_genome_preparation
should not end in .vcf.gz
but should be something like: all_SNPs_DBA_2J_GRCm38.txt.gz
If you take the SNP file mgp.v5.merged.snps_all.dbSNP142.vcf.gz
as well as the GRCm38 genome from the previous note, you should get this:
SNP position summary for strain DBA_2J (based on mouse genome build GRCm38)
===========================================================================
Positions read in total: 78772544
5391236 SNP were homozygous. Of these:
5184367 SNP were homozygous and passed high confidence filters and were thus included into the DBA_2J genome
...
Also note that it is now also possible to use the latest version of the SNP annations (from May2020) by using the file mgp_REL2005_snps_indels.vcf.gz
and --v7
. More info on that here.
Maybe to clarify, if you consistently modify the genome sequence, the VCF file, the SNP annotation file and the Bowtie2/HISAT2 N-masked indexes, it should also work with the NCBI genome. Then again, you might just use the Ensembl version (which is the same genome), and it should all work straight out of the box.
My report from genome preparation seems fine:
Positions read in total: 5872394
5391236 SNP were homozygous. Of these: 5184367 SNP were homozygous and passed high confidence filters and were thus included into the DBA_2J genome
To make sure I understand it correctly, the SNP annotation file used for SNPsplit should be a txt.gz file generated by SNPsplit_genome_preparation? If so, I might use the wrong file.
Thanks for the clarification, I think I make the situation unnecessarily complicated, I will try again by using the Ensembl version if correcting the txt.gz file mistake could not solve the problem. Sorry for making this so confusing... Lu
:P yes indeed. I believe by default it is just a .txt file (not gzip compressed), but yes, it is NOT the VCF file:
--snp_file <file> Mandatory file specifying SNP positions to be considered; may be a plain text file or
gzip compressed. Currently, the SNP file is expected to be in the following format:
SNP-ID Chromosome Position Strand Ref/SNP
example: 33941939 9 68878541 1 T/G
Only the information contained in fields 'Chromosome', 'Position' and 'Reference/SNP base'
are being used for analysis. The genome referred to as 'Ref' will be used as genome 1,
the genome containing the 'SNP' base as genome 2.
Thank you so much for your help! I used the txt file, and I think it is working now! Lu
That's excellent, let me know if you got some sensible results!
Dear Felix,
Hi, thanks for this tool. I have been trying to use it to analyze mouse RNA-seq data, however, this is what I always get for the report:
SNP coverage report
SNP annotation file: /DBA_2J.mgp.v5.snps.dbSNP142.vcf.gz SNPs stored in total: 5872463 N-containing reads: 1497126 non-N: 29627111 total: 53183660 Reads had a deletion of the N-masked position (and were thus called Conflicting): 2227 (0.00%) Of which had multiple deletions of N-masked positions within the same read: 1 (0.00%)
Of valid N containing reads, N was present in the list of known SNPs: 0 (0.00%) N was not present in the list of SNPs: 1741728 (100.00%)
Allele-specific paired-end sorting report
Read pairs/singletons processed in total: 15563232 thereof were read pairs: 15563232 thereof were singletons: 0 Reads were unassignable (not overlapping SNPs): 15561496 (99.99%) thereof were read pairs: 15561496 thereof were singletons: 0 Reads were specific for genome 1: 0 (0.00%) thereof were read pairs: 0 thereof were singletons: 0 Reads were specific for genome 2: 0 (0.00%) thereof were read pairs: 0 thereof were singletons: 0 Reads contained conflicting SNP information: 1736 (0.01%) thereof were read pairs: 1736 thereof were singletons: 0
I manually checked some of the RNA-seq data and found there are quite some SNP positions overlapping. I am wondering if you have any clues that might cause this problem since I am a beginner in genome data analysis and it's very likely that I may make some obvious mistakes. Some additional info that might be helpful: I used genome preparation to generate the N-masked genome, and have tried both mm9 and mm10. (SNP annotation file used: DBA_2J.mgp.v5.snps.dbSNP142.vcf.gz). I have tried both bowtie2 and hisat2 for mapping RNA-seq data to the N-masked ref genome.
Thank you so much and look forward to your replies! Lu