Closed yonniejon closed 1 year ago
Happy New Year,
and apologies for the late reply. I think the the problem you are facing arises from the fact that your steps 1 and 2 need to be carried out in inverse order.
1) You need to run the SNPsplit genome preparation of the GRCm38 genome using the v5 VCF file. This will generate the N-masked version of the new dual hybrid genome (the folder name should be fairly telling).
2) You need to run STAR genomeGenerate
process on the new N-masked sequences.
3) Align your FastQ files to the N-masked STAR index, and then run SNPsplit. This should then all work.
PS: If you are already using the GRCm39 genome version, you should be using the v8 annotations to go with the latest genome version. This will soon be the default.
Thanks for the reply!
Ah, I see. Thanks. I will try that.
I am currently using GRCm38 and not 39. Thank you!
Just to make sure I am understanding correctly - SNPsplit_genome_preparation creates an output CAST_EiJ_FVB_NJ_dual_hybrid.based_on_GRCm38_N-masked where inside there are fasta files for each chromosome. I should use those fasta files as the input to Star's genome generate command?
I did this and I still got 0 reads processed. This is part of the SNPsplit output:
Allele-specific paired-end sorting report
=========================================
Read pairs/singletons processed in total: 5484876
thereof were read pairs: 5459714
thereof were singletons: 25162
Reads were unassignable (not overlapping SNPs): 5484876 (100.00%)
thereof were read pairs: 5459714
thereof were singletons: 25162
Reads were specific for genome 1: 0 (0.00%)
thereof were read pairs: 0
thereof were singletons: 0
Reads were specific for genome 2: 0 (0.00%)
thereof were read pairs: 0
thereof were singletons: 0
Reads contained conflicting SNP information: 0 (0.00%)
thereof were read pairs: 0
thereof were singletons: 0
Sorting finished successfully
Appending sorting report to tagging report...
SNPsplit processing finished... [Allele-tagging + tag2sort]
I think the problem might be in the SNPsplit genome prepare command?
For both the CAST and the FVB strain the genome_preparation_report.txt looks like:
cat CAST_EiJ_genome_preparation_report.txt
Summary
0 Ns were newly introduced into the N-masked genome for strain CAST_EiJ in total
0 SNPs were newly introduced into the full sequence genome version for strain CAST_EiJ in total
Summary
0 Ns were newly introduced into the N-masked genome for strain 2 [FVB_NJ] in total
0 SNPs were newly introduced into the full sequence genome version for strain 2 [FVB_NJ] in total
Yes, that would certainly be a reason.... Which eventually leads to:
SNP coverage report
===================
SNP annotation file: GRCm38_reference/all_FVB_NJ_SNPs_CAST_EiJ_reference.based_on_GRCm38.txt
SNPs stored in total: 20581027
N-containing reads: 0
total: 11872436
Can you link the commands you used, and maybe attach the log files? Usually when this happens people do not use the Ensembl genome but an NCBI/UCSC genome (where chromosome names are different). Is there a chance this could be true in your case?
Yes. that is the case. I used the genome from gencode here https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M23/GRCm38.primary_assembly.genome.fa.gz
But I thought it was ok because I downloaded the vcf file from the mouse genomes project linked in the documentation of SNPsplit here and then I added the "chr" prefix to all values in the chromosome column. But I will retry everything with an Ensembl genome. and the original VCF file.
good luck! Let me know how you get on.
It worked! Thank you very much and apologies for the naive questions/issues.
Excellent!
It worked! Thank you very much and apologies for the naive questions/issues.
Hi, I met the same problem. How did you solve it in the end? Which version of the genome did you ultimately choose? Thanks a lot!!
The steps are shown here: https://felixkrueger.github.io/SNPsplit/workflow/
(with a link to the genome to download under point 2)). I hope this helps?
The steps are shown here: https://felixkrueger.github.io/SNPsplit/workflow/
(with a link to the genome to download under point 2)). I hope this helps?
It helps a lot, thank you. I changed the name of the chromosome in the reference genome to match the vcf file, and it worked.
These are the commands I am running:
The bam files are not empty and I have manually investigated various heterozygous sites in the sample using IGV. But the bam files generated by SNPsplit are empty and the SNPsplit report looks like:
The output from SNPsplit itself looks like:
Any idea why I am getting 0 reads processed? Not sure where to start