Closed Hemantcnaik closed 3 years ago
Hi Hemant,
If I understand you correctly, you've made up the SNP files for strain JF1_MsJ
yourself? In order to get the required files for 129S1_SvImJ you should be able to simply run SNPsplit for the just that strain (with --full_sequence
) which is required as the base genome for the second strain.
In theory, as long as you've got the latest version of SNPsplit, it should work pretty much as described here (#41):
I theory you should be able use this information as to construct a genome, but you will need to adhere to the same rules that the script uses internally.
First of all, you cannot use / characters in names as it would probably do bad things. I would recommend an underscore _ instead, so maybe:
--strain JF1_Ms
Next, the SNPs will need to be contained within a folder called:
SNPs_JF1_Ms
inside there, the script expects one file per chromosome, e.g. like so:
chr10.txt
chr11.txt
chr12.txt
chr13.txt
chr14.txt
chr15.txt
chr16.txt
chr17.txt
chr18.txt
chr19.txt
chr1.txt
chr2.txt
chr3.txt
chr4.txt
chr5.txt
chr6.txt
chr7.txt
chr8.txt
chr9.txt
chrMT.txt
chrX.txt
chrY.txt
So you will have to split your tsv file by chromosome. And finally, these file need to adhere to the following format:
>10
42459235 10 3101362 1 G/C 1/1:101:31:0:284,101,0:255,90,0:2:51:31:0,0,22,9:0:-0.69311:.:1
42459276 10 3102112 1 T/C 1/1:127:50:0:288,164,0:255,151,0:2:44:50:0,0,22,28:0:-0.693147:.:1
42459298 10 3102661 1 C/G 1/1:127:59:0:290,192,0,.,.,.:255,178,0,.,.,.:2:48:59:0,0,34,25:0:-0.693147:.:1
42459325 10 3103479 1 A/G 1/1:127:49:0:288,161,0:255,148,0:2:57:49:0,0,17,32:0:-0.693147:.:1
...
where the first line is the name of the chromosome (like in a FastA file).
this is also explained in the option:
--skip_filtering This option skips reading and filtering the VCF file. This assumes that a folder named
'SNPs_<Strain_Name>' exists in the working directory, and that text files with SNP information
are contained therein in the following format:
SNP-ID Chromosome Position Strand Ref/SNP
example: 33941939 9 68878541 1 T/G
I believe column 6 is optional, but the first 5 need to be present. It is sufficient to put 1 as strand if the sequence is based relative to the top strand.
So do you have this SNP_folder? Do the files in there look like described above?
I just checked the Sanger FTP site and noticed that there is an even newer version of the MGP file, version 7 (ftp://ftp-mouse.sanger.ac.uk/REL-2004-v7-SNPs_Indels/mgp_REL2005_snps_indels.vcf.gz). Initially we agreed to wait with adapting the genome preparation script until there is a new, official release (in Current_SNPs, version 5 is still the one listed). There hasn't been an official new release since 2015 though, and v7 is from May 2020. Maybe it would be worth for me to take a look at whether it is feasiblet to support the v7 file, what do you think?
mgp_REL2005_snps_indels.vcf.gz
Hi Felix
Thank you for the reply...
I have the SNPs_JF1_MsJ and SNPs_129S1_SvImJ folder as mentioned format, may be one mistake is that base genome I have used is C57 reference genome(GRCm38). How to get the 129S1 as base genome can I create a in-silico genome incorporating a homologous SNPs to GRCm38 genome is it Okay or is there any other way.
SNP_folder file should contain heterogeneous SNPs or what? because I just split the v6 VCF file and created the format as u mentioned above
I will check the v7 vcf file once and get back to you
Thank You
I am currently taking a look at the v7 file (mgp_REL2005_snps_indels.vcf.gz
)
from what I can see already it looks like the information we need is all there:
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##bcftoolsVersion=1.9+htslib-1.9
##bcftoolsCommand=mpileup -Ou -g 10 -a FORMAT/DP,FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/SP,INFO/AD -E -Q 0 -pm 3 -F 0.25 -d 500 -f Mus_musculus.GRCm38.68.dna.toplevel.fa
##bcftools_callCommand=call -mAv -f GQ,GP -p 0.99
##reference=ftp://ftp-mouse.sanger.ac.uk/ref/GRCm38_68.fa
##contig=<ID=1,length=195471971>
##contig=<ID=10,length=130694993>
##contig=<ID=11,length=122082543>
##contig=<ID=12,length=120129022>
##contig=<ID=13,length=120421639>
##contig=<ID=14,length=124902244>
##contig=<ID=15,length=104043685>
##contig=<ID=16,length=98207768>
##contig=<ID=17,length=94987271>
##contig=<ID=18,length=90702639>
##contig=<ID=19,length=61431566>
##contig=<ID=2,length=182113224>
##contig=<ID=3,length=160039680>
##contig=<ID=4,length=156508116>
##contig=<ID=5,length=151834684>
##contig=<ID=6,length=149736546>
##contig=<ID=7,length=145441459>
##contig=<ID=8,length=129401213>
##contig=<ID=9,length=124595110>
##contig=<ID=X,length=171031299>
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=AD,Number=R,Type=Integer,Description="Total allelic depths">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=MinDP,Number=1,Type=Integer,Description="Minimum per-sample depth in this gVCF block">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Number of high-quality bases">
##FORMAT=<ID=ADF,Number=R,Type=Integer,Description="Allelic depths on the forward strand">
##FORMAT=<ID=ADR,Number=R,Type=Integer,Description="Allelic depths on the reverse strand">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Phred-scaled Genotype Quality">
##FORMAT=<ID=FI,Number=1,Type=Integer,Description="Whether a sample was a Pass(1) or fail (0) based on FILTER values">
##FILTER=<ID=LowConfidence,Description="Set if not true: FORMAT/FI[*]=1">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##VEP="v97" time="2020-05-08 02:30:32" cache="/nfs/production/mousegenomes/bin/ensembl-vep-release-97.2/mus_musculus/97_GRCm38" ensembl-funcgen=97.24f4d3c ensembl-variation=97.e9e14e5 ensembl=97.378db18 ensembl-io=97.dc917e1 assembly="GR
Cm38.p6" dbSNP="150" gencode="GENCODE M22" genebuild="2012-07" regbuild="1.0" sift="sift"
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Ami
no_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|SIFT">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 129P2_OlaHsd 129S1_SvImJ 129S5SvEvBrd A_J AKR_J B10.RIII BALB_cByJ BALB_cJ BTBR_T+_Itpr3tf_J BUB_BnJ C3H_HeH C3H_HeJ C57BL_10J C57BL_10SnJ C57BL_6NJ C57BR_cdJ C57L_J C58_J CAST_EiJ CBA_J CE_J CZECHII_EiJ DBA_1J DBA_2J FVB_NJ I_LnJ JF1_MsJ KK_HiJ LEWES_EiJ LG_J LP_J MA_MyJ MOLF_EiJ NOD_ShiLtJ NON_L
tJ NZB_B1NJ NZO_HlLtJ NZW_LacJ PL_J PWK_PhJ QSi3 QSi5 RF_J RIIIS_J SEA_GnJ SJL_J SM_J SPRET_EiJ ST_bJ SWR_J WSB_EiJ ZALENDE_EiJ
I'll take a quick look to see how hard it would be to adapt the process...
Okay I will check with this vcf file
I have now updated SNPsplit_genome_preparation
, and while it is still running as we speak, I think it might be working! (addressed here: 9a7c5b1b47e1fe1f95ee2c97275109b01f1444c7).
Can you please download the v7 version of the VCF file, clone the latest development version of SNPsplit (git clone https://github.com/FelixKrueger/SNPsplit.git
), and then try this command:
SNPsplit_genome_preparation --v7 --vcf mgp_REL2005_snps_indels.vcf.gz --ref /Genomes/Mouse/GRCm38/ --strain 129S1_SvImJ --strain2 JF1_MsJ
I will update again once the process is complete.
It seems to have worked, just like that!
SNP position summary for strain 129S1_SvImJ (based on genome build GRCm38)
===========================================================================
Positions read in total: 92510146
12641182 Positions were INDELs (and hence skipped)
5521403 SNP were homozygous. Of these:
5220884 SNP were homozygous and passed high confidence filters and were thus included into the 129S1_SvImJ genome
Not included into 129S1_SvImJ genome:
73117174 had the same sequence as the reference
0 had no clearly defined alternative base
1230387 Calls were neither 0/0 (same as reference) or 1/1, 2/2, 3/3 (homozygous SNP)
300519 were homozygous but the filtering call was low confidence
Now printing a single list of all SNPs to >all_SNPs_129S1_SvImJ_GRCm38.txt.gz<...
SNP position summary for strain JF1_MsJ (based on genome build GRCm38)
===========================================================================
Positions read in total: 92510146
12641182 Positions were INDELs (and hence skipped)
21075575 SNP were homozygous. Of these:
19626002 SNP were homozygous and passed high confidence filters and were thus included into the JF1_MsJ genome
Not included into JF1_MsJ genome:
55512115 had the same sequence as the reference
0 had no clearly defined alternative base
3281274 Calls were neither 0/0 (same as reference) or 1/1, 2/2, 3/3 (homozygous SNP)
1449573 were homozygous but the filtering call was low confidence
Now printing a single list of all SNPs to >all_SNPs_JF1_MsJ_GRCm38.txt.gz<...
Looked at positions from new Reference strain [129S1_SvImJ]: 5220884
Compared positions from new SNP strain [JF1_MsJ]: 19626002
======================================================
SNPs were the same in Ref and SNP genome (not written out): 2385893
SNPs were present in both Ref and SNP genome but had a different sequence: 16400
SNPs were low confidence in one strain and thus ignored: 442915
SNPs were unique to Ref [129S1_SvImJ]: 2521111
SNPs were unique to SNP [JF1_MsJ]: 17078274
Changing the genomic reference sequence to the full sequence of strain 129S1_SvImJ
Reading and storing all new SNPs with Ref/SNP: 129S1_SvImJ/JF1_MsJ from 'all_JF1_MsJ_SNPs_129S1_SvImJ_reference.based_on_GRCm38.txt'
Summary
19615785 Ns were newly introduced into the N-masked genome for strainstrain 2 [129S1_SvImJJF1_MsJ] in total
19615785 SNPs were newly introduced into the full sequence genome version for strainstrain 2 [129S1_SvImJ/JF1_MsJ] in total
All it needs now should be indexing of the N-masked genome using your favourite aligner (except if your favourite aligner is BWA...), and SNP is your uncle!
Hello Felix
Thank you, for your support It ran smoothly.
I am indexing data from the folder 129S1_SvImJ_JF1_MsJ_dual_hybrid.based_on_GRCm38_full_sequence,
Aligner I am using bowtie2 its a paired end data
Next SNPsplit
for 129S1
SNPsplit --paired --snp_file N_masked_data/all_SNPs_129S1_SvImJ_GRCm38.txt.gz --input data1.bam -o Result_129S1
for JF1 SNPsplit --paired --snp_file N_masked_data/all_SNPs_129S1_SvImJ_GRCm38.txt.gz --input data1.bam -o Result_JF1
The above code is looking fine or is there any changes I have to do Thank you
About alignment I have used bowtie2 with default parameter if you experience with it is there any specific parameter I can use please mention its a ChIP-seq data set
Thank you
Hi Hemant,
No, you will need to index the files in folder 129S1_SvImJ_JF1_MsJ_dual_hybrid.based_on_GRCm38_N-masked
, as you will need to align against an N-masked genome (and not the full name). A command for indexing should look like this:
bowtie2-build --threads 4 chr10.N-masked.fa,chr11.N-masked.fa,chr12.N-masked.fa,chr13.N-masked.fa,chr14.N-masked.fa,chr15.N-masked.fa,chr16.N-masked.fa,chr17.N-masked.fa,chr18.N-masked.fa,chr19.N-masked.fa,chr1.N-masked.fa,chr2.N-masked.fa,chr3.N-masked.fa,chr4.N-masked.fa,chr5.N-masked.fa,chr6.N-masked.fa,chr7.N-masked.fa,chr8.N-masked.fa,chr9.N-masked.fa,chrMT.N-masked.fa,chrX.N-masked.fa,chrY.N-masked.fa 129S1_SvImJ_JF1_MsJ_dual_hybrid_N-masked
Then you will only need a single command to split the data allele-specifically, something like this:
SNPsplit --snp_file all_JF1_MsJ_SNPs_129S1_SvImJ_reference.based_on_GRCm38.txt your_aligned_file.bam
In theory, SNPsplit should detect that the file is paired-end. The files in ...genome1.bam
will be specific for ...129S1_SvImJ
, and genome2.bam
will be specific to JF1_MsJ
.
We personally add --no-discordant --no-mixed
to our Bowtie2 alignments so that only concordant paired-end alignments are output, but that is a matter of personal preference. Good luck!
Thank you, for your suggestions ....
Closing this to prepare a new release.
Hello, Great tool I am trying with the v6 version of VCF( ftp://ftp-mouse.sanger.ac.uk/REL-1807-SNPs_Indels/mgp.v6.merged.norm.snp.indels.sfiltered.vcf.gz ) because of desired genome its not present in V5,
I tried with seperate vcf file for 129S1 and JF1 and tried to followed steps in issues #31 , #41 still get empty files as a output command I tried as below
SNPsplit_genome_preparation --reference /media//SNPsplit/ --skip_filtering --dual_hybrid --strain 129S1_SvImJ --strain2 JF1_MsJ
output
Summary 0 Ns were newly introduced into the N-masked genome for strain 2 [JF1_MsJ] in total Determining new Ref [129S1_SvImJ] and SNP [JF1_MsJ] annotations
Writing 129S1_SvImJ specific SNPs (relative to the GRCm38 reference) to >>129S1_SvImJ_specific_SNPs.GRCm38.txt<< Writing JF1_MsJ specific SNPs (relative to the GRCm38 reference) to >>JF1_MsJ_specific_SNPs.GRCm38.txt<< Writing SNPs in common between 129S1_SvImJ and JF1_MsJ (relative to the GRCm38 reference) to >>129S1_SvImJ_JF1_MsJ_SNPs_in_common.GRCm38.txt<< Writing all new SNPs >>129S1_SvImJ/JF1_MsJ to >>all_JF1_MsJ_SNPs_129S1_SvImJ_reference.based_on_GRCm38.txt<<
Expected SNP file [all_SNPs_129S1_SvImJ_GRCm38.txt.gz] for strain 129S1_SvImJ did not exist! Please make sure that it is present in the current working directory