How to use the SNPsplit for higher version of VCF

FelixKrueger / SNPsplit

Allele-specific alignment sorting

http://felixkrueger.github.io/SNPsplit/

GNU General Public License v3.0

53 stars 20 forks source link

How to use the SNPsplit for higher version of VCF #44

Closed Hemantcnaik closed 3 years ago

Hemantcnaik commented 3 years ago

Hello, Great tool I am trying with the v6 version of VCF( ftp://ftp-mouse.sanger.ac.uk/REL-1807-SNPs_Indels/mgp.v6.merged.norm.snp.indels.sfiltered.vcf.gz ) because of desired genome its not present in V5,

I tried with seperate vcf file for 129S1 and JF1 and tried to followed steps in issues #31 , #41 still get empty files as a output command I tried as below

SNPsplit_genome_preparation --reference /media//SNPsplit/ --skip_filtering --dual_hybrid --strain 129S1_SvImJ --strain2 JF1_MsJ

output

Summary 0 Ns were newly introduced into the N-masked genome for strain 2 [JF1_MsJ] in total Determining new Ref [129S1_SvImJ] and SNP [JF1_MsJ] annotations

Writing 129S1_SvImJ specific SNPs (relative to the GRCm38 reference) to >>129S1_SvImJ_specific_SNPs.GRCm38.txt<< Writing JF1_MsJ specific SNPs (relative to the GRCm38 reference) to >>JF1_MsJ_specific_SNPs.GRCm38.txt<< Writing SNPs in common between 129S1_SvImJ and JF1_MsJ (relative to the GRCm38 reference) to >>129S1_SvImJ_JF1_MsJ_SNPs_in_common.GRCm38.txt<< Writing all new SNPs >>129S1_SvImJ/JF1_MsJ to >>all_JF1_MsJ_SNPs_129S1_SvImJ_reference.based_on_GRCm38.txt<<

Expected SNP file [all_SNPs_129S1_SvImJ_GRCm38.txt.gz] for strain 129S1_SvImJ did not exist! Please make sure that it is present in the current working directory

FelixKrueger commented 3 years ago

Hi Hemant,

If I understand you correctly, you've made up the SNP files for strain JF1_MsJ yourself? In order to get the required files for 129S1_SvImJ you should be able to simply run SNPsplit for the just that strain (with --full_sequence) which is required as the base genome for the second strain.

In theory, as long as you've got the latest version of SNPsplit, it should work pretty much as described here (#41):

I theory you should be able use this information as to construct a genome, but you will need to adhere to the same rules that the script uses internally.

First of all, you cannot use / characters in names as it would probably do bad things. I would recommend an underscore _ instead, so maybe:

--strain JF1_Ms

Next, the SNPs will need to be contained within a folder called:

SNPs_JF1_Ms

inside there, the script expects one file per chromosome, e.g. like so:

chr10.txt
chr11.txt
chr12.txt
chr13.txt
chr14.txt
chr15.txt
chr16.txt
chr17.txt
chr18.txt
chr19.txt
chr1.txt
chr2.txt
chr3.txt
chr4.txt
chr5.txt
chr6.txt
chr7.txt
chr8.txt
chr9.txt
chrMT.txt
chrX.txt
chrY.txt

So you will have to split your tsv file by chromosome. And finally, these file need to adhere to the following format:

>10
42459235        10      3101362 1       G/C     1/1:101:31:0:284,101,0:255,90,0:2:51:31:0,0,22,9:0:-0.69311:.:1
42459276        10      3102112 1       T/C     1/1:127:50:0:288,164,0:255,151,0:2:44:50:0,0,22,28:0:-0.693147:.:1
42459298        10      3102661 1       C/G     1/1:127:59:0:290,192,0,.,.,.:255,178,0,.,.,.:2:48:59:0,0,34,25:0:-0.693147:.:1
42459325        10      3103479 1       A/G     1/1:127:49:0:288,161,0:255,148,0:2:57:49:0,0,17,32:0:-0.693147:.:1
...

where the first line is the name of the chromosome (like in a FastA file).

this is also explained in the option:

--skip_filtering              This option skips reading and filtering the VCF file. This assumes that a folder named
                              'SNPs_<Strain_Name>' exists in the working directory, and that text files with SNP information
                              are contained therein in the following format:
                                      SNP-ID     Chromosome  Position    Strand   Ref/SNP
                          example:   33941939        9       68878541       1       T/G

I believe column 6 is optional, but the first 5 need to be present. It is sufficient to put 1 as strand if the sequence is based relative to the top strand.

So do you have this SNP_folder? Do the files in there look like described above?

I just checked the Sanger FTP site and noticed that there is an even newer version of the MGP file, version 7 (ftp://ftp-mouse.sanger.ac.uk/REL-2004-v7-SNPs_Indels/mgp_REL2005_snps_indels.vcf.gz). Initially we agreed to wait with adapting the genome preparation script until there is a new, official release (in Current_SNPs, version 5 is still the one listed). There hasn't been an official new release since 2015 though, and v7 is from May 2020. Maybe it would be worth for me to take a look at whether it is feasiblet to support the v7 file, what do you think? mgp_REL2005_snps_indels.vcf.gz

Hemantcnaik commented 3 years ago

Hi Felix

Thank you for the reply...

I have the SNPs_JF1_MsJ and SNPs_129S1_SvImJ folder as mentioned format, may be one mistake is that base genome I have used is C57 reference genome(GRCm38). How to get the 129S1 as base genome can I create a in-silico genome incorporating a homologous SNPs to GRCm38 genome is it Okay or is there any other way.

SNP_folder file should contain heterogeneous SNPs or what? because I just split the v6 VCF file and created the format as u mentioned above

I will check the v7 vcf file once and get back to you

Thank You

FelixKrueger commented 3 years ago

I am currently taking a look at the v7 file (mgp_REL2005_snps_indels.vcf.gz)

from what I can see already it looks like the information we need is all there:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##bcftoolsVersion=1.9+htslib-1.9
##bcftoolsCommand=mpileup -Ou -g 10 -a FORMAT/DP,FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/SP,INFO/AD -E -Q 0 -pm 3 -F 0.25 -d 500 -f Mus_musculus.GRCm38.68.dna.toplevel.fa
##bcftools_callCommand=call -mAv -f GQ,GP -p 0.99
##reference=ftp://ftp-mouse.sanger.ac.uk/ref/GRCm38_68.fa
##contig=<ID=1,length=195471971>
##contig=<ID=10,length=130694993>
##contig=<ID=11,length=122082543>
##contig=<ID=12,length=120129022>
##contig=<ID=13,length=120421639>
##contig=<ID=14,length=124902244>
##contig=<ID=15,length=104043685>
##contig=<ID=16,length=98207768>
##contig=<ID=17,length=94987271>
##contig=<ID=18,length=90702639>
##contig=<ID=19,length=61431566>
##contig=<ID=2,length=182113224>
##contig=<ID=3,length=160039680>
##contig=<ID=4,length=156508116>
##contig=<ID=5,length=151834684>
##contig=<ID=6,length=149736546>
##contig=<ID=7,length=145441459>
##contig=<ID=8,length=129401213>
##contig=<ID=9,length=124595110>
##contig=<ID=X,length=171031299>
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=AD,Number=R,Type=Integer,Description="Total allelic depths">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=MinDP,Number=1,Type=Integer,Description="Minimum per-sample depth in this gVCF block">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Number of high-quality bases">
##FORMAT=<ID=ADF,Number=R,Type=Integer,Description="Allelic depths on the forward strand">
##FORMAT=<ID=ADR,Number=R,Type=Integer,Description="Allelic depths on the reverse strand">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Phred-scaled Genotype Quality">
##FORMAT=<ID=FI,Number=1,Type=Integer,Description="Whether a sample was a Pass(1) or fail (0) based on FILTER values">
##FILTER=<ID=LowConfidence,Description="Set if not true: FORMAT/FI[*]=1">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##VEP="v97" time="2020-05-08 02:30:32" cache="/nfs/production/mousegenomes/bin/ensembl-vep-release-97.2/mus_musculus/97_GRCm38" ensembl-funcgen=97.24f4d3c ensembl-variation=97.e9e14e5 ensembl=97.378db18 ensembl-io=97.dc917e1 assembly="GR
Cm38.p6" dbSNP="150" gencode="GENCODE M22" genebuild="2012-07" regbuild="1.0" sift="sift"
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Ami
no_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|SIFT">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  129P2_OlaHsd    129S1_SvImJ     129S5SvEvBrd    A_J     AKR_J   B10.RIII        BALB_cByJ       BALB_cJ BTBR_T+_Itpr3tf_J       BUB_BnJ C3H_HeH C3H_HeJ C57BL_10J   C57BL_10SnJ      C57BL_6NJ       C57BR_cdJ       C57L_J  C58_J   CAST_EiJ        CBA_J   CE_J    CZECHII_EiJ     DBA_1J  DBA_2J  FVB_NJ  I_LnJ   JF1_MsJ KK_HiJ  LEWES_EiJ       LG_J    LP_J    MA_MyJ  MOLF_EiJ        NOD_ShiLtJ      NON_L
tJ      NZB_B1NJ        NZO_HlLtJ       NZW_LacJ        PL_J    PWK_PhJ QSi3    QSi5    RF_J    RIIIS_J SEA_GnJ SJL_J   SM_J    SPRET_EiJ       ST_bJ   SWR_J   WSB_EiJ ZALENDE_EiJ

I'll take a quick look to see how hard it would be to adapt the process...

Hemantcnaik commented 3 years ago

Okay I will check with this vcf file

FelixKrueger commented 3 years ago

I have now updated SNPsplit_genome_preparation, and while it is still running as we speak, I think it might be working! (addressed here: 9a7c5b1b47e1fe1f95ee2c97275109b01f1444c7).

Can you please download the v7 version of the VCF file, clone the latest development version of SNPsplit (git clone https://github.com/FelixKrueger/SNPsplit.git), and then try this command:

SNPsplit_genome_preparation --v7 --vcf mgp_REL2005_snps_indels.vcf.gz --ref  /Genomes/Mouse/GRCm38/ --strain 129S1_SvImJ --strain2 JF1_MsJ

I will update again once the process is complete.

FelixKrueger commented 3 years ago

It seems to have worked, just like that!

Strain 1 (129S1_SvImJ)

SNP position summary for strain 129S1_SvImJ (based on genome build GRCm38)
===========================================================================

Positions read in total:        92510146

12641182        Positions were INDELs (and hence skipped)
5521403 SNP were homozygous. Of these:
5220884 SNP were homozygous and passed high confidence filters and were thus included into the 129S1_SvImJ genome

Not included into 129S1_SvImJ genome:
73117174        had the same sequence as the reference
0               had no clearly defined alternative base
1230387         Calls were neither 0/0 (same as reference) or 1/1, 2/2, 3/3 (homozygous SNP)
300519          were homozygous but the filtering call was low confidence

Now printing a single list of all SNPs to >all_SNPs_129S1_SvImJ_GRCm38.txt.gz<...

Strain 1 (JF1_MsJ)

SNP position summary for strain JF1_MsJ (based on genome build GRCm38)
===========================================================================

Positions read in total:        92510146

12641182        Positions were INDELs (and hence skipped)
21075575        SNP were homozygous. Of these:
19626002        SNP were homozygous and passed high confidence filters and were thus included into the JF1_MsJ genome

Not included into JF1_MsJ genome:
55512115        had the same sequence as the reference
0               had no clearly defined alternative base
3281274         Calls were neither 0/0 (same as reference) or 1/1, 2/2, 3/3 (homozygous SNP)
1449573         were homozygous but the filtering call was low confidence

Now printing a single list of all SNPs to >all_SNPs_JF1_MsJ_GRCm38.txt.gz<...

Dual-hybrid generation

Looked at positions from new Reference strain [129S1_SvImJ]:            5220884
Compared positions from new SNP strain [JF1_MsJ]:               19626002
======================================================
SNPs were the same in Ref and SNP genome (not written out):     2385893
SNPs were present in both Ref and SNP genome but had a different sequence:      16400
SNPs were low confidence in one strain and thus ignored:        442915
SNPs were unique to Ref [129S1_SvImJ]:                          2521111
SNPs were unique to SNP [JF1_MsJ]:                              17078274

Changing the genomic reference sequence to the full sequence of strain 129S1_SvImJ

Reading and storing all new SNPs with Ref/SNP: 129S1_SvImJ/JF1_MsJ from 'all_JF1_MsJ_SNPs_129S1_SvImJ_reference.based_on_GRCm38.txt'

Summary
19615785 Ns were newly introduced into the N-masked genome for strainstrain 2 [129S1_SvImJJF1_MsJ] in total
19615785 SNPs were newly introduced into the full sequence genome version for strainstrain 2 [129S1_SvImJ/JF1_MsJ] in total

All it needs now should be indexing of the N-masked genome using your favourite aligner (except if your favourite aligner is BWA...), and SNP is your uncle!

Hemantcnaik commented 3 years ago

Hello Felix

Thank you, for your support It ran smoothly.

I am indexing data from the folder 129S1_SvImJ_JF1_MsJ_dual_hybrid.based_on_GRCm38_full_sequence,
Aligner I am using bowtie2 its a paired end data Next SNPsplit for 129S1 SNPsplit --paired --snp_file N_masked_data/all_SNPs_129S1_SvImJ_GRCm38.txt.gz --input data1.bam -o Result_129S1

for JF1 SNPsplit --paired --snp_file N_masked_data/all_SNPs_129S1_SvImJ_GRCm38.txt.gz --input data1.bam -o Result_JF1

The above code is looking fine or is there any changes I have to do Thank you

About alignment I have used bowtie2 with default parameter if you experience with it is there any specific parameter I can use please mention its a ChIP-seq data set

Thank you

FelixKrueger commented 3 years ago

Hi Hemant,

No, you will need to index the files in folder 129S1_SvImJ_JF1_MsJ_dual_hybrid.based_on_GRCm38_N-masked, as you will need to align against an N-masked genome (and not the full name). A command for indexing should look like this:

bowtie2-build --threads 4 chr10.N-masked.fa,chr11.N-masked.fa,chr12.N-masked.fa,chr13.N-masked.fa,chr14.N-masked.fa,chr15.N-masked.fa,chr16.N-masked.fa,chr17.N-masked.fa,chr18.N-masked.fa,chr19.N-masked.fa,chr1.N-masked.fa,chr2.N-masked.fa,chr3.N-masked.fa,chr4.N-masked.fa,chr5.N-masked.fa,chr6.N-masked.fa,chr7.N-masked.fa,chr8.N-masked.fa,chr9.N-masked.fa,chrMT.N-masked.fa,chrX.N-masked.fa,chrY.N-masked.fa 129S1_SvImJ_JF1_MsJ_dual_hybrid_N-masked

Then you will only need a single command to split the data allele-specifically, something like this:

SNPsplit --snp_file all_JF1_MsJ_SNPs_129S1_SvImJ_reference.based_on_GRCm38.txt your_aligned_file.bam

In theory, SNPsplit should detect that the file is paired-end. The files in ...genome1.bam will be specific for ...129S1_SvImJ, and genome2.bam will be specific to JF1_MsJ.

FelixKrueger commented 3 years ago

We personally add --no-discordant --no-mixed to our Bowtie2 alignments so that only concordant paired-end alignments are output, but that is a matter of personal preference. Good luck!

Hemantcnaik commented 3 years ago

Thank you, for your suggestions ....

FelixKrueger commented 3 years ago

Closing this to prepare a new release.