FelixKrueger / SNPsplit

Allele-specific alignment sorting
http://felixkrueger.github.io/SNPsplit/
GNU General Public License v3.0
52 stars 20 forks source link

How to use a self prepared VCF ? #25

Closed hmyh1202 closed 5 years ago

hmyh1202 commented 5 years ago

I had do WGS for my samples, and how to reform the vcf that suit for the SNPsplit ?

FelixKrueger commented 5 years ago

While I'm afraid I can't offer a service to convert arbitrary files into a usable format, I can quickly let you knw what the your options are:

  1. You could bring your VCF file, and popssibly SNPsplit_genome_preparation, into a format that looks pretty much the same as the VCF file provided by the Mouse Genomes Project. The steps are described in some detail here: https://github.com/FelixKrueger/SNPsplit/blob/master/SNPsplit_User_Guide.md#filtering-and-processing-high-confidence-snps-from-the-vcf-file. I think you will need the strain name in a line starting with #CHROM, the chromosomes as ##contig= <ID=... fields, and finally you will need to adjust the STRAIN field so that SNPsplit_genome_preparation can find the GT and FI fields. If you manage to do this, the genome preparation would work as a single command.

  2. The next option would be to skip reading the VCF file, and produce instead a subfolder containing plain text files with the SNP information for each chromosome. If you want to use SNPsplit_genome_preparation, the folder needs to be named SNPs_YOUR_STRAIN_NAME, so for CAST_EiJ it needs to look like this:

ls SNPs_CAST_EiJ
chr10.txt  chr11.txt  chr12.txt  chr13.txt  chr14.txt  chr15.txt  chr16.txt  chr17.txt  chr18.txt  chr19.txt  chr1.txt  chr2.txt  chr3.txt  chr4.txt  chr5.txt  chr6.txt  chr7.txt  chr8.txt  chr9.txt  chrMT.txt  chrX.txt  chrY.txt

The text files for the SNPs require this format:

>10
42459235        10      3101362 1       G/C     1/1:101:31:0:284,101,0:255,90,0:2:51:31:0,0,22,9:0:-0.69311:.:1
42459276        10      3102112 1       T/C     1/1:127:50:0:288,164,0:255,151,0:2:44:50:0,0,22,28:0:-0.693147:.:1
42459298        10      3102661 1       C/G     1/1:127:59:0:290,192,0,.,.,.:255,178,0,.,.,.:2:48:59:0,0,34,25:0:-0.693147:.:1
42459325        10      3103479 1       A/G     1/1:127:49:0:288,161,0:255,148,0:2:57:49:0,0,17,32:0:-0.693147:.:1

If you have such files, you can run SNPsplit_genome_preparation with the option --skip_filtering:

--skip_filtering              This option skips reading and filtering the VCF file. This assumes that a folder named
                              'SNPs_<Strain_Name>' exists in the working directory, and that text files with SNP information
                              are contained therein in the following format:

                                          SNP-ID     Chromosome  Position    Strand   Ref/SNP
                              example:   33941939        9       68878541       1       T/G
  1. And lastly, you could 'simply' create your own version of an N-masked genome in any other way. The only thing you will need to give to SNPsplit later is a list of all SNP calls it might encounter. The format is the same as the one above. See also here: https://github.com/FelixKrueger/SNPsplit/blob/master/SNPsplit_User_Guide.md#3-storing-snp-positions.

I hope this helps.

FelixKrueger commented 5 years ago

The format of the SNP file looks alright at first glance (note that this is not a VCF file).

In order to use SNPsplit in conjunction with this SNP file, you will have to have constructed an N-masked genome first, which then must have been indexed with and used for the alignment step. If you have done that (I can't see this from your note), it should work indeed.

hmyh1202 commented 5 years ago

OK thank you very much.

FelixKrueger commented 5 years ago

I guess that would be the right order. If you are using deduplicate_bismark for the N-masked BAM file then the fiels are already in the correct format ("sorted by read name").

4 + 5: Yes, we tend to call the "VCF" file SNP file (as it really isn't a VCF file at all).

Good luck! Felix

FelixKrueger commented 5 years ago

So in your example, the reference sequence is A, and you've got a T for the paternal allele, and a G for the maternal allele. This means you will need to replace the A by an N, and write out that position as G/T in the SNP file.

Later on, when SNPsplit detects an N in the alignment it will look up whether the sequence was either G or T, and sort it to the maternal or paternal allele, respectively. If it finds A at the position (the reference sequence), I believe it will indeed just ignore that position.

If you have a position where only one parent has a SNP relative to the reference, then also N-mask that position, and add that position to the SNP file as REF/SNP or SNP/REF, depending on which allele the SNP was on.

Does that make sense?

hmyh1202 commented 5 years ago

Another question for alignment using N-masked genome index is should I adjust the parameters of bismark, or any other matters need attention. For I got a lower mapping ratio using N-mask genome index.

FelixKrueger commented 5 years ago

I suppose you could slightly increase the --score_min function compared to standard mapping, but N should only get a penalty of -1 (as opposed to -6 for other base mismatches), so it shouldn't be all that bad. Maybe --score_min L,0,-0.4?