Some questions about working on plant species

Hello Felix,

I'm working on hybrid maize and also want to identify reads from specific genome. Is SNPsplit suitable for the species? I have mapped the paried-end reads to the maize genome by using Hisat2, and the call the alleles (stored in VCFv4.2 format) by freebayes. I followed the instruction you mentioned in the document. First, I prepared the genome:

SNPsplit_genome_preparation --vcf B73_Ki11_all.ref_B73.vcf --reference_genome Zea_mays.B73_RefGen_v4.50.dna.toplevel.fa --strain maize

But it turned out:

Strain name specified [maize] does not match any of the available strain names!

Available genomes to choose from are:
=====================================
unknown
=====================================

Please double check the name and try again (using '--strain NAME')

It may means I have set uncorrected genome name, then I found "unknown" in line #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT unknown in the vcf file, then I ran the command again with --strain from "maize" to "unknown", :

SNPsplit_genome_preparation --vcf B73_Ki11_all.ref_B73.vcf --reference_genome Zea_mays.B73_RefGen_v4.50.dna.toplevel.fa --strain unknown

The exception came out:

Strain defined as 'unknown' (strain index: 9)
Failed to move to genome folder > Zea_mays.B73_RefGen_v4.50.dna.toplevel.fa/ <: Not a directory

SNPsplit_genome_preparation --help for more details

It may means I should put the genome in a fold, so I reran the command as below with the genome in maize_genome fold:

SNPsplit_genome_preparation --vcf B73_Ki11_all.ref_B73.vcf --reference_genome maize_genome --strain unknown

But there are so many exceptions like below:

Use of uninitialized value $fi in numeric eq (==) at /data/CrazyHsu_data/opts/SNPsplit_genome_preparation line 1041, <IN> line 125001.

There are many files generated in SNPs_unknown, but they are all empty. And the content of __unknown_SNP_filtering_report.txt__ is:

SNP position summary for strain unknown (based on genome build GRCm38)
===========================================================================

Positions read in total:        302818

47016   SNP were homozygous. Of these:
0       SNP were homozygous and passed high confidence filters and were thus included into the unknown genome

Not included into unknown genome:
1296    had the same sequence as the reference
0               had no clearly defined alternative base
254501          Calls were neither 0/0 (same as reference) or 1/1, 2/2, 3/3 (homozygous SNP)
47016           were homozygous but the filtering call was low confidence

Printed a single list of all SNPs to >all_SNPs_unknown_GRCm38.txt.gz<...

Head 10 line of __unknown_genome_preparation_report.txt__ is:

0 positions on chromosome 1 were changed to 'N'
0 positions on chromosome 10 were changed to 'N'
0 positions on chromosome 2 were changed to 'N'
0 positions on chromosome 3 were changed to 'N'
0 positions on chromosome 4 were changed to 'N'
0 positions on chromosome 5 were changed to 'N'
0 positions on chromosome 6 were changed to 'N'
0 positions on chromosome 7 were changed to 'N'
0 positions on chromosome 8 were changed to 'N'
0 positions on chromosome 9 were changed to 'N'

Is there anything wrong when I use the SNPsplit_genome_preparation? And how can I do?

Best, Feng

HI @CrazyHsu

The SNPsplit genome preparation is designed to work with VCF files provided by the Mouse Genomes Project, either mgp.v5.merged.snps_all.dbSNP142.vcf.gz, and as of last week also the new v7 version (see #44). This section of the User Guide explains in details which sections of the v5 Mouse VCF file are important for this process: https://github.com/FelixKrueger/SNPsplit/blob/master/SNPsplit_User_Guide.md#filtering-and-processing-high-confidence-snps-from-the-vcf-file

SNPsplit is really designed to work with hybrid diploid genomes where the genomes of both parental strains is known. As an example, when genome 1 is called you know that the reads must have come from the maternal strain, and reads specific for genome 2 came from the paternal strain. In the scenario where you perform variant calling on the data, and then use this as the basis for an N-masked genome, you might be able to calle allele1 or allele2, but you can probably not tell wether it is the paternal or maternal genome. All you can look at is allelic imbalance, unless you have completely phased genotypes.

There a also a few more considerations:

for a variant calling approach such as yours, you don't want to use homozygous SNPs which are being used for the mouse strains. Instead, you will need to look at high confidence heterozygous SNPs!
Not sure if the tally you linked above is correct, but is seems that there are ~250,000 SNPs where the genome call is not homozygous (which is probably the SNPs you are after): If we assume that the maize genome is ~2.7GB large, then you would only expect 1 SNP every 10kb or so - so the coverage will probably be very very sparse indeed....

Having said that, in theory SNPsplit should be able to handle this as well, but it will require some work in order to get there.

Your options are:

instead of using SNPsplit_genome_preparation to parse your VCF file, you could filter the VCF file yourself and generate a folder with the heterozgous SNPs. In theory you should be able use this information to construct a genome, but you will need to adhere to the same rules that the script uses internally.

Let's say you call your straing maize: --strain maize

Next, the SNPs will need to be contained within a folder called:

SNPs_maize

inside there, the script expects one file per chromosome, e.g. like so:

chr10.txt
chr11.txt
chr12.txt
chr13.txt
chr14.txt
chr15.txt
chr16.txt
chr17.txt
chr18.txt
chr19.txt
chr1.txt
chr2.txt
chr3.txt
chr4.txt
chr5.txt
chr6.txt
chr7.txt
chr8.txt
chr9.txt
chrMT.txt
chrX.txt
chrY.txt

So you will have to split your VCF SNP by chromosome. And finally, these file need to adhere to the following format:

>10
42459235        10      3101362 1       G/C     1/1:101:31:0:284,101,0:255,90,0:2:51:31:0,0,22,9:0:-0.69311:.:1
42459276        10      3102112 1       T/C     1/1:127:50:0:288,164,0:255,151,0:2:44:50:0,0,22,28:0:-0.693147:.:1
42459298        10      3102661 1       C/G     1/1:127:59:0:290,192,0,.,.,.:255,178,0,.,.,.:2:48:59:0,0,34,25:0:-0.693147:.:1
42459325        10      3103479 1       A/G     1/1:127:49:0:288,161,0:255,148,0:2:57:49:0,0,17,32:0:-0.693147:.:1
...

where the first line is the name of the chromosome (like in a FastA file).

this is also explained in the option:

--skip_filtering              This option skips reading and filtering the VCF file. This assumes that a folder named
                              'SNPs_<Strain_Name>' exists in the working directory, and that text files with SNP information
                              are contained therein in the following format:
                                      SNP-ID     Chromosome  Position    Strand   Ref/SNP
                          example:   33941939        9       68878541       1       T/G

I believe column 6 is optional, but the first 5 need to be present. It is sufficient to put 1 as strand if the sequence is based relative to the top strand. So you will need to reformat your file just a little, and you should be able to run it.

If the number of chromosomes in maize is different, you need to change that accordingly. And finally, you will also need to edit the SNPsplit_genome_preparation, search for '$skip_filtering' and find the position where the chromosomes are hard-coded. Edit that as well to work for your genome.

The other option is that you go through the SNPsplit_genome_preparation and change the logic in there so that all high confidence heterozygous SNP calls get used instead of homozygous ones.

I am afraid both approaches will require some amount of tinkering...

FelixKrueger / SNPsplit

Some questions about working on plant species #45