changlubio / GenomePrep

To preprocess, quality control and prepare consumer DTC genomes for research
BSD 3-Clause "New" or "Revised" License
14 stars 2 forks source link

Produces invalid VCF #2

Closed CholoTook closed 2 years ago

CholoTook commented 2 years ago

Hi,

I ran the produced VCF through the VCF validator (debugulator) here: https://github.com/EBIvariation/vcf-validator

Unfortunately it gave the following error:

According to the VCF specification, the input file is not valid
Error: The fileformat declaration is not valid (the line must start with 
##fileformat= and the value must be one of 'VCFv4.1', 'VCFv4.2' or 'VCFv4.3').
This occurs 1 time(s), first time in line 1.

I added ##fileformat=VCFv4.1

There are a few more errors after I've fixed this one:

This occurs 1 time(s), first time in line 25.

I added ##reference=GCF_000001405.13

I then added the contigs:

##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=6,length=171115067>
##contig=<ID=7,length=159138663>
##contig=<ID=8,length=146364022>
##contig=<ID=9,length=141213431>
##contig=<ID=10,length=135534747>
##contig=<ID=11,length=135006516>
##contig=<ID=12,length=133851895>
##contig=<ID=13,length=115169878>
##contig=<ID=14,length=107349540>
##contig=<ID=15,length=102531392>
##contig=<ID=16,length=90354753>
##contig=<ID=17,length=81195210>
##contig=<ID=18,length=78077248>
##contig=<ID=19,length=59128983>
##contig=<ID=20,length=63025520>
##contig=<ID=21,length=48129895>
##contig=<ID=22,length=51304566>
##contig=<ID=X,length=155270560>
##contig=<ID=Y,length=59373566>
##contig=<ID=MT>

Now I get 803 errors like this:

Error: Duplicated variant 1:2526746:A>G found. This occurs 2 time(s), first time in line 468.

which relates to these rows:

1   2526746 rs3748816   A   G   .   .   .   GT  1/1
1   2526746 i6059967    A   G   .   .   .   GT  1/1

From the spec I think this should be:

1   2526746 rs3748816;i6059967  A   G   .   .   .   GT  1/1

I 'fixed' these using the debugulator.

Now I notice:

Warning: FORMAT 'GT' is not listed in a valid meta-data FORMAT entry. This occurs 948928 time(s), first time in line 51.

So I added:

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

It then complains about all 'non standard' comments, which I don't think is correct... However, I stripped them out.

The final header I created looks like this:

##fileformat=VCFv4.1
##fileDate=16-02-22
##source=GenomePrep v1.1,company=23andme,usebadallele=c4,maxcov=1.0,from_valid_snps=959109
##reference=GCF_000001405.13
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=6,length=171115067>
##contig=<ID=7,length=159138663>
##contig=<ID=8,length=146364022>
##contig=<ID=9,length=141213431>
##contig=<ID=10,length=Assembly>
##contig=<ID=11,length=Assembly>
##contig=<ID=12,length=Assembly>
##contig=<ID=13,length=Assembly>
##contig=<ID=14,length=Assembly>
##contig=<ID=15,length=Assembly>
##contig=<ID=16,length=Assembly>
##contig=<ID=17,length=Assembly>
##contig=<ID=18,length=Assembly>
##contig=<ID=19,length=Assembly>
##contig=<ID=20,length=Assembly>
##contig=<ID=21,length=Assembly>
##contig=<ID=22,length=Assembly>
##contig=<ID=X,length=155270560>
##contig=<ID=Y,length=59373566>
##contig=<ID=MT>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  90365083240

I'm pointing this out because I'm seeing an error when uploading the VCF here: https://imputationserver.sph.umich.edu/index.html#!pages/home

and I get the fatal error:

Input Validation
The provided files are not VCF files (see Help).
changlubio commented 2 years ago

Hi there,

The header file was messy, thanks for pointing it out. It was originally simply copied from the consumer genome metadata, but these don't fit the VCF format.

The fatal error in the end was caused by duplicate lines from the consumer genome data (seen in many cases in 23andMe output, but they usually have the same genotyping calls just different rsids).

I have fixed the bugs above and tested using vcf_validator. Let me know if it fails again :)

Cheers C