cibiobcg / EthSEQ

Ethnicity Annotation from Whole-Exome and Targeted Sequencing Data

VCF file parsing issue #8

Open yangyxt opened 3 years ago

yangyxt commented 3 years ago

I got this error message:

no DISPLAY variable so Tk is not available
[2021-02-16 16:01:55] Load genotype data in VCF file: /paedyl01/disk1/yangyxt/public_data/1000g/1000g_phase3_from_fei/samples_used_to_build_ref_model/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.unrelated.vcf.gz
[1] FALSE
Warning messages:
1: In fread(geno, data.table = FALSE) : Detected 1 column names but the data has 2 columns (i.e. invalid file). Added 1 extra default column name for the first column which is guessed to be row names or an index. Use setnames() afterwards if this guess is not correct, or fix the file write command that create
2: In fread(geno, data.table = FALSE) : Stopped early on line 229. Expected 2 fields but found 4. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">>>

It seems the header line describing the GT definition cannot have more than 2 fields separated by commas? Or should I just drop the header lines and input only the VCF body content?

Here is what my VCF file looks like (from 1000g): [image]

mike8115 commented 3 years ago

From my experience so far, you need to drop all but the last line of the header. It doesn't look like the script makes any attempt at parsing the header section, so it tries to treat the header as actual data.
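The workaround described above can be sketched in Python as follows. This is only an illustration and not part of EthSEQ: the function names `strip_vcf_meta` and `strip_vcf_meta_file` are hypothetical, the example records are made up, and it assumes that "the last line of the header" means the `#CHROM` column line, with every line starting with `##` being meta-information that can be dropped. Whether EthSEQ then reads the result correctly has not been verified here.

```python
import gzip


def strip_vcf_meta(lines):
    """Yield VCF lines, skipping '##' meta-information lines.

    The '#CHROM ...' column line starts with a single '#', so it is kept.
    """
    for line in lines:
        if not line.startswith("##"):
            yield line


def strip_vcf_meta_file(src_path, dst_path):
    """Hypothetical helper: write a copy of a gzipped VCF without '##' lines."""
    with gzip.open(src_path, "rt") as src, gzip.open(dst_path, "wt") as dst:
        dst.writelines(strip_vcf_meta(src))


# Small in-memory example (made-up sample name and record).
example = [
    "##fileformat=VCFv4.1\n",
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1\n",
    "1\t10177\t.\tA\tAC\t100\tPASS\t.\tGT\t0|1\n",
]
stripped = list(strip_vcf_meta(example))
print("".join(stripped), end="")
```

Note that a file without the `##fileformat` line is no longer a strictly valid VCF, so it is probably safest to keep the original file and pass only the stripped copy to the tool.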