edgardomortiz / vcf2phylip

Convert SNPs in VCF format to PHYLIP, NEXUS, binary NEXUS, or FASTA alignments for phylogenetic analysis
GNU General Public License v3.0

KeyError: '2' is always raised when converting a VCF file to a PHYLIP file. #49

Closed mxtu97 closed 6 months ago

mxtu97 commented 6 months ago

I have shortened the variety (sample) names to four digits, such as 4111, so that they fit within 10 characters. However, I still encounter this error. Could you please tell me how to resolve it?

$ python3 vcf2phylip-2.8/vcf2phylip.py -i test.vcf

Converting file 'test.vcf':

Number of samples in VCF: 290
Traceback (most recent call last):
  File "vcf2phylip-2.8/vcf2phylip.py", line 502, in <module>
    main()
  File "vcf2phylip-2.8/vcf2phylip.py", line 316, in main
    site_tmp = get_matrix_column(record, num_samples,
  File "vcf2phylip-2.8/vcf2phylip.py", line 129, in get_matrix_column
    column += AMBIG[geno_nuc]
KeyError: '2'

edgardomortiz commented 6 months ago

Hi @mxtu97 ,

I need more information. There is no "variety" field in the VCF format, so what did you change in your VCF? In any case, the error is not related to that; I guess you have some malformed genotypes. If you can share a few thousand lines from your VCF, I might be able to help...

Edgardo

mxtu97 commented 6 months ago

Yeah, of course. These are the first few lines of the VCF file; not all of the variety names are listed here.

##fileformat=VCFv4.2
##fileDate=20240320
##source=PLINKv1.90
##contig=
##contig=
##contig=
##contig=
##contig=
##contig=
##contig=
##contig=
##contig=
##contig=
##contig=
##contig=
##contig=
##contig=
##contig=
##contig=
##contig=
##contig=
##contig=
##INFO=
##FORMAT=
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT R4157 R4158 R4163 R4168 R4171 R4176 R4177 R4179 R4180 R4181 R4182 R4187 R4191 R4194 R4199 R4202 R4210 R4213 R4215 R4218 R4219 R4220 R4222
1 65427 1_65427 2 1 . . PR GT 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/1 0/0 0/0
1 110988 1_110988 2 1 . . PR GT 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/1 0/0 1/1 0/0 1/1 0/0 0/0 0/0 0/0 0/0
1 124600 1_124600 2 1 . . PR GT 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/1 0/0
1 124713 1_124713 2 1 . . PR GT 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/1 0/0 0/0 0/0 0/1 0/0
1 124751 1_124751 2 1 . . PR GT 0/0 0/0 0/1 1/1 0/1 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/1 1/1 0/0 0/0 0/1 1/1 1/1
1 124781 1_124781 2 1 . . PR GT 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/1 0/0
1 124787 1_124787 2 1 . . PR GT 0/0 0/0 0/1 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/1 0/0
1 125023 1_125023 2 1 . . PR GT 0/0 0/0 1/1 1/1 0/1 0/0 0/0 0/0 0/1 1/1 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/0 0/1 1/1 0/0

edgardomortiz commented 6 months ago

As I suspected, your VCF has a non-standard format: the REF and ALT columns (the fourth and fifth columns) must contain nucleotides, not numbers. See here:

https://en.m.wikipedia.org/wiki/File:Binary_BCF_versus_VCF_format.png

Interesting, how did you generate this VCF?

Edgardo
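The traceback ends at `column += AMBIG[geno_nuc]`, which suggests a dictionary lookup keyed by the nucleotides each genotype resolves to. The following is a minimal sketch of that failure mode, not the actual vcf2phylip code: the reduced `AMBIG` table and the `call_to_symbol` helper are hypothetical and exist only to show why a numeric REF such as `2` would end in `KeyError: '2'`.

```python
# Hypothetical, reduced lookup table; the real AMBIG table in vcf2phylip
# is assumed to be larger and may be organised differently.
AMBIG = {"A": "A", "C": "C", "G": "G", "T": "T", "AG": "R", "CT": "Y", "N": "N"}


def call_to_symbol(ref, alt, genotype):
    """Translate one GT call such as '0/1' into a single alignment symbol.

    Simplification: missing calls ('./.') are not handled in this sketch.
    """
    alleles = [ref] + alt.split(",")                 # index 0 = REF, 1.. = ALT
    indices = genotype.replace("|", "/").split("/")  # accept phased calls too
    key = "".join(sorted({alleles[int(i)] for i in indices}))
    return AMBIG[key]                                # fails if key is not nucleotides


print(call_to_symbol("A", "G", "0/1"))  # heterozygous A/G

try:
    # REF=2, ALT=1 as in the posted records: the lookup key becomes '2'.
    call_to_symbol("2", "1", "0/0")
except KeyError as err:
    print("KeyError:", err)
```

With nucleotide alleles the lookup succeeds; with the numeric alleles from the posted file, the key is the string `2`, which no nucleotide table would contain.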

mxtu97 commented 6 months ago

The problem is solved; it was indeed caused by the genotypes. Thanks.
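For anyone hitting the same error, a quick check of the REF and ALT columns can locate the offending records before running the conversion. This is only a sketch under stated assumptions: `find_bad_alleles` is a hypothetical helper, the input is assumed to be an uncompressed, tab-delimited VCF, and only plain nucleotide alleles (plus `.` and `*`) are treated as acceptable. The second data record in the demo is made up for contrast.

```python
import re

# Assumption: alleles made only of A, C, G, T or N are acceptable.
VALID_ALLELE = re.compile(r"^[ACGTN]+$", re.IGNORECASE)


def find_bad_alleles(lines):
    """Yield (chrom, pos, ref, alt) for records whose REF or ALT is not nucleotide text."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue                      # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue                      # not a complete record
        chrom, pos, _id, ref, alt = fields[:5]
        alleles = [ref] + alt.split(",")
        if not all(VALID_ALLELE.match(a) or a in (".", "*") for a in alleles):
            yield chrom, pos, ref, alt


sample = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tR4157\n",
    "1\t65427\t1_65427\t2\t1\t.\t.\tPR\tGT\t0/0\n",   # numeric alleles, as posted above
    "1\t70000\tsnp_ok\tA\tG\t.\t.\tPR\tGT\t0/1\n",    # hypothetical well-formed record
]

for chrom, pos, ref, alt in find_bad_alleles(sample):
    print(f"{chrom}:{pos} REF={ref!r} ALT={alt!r}")
```

To scan a real file, pass an open file handle instead of the list, for example `find_bad_alleles(open("test.vcf"))`.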

edgardomortiz commented 6 months ago

Great, don't hesitate to ask if you find a new issue...

Edgardo