Illumina / hap.py

Haplotype VCF comparison tools
Other
402 stars 122 forks source link

PreProcessing creates un-ordered VCF #163

Open MattWellie opened 1 year ago

MattWellie commented 1 year ago

Error message:

Exception: Command line bcftools index -t /tmp/tmpIg13Dm.vcf.gz got return code 255.STDOUT: STDERR: [E::hts_idx_push] unsorted positions on sequence #5: 38598232 followed by 38598227index: failed to create index for "/tmp/tmpIg13Dm.vcf.gz"
Error running BCFTOOLS; please check your file for compatibility issues issues using vcfcheck

This is caused by running the following command

hap.py {truth_vcf.gz} {input_vcf.bgz} -r {path_to_fasta} -R {path_to_bed} -o {path_to_output} --engine-vcfeval-path={path_to_rtg} --threads 10 --engine-vcfeval-template {path_to_sdf} --engine=vcfeval --preprocess-truth

The pre-processing of the truth VCF (with tons of logging lines as noted in #129), but fails during the processing of the input VCF. The sites in question (chr5:38598232, chr5:38598227) are correctly ordered in the query VCF before processing, and are 2 deletions. It seems like at some point during the pre-processing the sites are becoming disordered, possibly as a result of the ref alleles overlapping. Without including any additional VCF content, the 2 lines as they appear in the sorted, indexed, input VCF:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMPLE
chr5    38598227    rs68118711  TTTTTTATTTATTTATTTATTTTTTTTTTTTTTC  T
chr5    38598232    rs373607072;rs71963527  TATTTATTTA  T

If this is a result of decomposition, could this be resolved by adding in a bcftools sort call prior to the index?

This issue appears to be the same as #109, but a fix was proposed for that issue on a branch that no longer exists, and the issue is closed.

MattWellie commented 1 year ago

I've solved this problem myself by updating the bcftools version within the image

The issue is that the BCFtools 1.4.1 bundled as an external tool here seems to have a dumb BED file filtering process, where each region in the BED file is processed in turn, and variants are appended to the output. Relevant regions from my BED file are:

chr5    38598232    38598244
chr5    38598259    38602497

In this example both variants overlap with the first interval, and the first (sorted by POS) ends in the second interval so it is being added twice: (example salvaged from the temp VCF generated by bcftools.py's preProcessVCF)

chr5    38598227    rs68118711  TTTTTTATTTATTTATTTATTTTTTTTTTTTTTC  T   ...
chr5    38598232    rs373607072;rs71963527  TATTTATTTA  T   ...
chr5    38598227    rs68118711  TTTTTTATTTATTTATTTATTTTTTTTTTTTTTC  T   ...

PLEASE PLEASE PLEASE update the bundled tool binaries in this repository, the SAMtools/BCFtools are 5 years old now