fritzsedlazeck / SURVIVOR

Toolset for SV simulation, comparison and filtering
MIT License
353 stars 47 forks source link

VCF header fields missing #189

Open Krannich479 opened 1 year ago

Krannich479 commented 1 year ago

Hello Fritz & SURVIVOR dev team,

What.

I attempted to use the VCF file generated by SURVIVOR with bcftools, as for instance recommended in issue https://github.com/fritzsedlazeck/SURVIVOR/issues/173. However, bcftools suffers from a bug that originates from SURVIVOR I think.

Error.

When using bcftools (sort+index) on SURVIVOR's truthset VCF, I get warnings and an error regarding the VCF header not matching the FORMAT fields.

bcftools sort -o simulated.sorted.vcf.gz -O z simulated.vcf; bcftools index -t simulated.sorted.vcf.gz

Writing to /tmp/bcftools-sort.9eyTIT
[W::vcf_parse_format] FORMAT 'GL' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'GQ' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'FT' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'RC' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'DR' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'DV' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'RR' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'RV' is not defined in the header, assuming Type=String
[E::bcf_write] Unchecked error (2), exiting

I looked into the code at https://github.com/fritzsedlazeck/SURVIVOR/blob/ed1ca5188a2d9286d582417b7a65938c768df995/src/simulator/SV_Simulator.cpp#L951 where the FORMAT field for SNP variant records is written. In case the print_vcf_header2 function above is the corresponding header than the FORMAT fields indeed do not match.

Solution.

I propose two trivial solutions here:

  1. Removing the FORMAT fields from the SNP variant records.
  2. Adding the FORMAT fields to the function that generates the header.

I am voting for solution 1 here because: a) I think FT, RC, DR, DV, RR, RV are ancient relics of Lumpy unrelated to SNPs. Also these FORMAT fields have been commented out throughout most of SURVIVOR. b) The missing fields are not part of the VCF4.2 standard and should not be present if not defined and used. c) I tested that the VCF file generated by SURVIVOR works flawlessly with bcftools if the fields are removed from SNP records. (see hotfix below)

Hotfix.

sed -i 's/GT:GL:GQ:FT:RC:DR:DV:RR:RV/GT/g ' simulated.vcf where simulated.vcf is the VCF by SURVIVOR.

rl4940 commented 3 months ago

I am suffering the exact same problem here as you were. But I did not set SNP cuz I am not intended to study SNP, only SVs are of my interest. So does it mean that I can use the:

sed -i 's/GT:GL:GQ:FT:RC:DR:DV:RR:RV/GT/g ' simulated.vcf

To fix it? Thanks

rl4940 commented 3 months ago

OK I figured it out, besides that line of code, we also need to add headers like this manually:

##FILTER=<ID=LowQual,Description="Low quality">
##INFO=<ID=PRECISE,Number=0,Type=Flag,Description="Precise variant">
Krannich479 commented 3 months ago

Hi @rl4940, the issue I had, the hotfix I proposed (in this issue) as well as the PR #190 aim for a format correction of SNV records only. However, I hypothesize that if your callset does not include the additional fields (GQ:FT:RC:DR:DV:RR:RV) it's probably safe to use the sed command from above.


(disclaimer: I am not a maintainer of SURVIVOR, I just tried to fix a bug here)