PacificBiosciences / HiPhase

Small variant, structural variant, and short tandem repeat phasing tool for PacBio HiFi reads

Error while parsing VCF file: FORMAT columns #27

Closed hangsuUNC closed 9 months ago

hangsuUNC commented 9 months ago

Hi Matt,

I was running HiPhase on a VCF and got an error for a pbsv call. I manually checked this record and didn't see anything obviously wrong.

[E::vcf_parse_format] FORMAT column with no sample columns starting at chr1:80764910
[2024-02-09T19:59:04.480Z ERROR hiphase] Error while parsing VCF file: invalid record in BCF/VCF file

The FORMAT columns of this call are here: chr1 80764910 pbsv.INS.2090 GT:AD:DP:SUPP ./.:.:.:.

In addition, I have seen this error before with other pbsv call sets. Does HiPhase require all records to have the same FORMAT tags? Could you offer some suggestions about this error and explain what it means?

Best regards,

Hang Su

holtjma commented 9 months ago

Hi Hang Su,

So the error is basically getting passed through from rust_htslib, which does all the VCF and BAM parsing for HiPhase and is itself just a wrapper for htslib. Both of these libraries are very well tested, and they typically do not throw errors like this unless something is poorly formatted. To date, every time I've seen this in HiPhase, it was because of a file that was not following specifications.

Assuming what you provided is verbatim, it appears to be missing some fields, namely REF, ALT, QUAL, FILTER, and INFO. You may have just filtered these out for the initial message, but I'm not sure. As for FORMAT, VCF spec says every sample in the file must have a GT entry, but can drop trailing fields if they're empty.

I have a few questions to help debug the problem:

  1. How did you generate the file? And are you using the latest version of pbsv? I ask because so far I have not seen this error come directly from pbsv output, but rather after some file merging/manipulation/etc. has been performed on it.
  2. Can you provide the full VCF line for the entry that it says is throwing the error?
  3. Have you tried seeing what other VCF tools like bcftools say about the file? When I've encountered this issue in the past, it typically was not constrained to HiPhase. This will also help you identify if it's an error in the file formatting.
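As a rough complement to question 3, a malformed record of the kind the htslib message describes (a data line with fewer columns than the `#CHROM` header) can be located with a short script. This is only a minimal sketch using the Python standard library. It is not part of HiPhase, htslib, or bcftools, and the function name `find_short_records` and the example records are hypothetical. It only compares column counts and does not validate anything else about the file.

```python
import io


def find_short_records(vcf_text):
    """Report data lines whose column count differs from the #CHROM header.

    Returns a list of (line_number, n_columns, expected_columns) tuples.
    Hypothetical helper for illustration; assumes tab-delimited, uncompressed text.
    """
    expected = None
    problems = []
    for line_number, line in enumerate(io.StringIO(vcf_text), start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        if line.startswith("#CHROM"):
            expected = len(line.split("\t"))
            continue
        n_columns = len(line.split("\t"))
        if expected is not None and n_columns != expected:
            problems.append((line_number, n_columns, expected))
    return problems


# Made-up records: the second data line has a FORMAT column but no sample column.
example = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n"
    "chr1\t100\tvar1\tA\tT\t.\tPASS\t.\tGT\t0/1\n"
    "chr1\t200\tvar2\tA\tAT\t.\tPASS\t.\tGT\n"
)
print(find_short_records(example))
```

For a real file, the text would be read from disk (decompressing first if it is bgzipped) and any reported line numbers inspected by hand.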

Matt

hangsuUNC commented 9 months ago

Hi Matt,

Thanks for your reply!

I found the reason: there are calls from pbsv with no GT tag. Also, some calls have fewer columns, e.g. records with no sample information.

I generated the file using bcftools view -s "samplename" to split it out of a large joint SV call set. pbsv is one of the callers, and I did some preprocessing of the calls, e.g. converting lower case to upper case and deleting the records with no GT tag. bcftools is fine with those records, but HiPhase fails due to format issues...
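The kind of preprocessing described here (setting aside records with no GT tag or no sample columns) could look roughly like the sketch below. It is an illustration under stated assumptions, not the script actually used and not a statement of what HiPhase accepts: the function name `split_valid_records` and the example records are hypothetical, the input is assumed to be tab-delimited text lines, and GT is expected to be the first FORMAT key.

```python
def split_valid_records(lines):
    """Separate VCF lines into (kept, dropped).

    A data line is kept only if it has at least one sample column and its
    FORMAT column starts with GT. Header lines are always kept.
    Hypothetical helper for illustration only.
    """
    kept, dropped = [], []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        # Columns 1-8 are fixed, column 9 is FORMAT, columns 10+ are samples.
        has_samples = len(fields) >= 10
        has_gt = len(fields) >= 9 and fields[8].split(":")[0] == "GT"
        if has_samples and has_gt:
            kept.append(line)
        else:
            dropped.append(line)
    return kept, dropped


# Made-up records: one well-formed, one without GT, one without a sample column.
example_lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n",
    "chr1\t100\tvar1\tA\tT\t.\tPASS\t.\tGT:AD\t0/1:3,4\n",
    "chr1\t200\tvar2\tA\tAT\t.\tPASS\t.\tAD:DP\t3,4:7\n",
    "chr1\t300\tvar3\tA\tAT\t.\tPASS\t.\tGT:AD\n",
]
kept, dropped = split_valid_records(example_lines)
print(len(kept), "lines kept;", len(dropped), "records dropped")
```

Keeping the dropped records in a separate list makes it easy to review what was removed before running the phasing step.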

Thanks,

Hang

holtjma commented 9 months ago

Great, it sounds like you've figured out the formatting issue! Closing this since it sounds resolved, feel free to re-open if you encounter it again!