divonlan / genozip

A modern compressor for genomic files (FASTQ, SAM/BAM/CRAM, VCF, FASTA, GFF/GTF/GVF, 23andMe...), up to 5x better than gzip and faster too

Issue with Allele Count Exceeding 99 in UKBB WGS VCF Files #32

Open Zhangliubin opened 2 weeks ago

Zhangliubin commented 2 weeks ago

Dear Genozip Team,

I am currently working with the UK Biobank (UKBB) Whole Genome Sequencing (WGS) dataset, which includes approximately 490,000 samples and over a billion variant sites. While compressing a VCF.GZ file with genozip, I encountered the following error:

genozip chr21.samples_1.hg38.vcf.gz : 25% (0 seconds) Error vcf_seg_FORMAT_in variant 21:9028016: VCF file sample 1 - genozip currently supports only alleles up to 99

The error occurs at variant sites that have more than 99 alleles. In some rare cases, a single site has more than 300 alleles.

Dataset: UKBB WGS data, with 490,000 samples and 1 billion variant sites.

I could not find any options in the documentation to handle such variant sites with a high allele count. Specifically, I would like to know: Is there an option to split these multi-allelic variant sites into multiple biallelic sites? Or is there an option to discard variants with an excessive number of alleles (e.g., more than 99)?
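A minimal sketch of the second option (discarding high-allele sites), written as a pre-filter that would run before genozip rather than as a genozip feature. The function names `count_alt_alleles`, `filter_vcf_lines`, and `filter_vcf_gz` are hypothetical, and the threshold is an assumption: the sketch treats "alleles up to 99" as "at most 99 ALT alleles", which may not match genozip's exact limit.

```python
import gzip


def count_alt_alleles(vcf_line: str) -> int:
    """Return the number of ALT alleles in one tab-separated VCF data line."""
    alt = vcf_line.split("\t", 5)[4]  # ALT is the fifth column
    # A lone "." in ALT means the record has no alternate allele.
    return 0 if alt == "." else alt.count(",") + 1


def filter_vcf_lines(lines, max_alt=99):
    """Yield header lines unchanged and data lines with at most max_alt ALT alleles.

    max_alt=99 is an assumed reading of the genozip error message.
    """
    for line in lines:
        if line.startswith("#") or count_alt_alleles(line) <= max_alt:
            yield line


def filter_vcf_gz(path_in: str, path_out: str, max_alt: int = 99) -> None:
    """Stream a gzip-compressed VCF to a new gzip file, dropping high-allele sites."""
    with gzip.open(path_in, "rt") as src, gzip.open(path_out, "wt") as dst:
        dst.writelines(filter_vcf_lines(src, max_alt))
```

Calling `filter_vcf_gz("chr21.samples_1.hg38.vcf.gz", "chr21.filtered.vcf.gz")` would write a copy without the offending sites (the output file name here is only an example). The filter streams line by line, so memory use stays flat even with very wide records, but the dropped variants are lost, which is why a split into biallelic sites would be preferable if available.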

I appreciate your help in resolving this issue and would be grateful for any guidance or potential solutions.

Thank you for your time and support.

Best regards,
Liubin Zhang