TileDB-Inc / TileDB-VCF

Efficient variant-call data storage and retrieval library using the TileDB storage library.
https://tiledb-inc.github.io/TileDB-VCF/
MIT License

Cannot submit_and_finalize query #663

Open dotinspace opened 7 months ago

dotinspace commented 7 months ago

Hi! I am trying to ingest samples, but run into the following problem:

_40b48cf415984c70a4ab49e225d7c8ef_21 samples = SAMPLE
[2024-02-13 09:59:31.061] [tiledb-vcf] [...] [debug] Finalizing last contig batch of [1, 1]
[2024-02-13 09:59:31.061] [tiledb-vcf] [...] [debug] AlleleCount: Finalize query with 0 records
[2024-02-13 09:59:31.104] [tiledb-vcf] [...] [debug] VariantStats: Finalize query with 0 records
[2024-02-13 09:59:31.146] [tiledb-vcf] [...] [debug] Query buffer for 'contig' contains 3272 elements
[2024-02-13 09:59:31.146] [tiledb-vcf] [...] [critical] Cannot submit_and_finalize query with buffers set.

As far as I can tell, one of the following checks fails. Is there anything I can tweak to make this work, or does something else stand out as the obvious culprit?

    File: libtiledbvcf/src/stats/allele_count.cc
    if (contig_records_ > 0) {
      if (utils::query_buffers_set(query_.get())) {
        LOG_FATAL("Cannot submit_and_finalize query with buffers set.");
      }
      query_->submit_and_finalize();

    File: libtiledbvcf/src/stats/variant_stats.cc
    if (contig_records_ > 0) {
      if (utils::query_buffers_set(query_.get())) {
        LOG_FATAL("Cannot submit_and_finalize query with buffers set.");
      }
      query_->submit_and_finalize();
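
To make the control flow of the quoted guard easier to reason about, here is a minimal, self-contained sketch of it. It is not TileDB-VCF code: `MockQuery` and `finalize_stats` are hypothetical names, and the body of `query_buffers_set` is only an assumption about what `utils::query_buffers_set` checks (the real implementation may differ). The sketch throws an exception where the excerpt calls `LOG_FATAL`.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a TileDB query: it only tracks how many
// elements are still attached to each named buffer.
struct MockQuery {
  std::map<std::string, std::size_t> buffers;
  bool finalized = false;

  void submit_and_finalize() { finalized = true; }
};

// Assumed behavior of utils::query_buffers_set: report true if any
// buffer still holds elements. The real helper may check something else.
bool query_buffers_set(const MockQuery& query) {
  for (const auto& entry : query.buffers) {
    if (entry.second > 0) {
      return true;
    }
  }
  return false;
}

// Mirrors the control flow of the quoted excerpt. The buffer check and
// the finalize call are both skipped when there are no contig records.
void finalize_stats(MockQuery& query, std::size_t contig_records) {
  if (contig_records > 0) {
    if (query_buffers_set(query)) {
      throw std::runtime_error(
          "Cannot submit_and_finalize query with buffers set.");
    }
    query.submit_and_finalize();
  }
}
```

In this model, calling `finalize_stats(query, 0)` returns without touching the query even if a buffer is still populated, while calling it with a nonzero record count and a populated buffer raises the error. Whether the real failure follows either path depends on state that the log above does not show.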

gspowley commented 7 months ago

Hi @dotinspace,

This error looks like a data-dependent edge case related to the AlleleCount and VariantStats stats having 0 records.

To work around the issue, please create the dataset with AlleleCount and VariantStats disabled:

    tiledbvcf create --disable-allele-count --disable-variant-stats ...

If you can share the VCF file you ingested, it would help us debug the issue (I know that is not always possible). Otherwise, we will try to reproduce the condition that causes this error.

dotinspace commented 7 months ago

Hi, thanks for the swift response.

The multisample VCF, ALL.chr1.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz, was taken from 1000Genomes. It was then split with vcf-split, block-compressed (bgzip), and indexed (tabix) before being ingested into a TileDB-VCF dataset. I wouldn't be surprised if the VCF files, or the splitting process, caused some issue with those two stats arrays. Unfortunately, our current tests rely on variant_stats.

Anyway, nice to know what is going on for future reference.