TileDB-Inc / TileDB-VCF

Efficient variant-call data storage and retrieval library using the TileDB storage library.
https://tiledb-inc.github.io/TileDB-VCF/
MIT License

export with -m (merge) option #652

Open lynnjo opened 5 months ago

lynnjo commented 5 months ago

Hello -

I am using tiledbvcf to create a dataset that I would later like to be able to export as a merged VCF file. I can successfully load and export data from this dataset. What I would like to do is export to a multi-sample VCF file. It looks like export with the -m option should handle this, but it gives me memory errors. I added the -b flag to increase the memory budget, but still no luck. The command I am running:

tiledbvcf export --uri tiledb_datasets/gvcf_dataset  -m -b 65536 -o /workdir/lcj34/phg_v2/exportedHvcfs/mergedGvcf.vcf

The error I get:

Exception: SubarrayPartitioner: Trying to partition a unary range because of memory budget, this will cause the query to run very slow. Increase `sm.memory_budget` and `sm.memory_budget_var` through the configuration settings to avoid this issue. To override and run the query with the same budget, set `sm.skip_unary_partitioning_budget_check` to `true`.

Is there another trick to running the tiledbvcf export command to create a merged VCF? Thank you

I am running tiledbvcf version:

(phgv2-conda) [lcj34@cbsubl01 phg_v2]$ tiledbvcf --version
TileDB-VCF version 0f72331-modified
TileDB version 2.16.3
htslib version 1.16

My machine is running Linux; these are the specifics:

NAME="Rocky Linux"
VERSION="9.0 (Blue Onyx)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.0"
gspowley commented 5 months ago

Hi @lynnjo,

Please try adding the following --tiledb-config options to your export command, which will increase sm.memory_budget to 10GiB, sm.memory_budget_var to 20GiB, and skip the memory budget check.

tiledbvcf export \
  --uri tiledb_datasets/gvcf_dataset  \
  -m -b 65536 \
  -o /workdir/lcj34/phg_v2/exportedHvcfs/mergedGvcf.vcf \
  --tiledb-config sm.memory_budget=10737418240,sm.memory_budget_var=21474836480,sm.skip_unary_partitioning_budget_check=true

The export may be slow, as reported by the original error message, because we have not optimized the performance of exporting a merged VCF yet.
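For anyone driving this from Python rather than the CLI, here is a minimal sketch of passing the same TileDB config keys through the tiledbvcf Python API (this assumes the tiledbvcf Python package is installed; the merged single-file VCF export itself is the CLI's -m feature, so this only illustrates how the config is applied when opening and reading the dataset):

import tiledbvcf

# Same TileDB settings as the --tiledb-config string above
cfg = tiledbvcf.ReadConfig(
    tiledb_config=[
        "sm.memory_budget=10737418240",
        "sm.memory_budget_var=21474836480",
        "sm.skip_unary_partitioning_budget_check=true",
    ]
)

ds = tiledbvcf.Dataset("tiledb_datasets/gvcf_dataset", mode="r", cfg=cfg)

# Read a few attributes into a pandas DataFrame to confirm the dataset opens
df = ds.read(attrs=["sample_name", "contig", "pos_start", "alleles"])
print(df.head())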

lynnjo commented 5 months ago

Thanks @gspowley - I will try the above.

Do I still keep the "-b 65536" flag while adding the last line you show?

One more question: we note that GATK can export a multi-sample VCF using "gatk GenotypeGVCFs -V gendb://" and that it is relatively fast. I know tiledbvcf originated from GenomicsDB. Is the reason this works from GATK that GATK itself does some of the work to merge the files?

gspowley commented 5 months ago

Yes, keeping the -b 65536 option will improve export performance, assuming your system has enough memory. The memory budget parameters may need some tuning based on your dataset and system resources.
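For tuning, note that the byte values used above are just GiB counts multiplied out; a small illustrative sketch (the helper name here is hypothetical, not part of tiledbvcf):

# Illustrative only: convert a GiB budget into the byte value expected by
# sm.memory_budget / sm.memory_budget_var
def gib_to_bytes(gib):
    return gib * 1024**3

print(gib_to_bytes(10))  # 10737418240 -> sm.memory_budget
print(gib_to_bytes(20))  # 21474836480 -> sm.memory_budget_var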