broadinstitute / palantir-workflows

Utility workflows for the DSP hydro.gen team (formerly palantir)
BSD 3-Clause "New" or "Revised" License

Disk size in PerformPopulationPCA #147

Closed by fabio-cunial 1 year ago

fabio-cunial commented 1 year ago

The default value of disk_size in tasks SeparateMultiallelics and SubsetToArrayVCF makes them crash on the 1000 Genomes hg38 joint VCF (260 GB). Setting disk_size=1000 fixes the issue, but only after a long wait for the first crash. Would it be possible to derive a better default for disk_size, for example from the size of the input VCF?

Task LDPruning has a hardcoded disk size, which makes it crash on the same VCF. Its disk size should be exposed as a task input with a better default.
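
To make the request concrete, here is a minimal sketch of one way such a default could be derived: scale the input size by a multiplier and add fixed headroom. The function name, the multiplier of 4, and the 20 GB padding are all hypothetical placeholders and were not measured on these tasks; the right values depend on how much intermediate and output data each task writes.

```python
import math


def estimate_disk_size_gb(input_size_gb: float,
                          multiplier: float = 4.0,
                          padding_gb: int = 20) -> int:
    """Return a disk size in GB derived from the input size.

    multiplier and padding_gb are illustrative assumptions, not values
    taken from the workflow; tune them per task.
    """
    if input_size_gb < 0:
        raise ValueError("input_size_gb must be non-negative")
    # Round up so fractional sizes never under-provision the disk.
    return math.ceil(input_size_gb * multiplier) + padding_gb


# The 260 GB joint VCF from this issue, with the placeholder defaults.
print(estimate_disk_size_gb(260))  # 1060
```

In a WDL task the same idea would normally be written as a default expression on the disk_size input, so that callers can still override it explicitly.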

kockan commented 1 year ago

Workaround added.