Open RenzoTale88 opened 4 years ago
Hi Andrea,
Apologies if I caused any confusion here. There isn't a software issue with scikit-allel. This is more of a "best practice" issue.
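When converting VCF into hdf5 (or zarr) we should do something like the following:

    import allel

    # Write each contig into its own top-level group of the HDF5 file.
    for contig in contigs:
        try:
            allel.vcf_to_hdf5(vcf_path, hdf5_path, region=contig, group=contig, fields="*")
        except StopIteration:
            # No records were found for this contig.
            print('no data for contig', contig)
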
This will create a file structure that looks like this:
- 1* (name of contig)
    - calldata*
        - GT
        - AD
        - etc
    - variants*
        - POS
        - REF
        - ALT
    - samples

* denotes a group rather than a dataset.
When developing xpclr, I/we had the convention of storing samples at the same level as the contig. This was a hangover from vcfnp, which predated scikit-allel and has now been made obsolete by allel.
To run xpclr using hdf5 or zarr you will need to move/copy the samples dataset to the level of the contig. To do this, you may need some familiarity with hdf5 and how allel creates these groups and datasets, which is why I linked you here.
i.e. more like:
- 1* (name of contig)
    - calldata*
        - GT
        - AD
        - etc
    - variants*
        - POS
        - REF
        - ALT
- 2* (name of contig)
    - calldata*
    - variants*
- samples
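
For illustration, here is a minimal sketch of the copy step using h5py. It assumes the file was written with one group per contig, each holding its own samples dataset, and that h5py is installed. The function names and the idea of copying from a single contig are my own assumptions for this example; they are not part of xpclr or scikit-allel.

```python
def copy_samples_to_root(root, contig):
    """Copy '<contig>/samples' to a top-level 'samples' dataset.

    `root` is expected to be an h5py.File opened in a writable mode
    (e.g. "r+"). Returns True if a copy was made, False if a top-level
    'samples' dataset was already present.
    """
    source = f"{contig}/samples"
    if "samples" in root:
        # Nothing to do: samples already sits next to the contig groups.
        return False
    if source not in root:
        raise KeyError(f"no samples dataset found under group {contig!r}")
    # Group.copy(source, dest) duplicates the dataset inside the same file.
    root.copy(source, "samples")
    return True


def copy_samples_in_file(hdf5_path, contig):
    """Open `hdf5_path` for in-place editing and copy the samples dataset."""
    # h5py is third-party; it is imported here so that the helper above
    # can be used on a file object that is already open.
    import h5py

    with h5py.File(hdf5_path, "r+") as f:
        return copy_samples_to_root(f, contig)
```

Something like `copy_samples_in_file(hdf5_path, "1")` would then copy the samples dataset from contig 1 to the root of the file. Every contig group written from the same VCF should carry the same sample list, so any one of them can serve as the source. It is worth working on a copy of the file first.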
Good morning, I'm trying to run the software xpclr, which can be found at https://github.com/hardingnj/xpclr. This software accepts both VCF and HDF5 as input. However, when I provide an HDF5 input created through scikit-allel, I get the following error:

I've had a long chat with the developer of the software here: https://github.com/hardingnj/xpclr/issues/49, but we couldn't figure out a solution. Do you know what might be causing the problem with my HDF5 file, and how I can work around the issue?
Thank you in advance for the help,
Andrea