cggh / scikit-allel

A Python package for exploring and analysing genetic variation data
MIT License
289 stars 51 forks source link

Issues accessing data archive #310

Open RenzoTale88 opened 4 years ago

RenzoTale88 commented 4 years ago

Good morning, I'm trying to run the software xpclr that can be found at the following URL https://github.com/hardingnj/xpclr. This software accepts both VCF and HDF5 as input. However, when I provide an input in format HDF5 created through scikit_allel, I get the following error:

/PATH/TO/Andrea/myanaconda/xpclr/lib/python3.6/site-packages/xpclr/util.py:16: H5pyDeprecationWarning: The default file mode will change to 'r' (read-only) in h5py 3.0. To suppress this warning, pass the mode you need to h5py.File(), or set the global default h5.get_config().default_file_mode, or set the environment variable H5PY_DEFAULT_READONLY=1. Available modes are: 'r', 'r+', 'w', 'w-'/'x', 'a'. See the docs for details.
  samples_x = h5py.File(hdf5_fn)[chrom]["samples"][:]
Traceback (most recent call last):
  File "/PATH/TO/Andrea/myanaconda/xpclr/bin/xpclr", line 196, in <module>
    main()
  File "/PATH/TO/Andrea/myanaconda/xpclr/bin/xpclr", line 113, in main
    gdistkey=args.gdistkey)
  File "/PATH/TO/Andrea/myanaconda/xpclr/lib/python3.6/site-packages/xpclr/util.py", line 16, in load_hdf5_data
    samples_x = h5py.File(hdf5_fn)[chrom]["samples"][:]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/PATH/TO/Andrea/myanaconda/xpclr/lib/python3.6/site-packages/h5py/_hl/group.py", line 264, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: "Unable to open object (object '25' doesn't exist)"

I've had a long chat with the developer of the software here https://github.com/hardingnj/xpclr/issues/49, but we couldn't figure out a solution. Do you know what might be causing the problem with my HDF5 file, and how can I work around the issue?

Thank you in advance for the help Andrea

hardingnj commented 4 years ago

Hi Andrea,

Apologies- if caused any confusion here. There isn't a software issue with scikit-allel. This is more of a "best practice" issue.

When converting VCF into hdf5 (or zarr) we should do like the following:

for contig in contigs:
    try:
        allele.vcf_to_hdf5(vcf_path, hdf5_path, region=contig, group=contig, fields="*")
    except StopIteration:
        print('no data for contig', contig)

This will create a file structure that looks like this:

- 1* (name of contig)
  - calldata*
    - GT
    - AD
    - etc
  - variants*
    - POS
    - REF
    - ALT
  - samples

* denotes group rather than a dataset.

When developing xpclr, I/we had the convention of storing samples at the same level as contig. This was a hangover from vcfnp, which predated scikit-allel, and is now made obselete by allel.

To run xpclr using hdf5 or zarr you will need to move/copy the samples dataset to the level of the contig. To do this, you may need some familiarity with hdf5 and how allel is creating these groups and datasets, which is why I linked you here.

ie more like:

- 1* (name of contig)
  - calldata*
    - GT
    - AD
    - etc
  - variants*
    - POS
    - REF
    - ALT
- 2* (name of contig)
  - calldata*
  - variants*
- samples