Scyfer / fuel

A data pipeline framework for machine learning
MIT License

BalancedSamplingScheme does not work if load_in_memory=False #13

Open aukejw opened 8 years ago

aukejw commented 8 years ago

Apparently, h5py datasets do not support fancy indexing of the form indexable[np.array([0, 0, 1, 1]), ...] when the index array contains duplicates:

    data[indices] = indexable[numpy.array(request)[indices], ...]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/tmp/pip-build-uaMqRR/h5py/h5py/_objects.c:2574)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/tmp/pip-build-uaMqRR/h5py/h5py/_objects.c:2533)
  File "/home/auke/.virtualenvs/scyfernn/local/lib/python2.7/site-packages/h5py/_hl/dataset.py", line 451, in __getitem__
    self.id.read(mspace, fspace, arr, mtype)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/tmp/pip-build-uaMqRR/h5py/h5py/_objects.c:2574)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/tmp/pip-build-uaMqRR/h5py/h5py/_objects.c:2533)
  File "h5py/h5d.pyx", line 177, in h5py.h5d.DatasetID.read (/tmp/pip-build-uaMqRR/h5py/h5py/h5d.c:3123)
  File "h5py/_proxy.pyx", line 130, in h5py._proxy.dset_rw (/tmp/pip-build-uaMqRR/h5py/h5py/_proxy.c:1769)
  File "h5py/_proxy.pyx", line 84, in h5py._proxy.H5PY_H5Dread (/tmp/pip-build-uaMqRR/h5py/h5py/_proxy.c:1411)
IOError: Can't read data (Src and dest data spaces have different sizes)

We'll need to find a workaround, warn the user that load_in_memory must be True, or both.
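One possible workaround, sketched below purely as an illustration (the helper name read_with_duplicates, the file name data.hdf5 and the dataset name features are made up, not part of the project): read the sorted, duplicate-free indices through h5py and re-expand the result in memory with numpy.

```python
import numpy as np
import h5py

def read_with_duplicates(dataset, request):
    """Read the rows in `request` (possibly unordered, possibly containing
    duplicates) from an h5py dataset that only accepts sorted,
    duplicate-free fancy indices."""
    request = np.asarray(request)
    # `unique` is sorted and duplicate-free, so h5py accepts it;
    # `inverse` maps each requested position back into `unique`.
    unique, inverse = np.unique(request, return_inverse=True)
    unique_rows = dataset[unique, ...]
    # Re-expand in memory to restore the requested order and duplicates.
    return unique_rows[inverse, ...]

# Hypothetical usage:
with h5py.File("data.hdf5", "r") as f:
    batch = read_with_duplicates(f["features"], [0, 0, 1, 1])
```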

markusnagel commented 8 years ago

Interesting. Though this seems to be a bug in H5PYDataset rather than in our sampling scheme. Does it only occur when the same index (i.e. the same datapoint) accidentally appears multiple times in the same request (i.e. batch)?

aukejw commented 8 years ago

It only seems to occur when there are duplicates in the request.
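A minimal reproduction of that behaviour might look like this (file and dataset names are hypothetical):

```python
import numpy as np
import h5py

with h5py.File("data.hdf5", "r") as f:
    dset = f["features"]
    ok = dset[np.array([0, 1, 2]), ...]      # unique, sorted indices: works
    bad = dset[np.array([0, 0, 1, 1]), ...]  # duplicate indices: raises IOError
```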

markusnagel commented 8 years ago

Maybe we should create an issue in mila/fuel? To me it seems general enough that it is not specific to our sampling scheme: any random sampling scheme that allows sampling with replacement will hit the same problem. Of course it is quite unlikely for big datasets, but the smaller the dataset, the bigger the problem gets...
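For a rough sense of how often this bites, here is a back-of-the-envelope sketch (not project code) of the probability that a batch drawn with replacement contains at least one duplicate index, using an assumed batch size of 128:

```python
def prob_duplicate(n, b):
    """P(at least one repeated index among b draws with replacement
    from a dataset of n examples)."""
    p_all_distinct = 1.0
    for k in range(b):
        p_all_distinct *= (n - k) / float(n)
    return 1.0 - p_all_distinct

print(prob_duplicate(n=1000000, b=128))  # big dataset: under 1% per batch
print(prob_duplicate(n=1000, b=128))     # small dataset: practically certain
```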