dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

train_test_split in dask_ml - IndexError: index is out of bounds #589

Closed: evgenytsydenov closed this issue 1 year ago

evgenytsydenov commented 4 years ago

I have two files that I want to preprocess before ANN training. Each file is about 3 GB, so I decided to use Dask. The input file has shape (500000, 410) and the output file has shape (500000, 695).

I need to:

  1. drop the rows marked as errors,
  2. split the data into train/test/validation sets,
  3. scale the input values,
  4. save everything to HDF5.

The code:

```python
import os
import random

import dask.dataframe as dd
import h5py
import numpy as np
from dask_ml.model_selection import train_test_split
from dask_ml.preprocessing import StandardScaler


def preprocessing(path, random_seed):
    np.random.seed(random_seed)
    random.seed(random_seed)
    input_file = dd.read_csv(
        os.path.join(path, 'input_values.csv'),
        header=None, sep=';', dtype=np.float64)

    output_file = dd.read_csv(
        os.path.join(path, 'output_values.csv'),
        header=None, sep=';', dtype=np.float64)

    # Delete rows marked as errors (first output column == 0)
    errors = output_file[0] == 0
    errors = errors.compute().reset_index()[0]
    input_data = input_file[~errors].iloc[:, :-1].to_dask_array(lengths=True)
    output_data = output_file[~errors].iloc[:, :-1].to_dask_array(lengths=True)

    # Split into train/test/validation sets (60/20/20)
    input_train, input_test, output_train, output_test = train_test_split(
        input_data, output_data, random_state=random_seed,
        shuffle=True, test_size=0.2)
    input_train, input_val, output_train, output_val = train_test_split(
        input_train, output_train, random_state=random_seed,
        shuffle=True, test_size=0.25)

    # Scale the inputs using statistics from the training set only
    scaler = StandardScaler()
    scaler.fit(input_train)
    data_scaled = {
        'input_train': scaler.transform(input_train),
        'input_test': scaler.transform(input_test),
        'input_val': scaler.transform(input_val),
        'output_train': output_train,
        'output_test': output_test,
        'output_val': output_val,
    }

    # Save the datasets and the scaler parameters to HDF5
    file_name = os.path.basename(os.path.normpath(path))
    path_to_output = os.path.join(path, file_name) + '.hdf5'
    with h5py.File(path_to_output, 'w') as file:
        for dataset in data_scaled.keys():
            data = data_scaled[dataset]
            file.create_dataset(dataset, data.shape, data=data)

        metadata = {
            'scaler_mean': scaler.mean_,
            'scaler_scale': scaler.scale_,
        }
        file.attrs.update(metadata)
```

And it gives `IndexError: index 74442 is out of bounds for axis 0 with size 13987`.

But if I replace `train_test_split` with manual slicing:

```python
# Manual 60/20/20 split by row position instead of train_test_split
sep_train = int(round(0.6 * input_data.shape[0]))
sep_test = int(round(0.8 * input_data.shape[0]))
input_train = input_data[:sep_train, :]
input_test = input_data[sep_train:sep_test, :]
input_val = input_data[sep_test:, :]
output_train = output_data[:sep_train, :]
output_test = output_data[sep_train:sep_test, :]
output_val = output_data[sep_test:, :]
```

the script finishes successfully.

Why is that? And is it correct to drop the error rows this way? It throws `UserWarning: Boolean Series key will be reindexed to match DataFrame index.` at each step.
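
The only alternative I can think of is to select the valid rows positionally on the Dask arrays instead of boolean-indexing the DataFrames, which should at least avoid the reindexing warning. An untested sketch (not what I currently run):

```python
import numpy as np

# The mask is a single column of ~500k booleans, so it fits in memory.
bad = (output_file[0] == 0).compute().to_numpy()
good_idx = np.flatnonzero(~bad)  # positions of the rows to keep

# dask.array supports integer-array indexing along a single axis,
# so the row selection itself stays lazy.
input_data = input_file.to_dask_array(lengths=True)[good_idx][:, :-1]
output_data = output_file.to_dask_array(lengths=True)[good_idx][:, :-1]
```

Would that be the recommended way to drop rows?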

jacobtomlinson commented 4 years ago

Thanks for raising this. Many US-based folks will be unreachable during the holidays this week. Someone should get back to you soon!

datajanko commented 4 years ago

Can you show the complete error log, including the line where it fails?

evgenytsydenov commented 4 years ago

> Can you show the complete error log, including the line where it fails?

```python-traceback
C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\dataframe\core.py:3255: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  meta = self._meta[_extract_meta(key)]
C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\core.py:119: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  return func(*args2)
C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\dataframe\core.py:3255: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  meta = self._meta[_extract_meta(key)]
C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\core.py:119: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  return func(*args2)
C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\core.py:119: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  return func(*args2)
C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\core.py:119: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  return func(*args2)
C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\core.py:119: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  return func(*args2)
C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\core.py:119: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  return func(*args2)
C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\core.py:119: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  return func(*args2)
Traceback (most recent call last):
  File "c:/Users/janet/Desktop/gith.py", line 63, in <module>
    preprocessing(path, 40)
  File "c:/Users/janet/Desktop/gith.py", line 53, in preprocessing
    file.create_dataset(dataset, data.shape, data = data)
  File "C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\h5py\_hl\group.py", line 136, in create_dataset
    dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
  File "C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\h5py\_hl\dataset.py", line 83, in make_new_dset
    else base.guess_dtype(data)))
  File "C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\numpy\core\_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)
  File "C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\array\core.py", line 1314, in __array__
    x = self.compute()
  File "C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\base.py", line 165, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\base.py", line 436, in compute
    results = schedule(dsk, keys, **kwargs)
  File "C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\threaded.py", line 81, in get
    **kwargs
  File "C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\local.py", line 486, in get_async
    raise_exception(exc, tb)
  File "C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\local.py", line 316, in reraise
    raise exc
  File "C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\local.py", line 222, in execute_task
    result = _execute_task(task, data)
  File "C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\core.py", line 119, in _execute_task
    return func(*args2)
IndexError: index 40858 is out of bounds for axis 0 with size 10552
```

TomAugspurger commented 4 years ago

Can you make a minimal example that reproduces the error? http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

evgenytsydenov commented 4 years ago

> Can you make a minimal example that reproduces the error? http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

File "input_values.csv" has shape (500000, 407) and "output_values.csv" - (500000, 695).

This code:

```python
import dask.dataframe as dd
import h5py
from dask_ml.model_selection import train_test_split


def preprocessing():
    input_file = dd.read_csv(r'C:\temp\input_values.csv', header=None)
    output_file = dd.read_csv(r'C:\temp\output_values.csv', header=None)

    input_train, input_test, output_train, output_test = train_test_split(
        input_file.to_dask_array(lengths=True),
        output_file.to_dask_array(lengths=True),
        test_size=0.2)

    path_to_output = r'C:\temp\test.hdf5'
    with h5py.File(path_to_output, 'w') as file:
        file.create_dataset('input_train', input_train.shape, data=input_train)
        file.create_dataset('input_test', input_test.shape, data=input_test)
        file.create_dataset('output_train', output_train.shape, data=output_train)
        file.create_dataset('output_test', output_test.shape, data=output_test)


if __name__ == "__main__":
    preprocessing()
```

gives the same IndexError:

```python-traceback
Traceback (most recent call last):
  File "c:/Users/janet/Desktop/gith.py", line 27, in <module>
    preprocessing()
  File "c:/Users/janet/Desktop/gith.py", line 23, in preprocessing
    file.create_dataset('output_train', output_train.shape, data = output_train)
  File "C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\h5py\_hl\group.py", line 136, in create_dataset
    dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
  File "C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\h5py\_hl\dataset.py", line 83, in make_new_dset
    else base.guess_dtype(data)))
  File "C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\numpy\core\_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)
  File "C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\array\core.py", line 1314, in __array__
    x = self.compute()
  File "C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\base.py", line 165, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\base.py", line 436, in compute
    results = schedule(dsk, keys, **kwargs)
  File "C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\threaded.py", line 81, in get
    **kwargs
  File "C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\local.py", line 486, in get_async
    raise_exception(exc, tb)
  File "C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\local.py", line 316, in reraise
    raise exc
  File "C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\local.py", line 222, in execute_task
    result = _execute_task(task, data)
  File "C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\core.py", line 119, in _execute_task
    return func(*args2)
IndexError: index 43989 is out of bounds for axis 0 with size 10552
```

TomAugspurger commented 4 years ago

Is reading from a file essential to reproducing things, or can you replicate it without that?

evgenytsydenov commented 4 years ago

> Is reading from a file essential to reproducing things, or can you replicate it without that?

I have to read the data from files because they are too big to open and copy from by hand; it would be difficult to fill in the DataFrames manually. When I use smaller files with the same data format, the method works fine.
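
If generating data is acceptable, a file of the same shape as mine could be created with something like this (a sketch; the values are random rather than my real data, so I can't guarantee it reproduces the error):

```python
import numpy as np
import pandas as pd

# Write a CSV with the same shape as "input_values.csv" in chunks,
# to avoid building the whole 500000 x 407 table in memory at once.
with open(r'C:\temp\input_values.csv', 'w') as f:
    for _ in range(50):
        chunk = pd.DataFrame(np.random.random((10_000, 407)))
        chunk.to_csv(f, header=False, index=False)
```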

TomAugspurger commented 4 years ago

So, to confirm:

  1. You can't reproduce it with in-memory datasets (regardless of size, perhaps because you run out of RAM)?
  2. You can't reproduce it with smaller on-disk datasets?

Anything you can do to simplify this would be welcome. I'd recommend simplifying your original script (can you remove the "delete errors" section?). http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports has some tips.
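
For example, something along these lines might do it (a sketch: random arrays stand in for the CSV data, and the 10552 chunk size is just a guess based on the size reported in your traceback):

```python
import dask.array as da
from dask_ml.model_selection import train_test_split

# Random in-memory data with the same shapes as your files; the chunking
# is meant to mimic what to_dask_array(lengths=True) produced.
X = da.random.random((500_000, 407), chunks=(10_552, 407))
y = da.random.random((500_000, 695), chunks=(10_552, 695))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# If the bug is independent of read_csv, computing a split should
# raise the same IndexError here.
X_train.compute()
```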