Closed evgenytsydenov closed 1 year ago
Thanks for raising this. Many US based folks will be unreachable during the holidays this week. Someone should get back to you soon!
Can you show the complete error log and show the line where it fails?
C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\dataframe\core.py:3255: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
meta = self._meta[_extract_meta(key)]
C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\core.py:119: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
return func(*args2)
C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\dataframe\core.py:3255: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
meta = self._meta[_extract_meta(key)]
C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\core.py:119: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
return func(*args2)
C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\core.py:119: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
return func(*args2)
C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\core.py:119: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
return func(*args2)
C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\core.py:119: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
return func(*args2)
C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\core.py:119: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
return func(*args2)
C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\core.py:119: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
return func(*args2)
Traceback (most recent call last):
File "c:/Users/janet/Desktop/gith.py", line 63, in <module>
preprocessing(path, 40)
File "c:/Users/janet/Desktop/gith.py", line 53, in preprocessing
file.create_dataset(dataset, data.shape, data = data)
File "C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\h5py\_hl\group.py", line 136, in create_dataset
dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
File "C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\h5py\_hl\dataset.py", line 83, in make_new_dset
else base.guess_dtype(data)))
File "C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\numpy\core\_asarray.py", line 85, in asarray
return array(a, dtype, copy=False, order=order)
File "C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\array\core.py", line 1314, in __array__
x = self.compute()
File "C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\base.py", line 165, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\base.py", line 436, in compute
results = schedule(dsk, keys, **kwargs)
File "C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\threaded.py", line 81, in get
**kwargs
File "C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\local.py", line 486, in get_async
raise_exception(exc, tb)
File "C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\local.py", line 316, in reraise
raise exc
File "C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\local.py", line 222, in execute_task
result = _execute_task(task, data)
File "C:\Users\janet\.conda\envs\tf_2.0\lib\site-packages\dask\core.py", line 119, in _execute_task
return func(*args2)
IndexError: index 40858 is out of bounds for axis 0 with size 10552
Can you make a minimal example that reproduces the error? http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports
The file "input_values.csv" has shape (500000, 407) and "output_values.csv" has shape (500000, 695).
This code:
import dask.dataframe as dd
import h5py
# train_test_split is imported elsewhere in the original script;
# its source module is not shown in this thread.

def preprocessing():
    # Read both CSV files lazily; neither has a header row.
    input_file = dd.read_csv(r'C:\temp\input_values.csv', header=None)
    output_file = dd.read_csv(r'C:\temp\output_values.csv', header=None)

    # lengths=True computes the chunk sizes so the arrays have known shapes.
    input_train, input_test, output_train, output_test = train_test_split(
        input_file.to_dask_array(lengths=True),
        output_file.to_dask_array(lengths=True),
        test_size=0.2)

    path_to_output = r'C:\temp\test.hdf5'
    with h5py.File(path_to_output, 'w') as file:
        file.create_dataset('input_train', input_train.shape, data=input_train)
        file.create_dataset('input_test', input_test.shape, data=input_test)
        file.create_dataset('output_train', output_train.shape, data=output_train)
        file.create_dataset('output_test', output_test.shape, data=output_test)

if __name__ == "__main__":
    preprocessing()
gives the same IndexError:
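Separately from the error itself, the traceback above shows `create_dataset(..., data=...)` going through `numpy.asarray`, which calls `compute()` on the whole Dask array at once. As a sketch of an alternative write path, `dask.array.to_hdf5` stores an array chunk by chunk. The array below is random stand-in data, the slicing is only a placeholder for a real split, and the file name is made up for illustration. This does not by itself address the IndexError, since any failing task in the graph would still run.

```python
import os
import tempfile

import dask.array as da
import h5py

# Stand-in data (assumption): a small random array instead of the real CSV contents.
x = da.random.random((1000, 8), chunks=(250, 8))

# Placeholder split by slicing; the thread uses train_test_split instead.
train, test = x[:800], x[800:]

# Hypothetical output location in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "example.hdf5")

# Write both arrays; each Dask chunk is stored as it is computed.
da.to_hdf5(path, {"/train": train, "/test": test})

# Read the shapes back to confirm what was written.
with h5py.File(path, "r") as f:
    shapes = {name: f[name].shape for name in f}
print(shapes)
```

This route needs arrays with known chunk sizes, which `to_dask_array(lengths=True)` already provides in the script above.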
Is reading from a file essential to reproducing things, or can you replicate it without that?
I have to read the data from files because they are too big to open. It's difficult to copy data out of them and fill in the dataframes manually. When I use smaller files with the same data format, the method works well.
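One way to share a reproducer without the original files might be to generate stand-in CSVs with the same layout. The sketch below uses only the standard library; `write_random_csv` is a hypothetical helper, the values are random rather than real data, and the row count is an arbitrary small number. The column counts (407 and 695) are the ones stated earlier in this thread.

```python
import csv
import os
import random
import tempfile


def write_random_csv(path, n_rows, n_cols, seed=0):
    """Write a header-less CSV of random floats (hypothetical helper)."""
    rng = random.Random(seed)
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for _ in range(n_rows):
            writer.writerow([round(rng.random(), 6) for _ in range(n_cols)])


# Scaled-down stand-ins: same column counts as the real files, far fewer rows.
tmp_dir = tempfile.mkdtemp()
input_path = os.path.join(tmp_dir, "input_values.csv")
output_path = os.path.join(tmp_dir, "output_values.csv")
write_random_csv(input_path, n_rows=1000, n_cols=407, seed=1)
write_random_csv(output_path, n_rows=1000, n_cols=695, seed=2)
```

Since smaller files reportedly work, the row count may need to grow before the error shows up. It is also possible that the failure depends on the file being split into more than one partition, in which case lowering the `blocksize` argument of `dd.read_csv` could trigger it on a small file; that is a guess, not something confirmed here.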
So to confirm
Anything you can do to simplify this would be welcome. I'd recommend simplifying your original script (can you remove the "delete errors" section?). http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports has some tips.
I have two files that I want to preprocess before ANN training. Each file is about 3 GB, so I decided to use Dask. The shape of the input file is (500000, 410) and the shape of the output file is (500000, 695).
I need to:
The code:
And it gives 'IndexError: index 74442 is out of bounds for axis 0 with size 13987'
But if I change 'train_test_split' to:
it will finish successfully.
Why does this happen? Is it correct to drop rows with errors this way? I ask because it raises 'UserWarning: Boolean Series key will be reindexed to match DataFrame index.' at each step.
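For context on the warning: pandas emits it when a DataFrame is indexed with a boolean Series whose index is not identical to the DataFrame's index. The mask is then aligned by label before filtering, so the selected rows can differ from a purely positional selection. The toy frame below is made up for illustration; whether this alignment is connected to the IndexError is not established in this thread.

```python
import warnings

import pandas as pd

# Toy data (assumption): three rows labelled 0, 1, 2.
df = pd.DataFrame({"value": [10, 20, 30]}, index=[0, 1, 2])

# A boolean mask with the same labels as df, but in a different order.
mask = pd.Series([True, False, False], index=[2, 1, 0])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # The Series key is reindexed to df.index first, so label 2 is selected.
    by_label = df[mask]

# A plain boolean array carries no labels, so the first row is selected.
by_position = df[mask.to_numpy()]

print(by_label.index.tolist())
print(by_position.index.tolist())
print([str(w.message) for w in caught])
```

If the mask was built from a different frame than the one being filtered, the two selections can disagree, which is worth checking before trusting the filtered result.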