alteryx / featuretools

An open source python library for automated feature engineering
https://www.featuretools.com
BSD 3-Clause "New" or "Revised" License

ValueError: Sample larger than population or is negative, and then Fatal Python error #2370

Open dehiker opened 1 year ago

dehiker commented 1 year ago

When I tried to run calculate_feature_matrix in chunks, I kept encountering a ValueError, usually followed by a fatal Python error. Note that this error only occurred after some chunks had already been calculated, and no error showed up if I restarted the Python script and continued from where it failed. Please see below for the full trace info.

2022-11-10 15:31:53,351 - distributed.worker_memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 3.04 GiB -- Worker memory limit: 3.79 GiB
Traceback (most recent call last):
  File "/home/zzz/python/test.py", line 306, in ft_test
    feature_matrix_ = ft.calculate_feature_matrix(
  File "/home/zzz/.conda/envs/test/lib/python3.9/site-packages/featuretools/computational_backends/calculate_feature_matrix.py", line 316, in calculate_feature_matrix
    feature_matrix = parallel_calculate_chunks(
  File "/home/zzz/.conda/envs/test/lib/python3.9/site-packages/featuretools/computational_backends/calculate_feature_matrix.py", line 792, in parallel_calculate_chunks
    client.replicate([_es, _saved_features])
  File "/home/zzz/.conda/envs/test/lib/python3.9/site-packages/distributed/client.py", line 3481, in replicate
    return self.sync(
  File "/home/zzz/.conda/envs/test/lib/python3.9/site-packages/distributed/utils.py", line 338, in sync
    return sync(
  File "/home/zzz/.conda/envs/test/lib/python3.9/site-packages/distributed/utils.py", line 405, in sync
    raise exc.with_traceback(tb)
  File "/home/zzz/.conda/envs/test/lib/python3.9/site-packages/distributed/utils.py", line 378, in f
    result = yield future
  File "/home/zzz/.conda/envs/test/lib/python3.9/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/home/zzz/.conda/envs/test/lib/python3.9/site-packages/distributed/client.py", line 3439, in _replicate
    await self.scheduler.replicate(
  File "/home/zzz/.conda/envs/test/lib/python3.9/site-packages/distributed/core.py", line 1153, in send_recv_from_rpc
    return await send_recv(comm=comm, op=key, **kwargs)
  File "/home/zzz/.conda/envs/test/lib/python3.9/site-packages/distributed/core.py", line 943, in send_recv
    raise exc.with_traceback(tb)
  File "/home/zzz/.conda/envs/test/lib/python3.9/site-packages/distributed/core.py", line 769, in _handle_comm
    result = await result
  File "/home/zzz/.conda/envs/test/lib/python3.9/site-packages/distributed/scheduler.py", line 5781, in replicate
    for ws in random.sample(tuple(workers - ts.who_has), count):
  File "/home/zzz/.conda/envs/test/lib/python3.9/random.py", line 449, in sample
    raise ValueError("Sample larger than population or is negative")
ValueError: Sample larger than population or is negative
2022-11-10 15:31:53,461 - distributed.worker_memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 2.99 GiB -- Worker memory limit: 3.79 GiB
Exception in thread AsyncProcess Dask Worker process (from Nanny) watch process join:
Traceback (most recent call last):
  File "/home/zzz/.conda/envs/test/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/zzz/.conda/envs/test/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zzz/.conda/envs/test/lib/python3.9/site-packages/distributed/process.py", line 236, in _watch_process
    assert exitcode is not None
AssertionError
Exception in thread AsyncProcess Dask Worker process (from Nanny) watch process join:
Traceback (most recent call last):
  File "/home/zzz/.conda/envs/test/lib/python3.9/threading.py", line 980, in _bootstrap_inner
Using EntitySet persisted on the cluster as dataset EntitySet-a3d41f24f216a89dd794828f2871b580
    self.run()
Fatal Python error: _enter_buffered_busy: could not acquire lock for <_io.BufferedWriter name=''> at interpreter shutdown, possibly due to daemon threads
Python runtime state: finalizing (tstate=0x17a50a0)
Current thread 0x00007f992d262280 (most recent call first):
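For what it's worth, the final ValueError comes from the standard library: the scheduler's replicate calls random.sample(tuple(workers - ts.who_has), count), and random.sample raises exactly this error when asked for more items than the population contains. A minimal, stdlib-only sketch of that situation (the worker address and counts are made up, not taken from my cluster):

```python
import random

# Hypothetical scenario: the scheduler wants to replicate data to `count`
# additional workers, but fewer than `count` workers remain (for example
# after workers exceeding their memory limit were killed).
remaining_workers = ("tcp://127.0.0.1:40001",)  # only one worker left
count = 3                                       # replicas still requested

try:
    random.sample(remaining_workers, count)
except ValueError as err:
    msg = str(err)
    print(msg)  # Sample larger than population or is negative
```

So my guess is that the set of live workers shrank between chunks (consistent with the memory warnings), and the replication count was no longer satisfiable.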

Also, possibly relevant to the fatal error above: the log contains a large number of fragmented-DataFrame and unmanaged-memory warnings:

/home/zzz/.conda/envs/test/lib/python3.9/site-packages/featuretools/computational_backends/feature_set_calculator.py:938: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  return data.assign(**new_cols)
2022-11-11 09:39:14,505 - distributed.worker_memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 3.15 GiB -- Worker memory limit: 3.79 GiB
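To illustrate what that PerformanceWarning is about: pandas warns when a frame is built by adding columns one at a time, and suggests joining all new columns in a single pd.concat instead. A toy comparison (the column names are illustrative, not featuretools internals):

```python
import numpy as np
import pandas as pd

base = pd.DataFrame({"id": range(5)})

# Column-by-column assignment: each new column is stored as its own
# internal block, which fragments the frame when repeated many times
# and eventually triggers the PerformanceWarning.
fragmented = base.copy()
for i in range(3):
    fragmented[f"feat_{i}"] = np.arange(5) * i

# Single concat: all new columns are joined in one step, producing a
# consolidated frame with the same contents.
new_cols = {f"feat_{i}": np.arange(5) * i for i in range(3)}
combined = pd.concat([base, pd.DataFrame(new_cols)], axis=1)

assert combined.equals(fragmented)
```

The warning itself is harmless, but it may be related to the high unmanaged memory reported by the workers.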

Any ideas would be highly appreciated! Best regards!

sbadithe commented 1 year ago

Hi, could you list the versions of Featuretools, woodwork, dask[dataframe], and distributed you are using? Thanks!