P20WCommunityOfInnovation / DAR-T


Resolve issue with memory limit error #87

Closed: sethltaylor closed this issue 8 months ago

sethltaylor commented 8 months ago

There is a memory limit error triggered by a district-level file (7 MB in size). @aemnathanclinton to add the traceback and details on the function configuration that caused the error, for further debugging.

aemnathanclinton commented 8 months ago

File "C:\Users\nclinton\AppData\Roaming\Python\Python310\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 535, in _run_script exec(code, module.dict) File "C:\Repos\DAR-T\app\dart_app.py", line 136, in df_redacted = anonymizer.apply_anonymization() File "C:\Users\nclinton\AppData\Roaming\Python\Python310\site-packages\dar_tool\suppression_check.py", line 530, in apply_anonymization self.cross_suppression() File "C:\Users\nclinton\AppData\Roaming\Python\Python310\site-packages\dar_tool\suppression_check.py", line 429, in cross_suppression df_primary = self.df_log.merge(df_parent_list, on = self.organization_columns + list_combination, how='left') File "C:\Users\nclinton\AppData\Roaming\Python\Python310\site-packages\pandas\core\frame.py", line 9351, in merge return merge( File "C:\Users\nclinton\AppData\Roaming\Python\Python310\site-packages\pandas\core\reshape\merge.py", line 122, in merge return op.get_result() File "C:\Users\nclinton\AppData\Roaming\Python\Python310\site-packages\pandas\core\reshape\merge.py", line 725, in get_result result_data = concatenate_managers( File "C:\Users\nclinton\AppData\Roaming\Python\Python310\site-packages\pandas\core\internals\concat.py", line 242, in concatenate_managers values = _concatenate_join_units(join_units, concat_axis, copy=copy) File "C:\Users\nclinton\AppData\Roaming\Python\Python310\site-packages\pandas\core\internals\concat.py", line 545, in _concatenate_join_units to_concat = [ File "C:\Users\nclinton\AppData\Roaming\Python\Python310\site-packages\pandas\core\internals\concat.py", line 546, in ju.get_reindexed_values(empty_dtype=empty_dtype, upcasted_na=upcasted_na) File "C:\Users\nclinton\AppData\Roaming\Python\Python310\site-packages\pandas\core\internals\concat.py", line 525, in get_reindexed_values values = algos.take_nd(values, indexer, axis=ax) File "C:\Users\nclinton\AppData\Roaming\Python\Python310\site-packages\pandas\core\array_algos\take.py", line 117, in take_nd return _take_nd_ndarray(arr, indexer, axis, fill_value, allow_fill) File "C:\Users\nclinton\AppData\Roaming\Python\Python310\site-packages\pandas\core\array_algos\take.py", line 158, in _take_nd_ndarray out = np.empty(out_shape, dtype=dtype) numpy.core._exceptions._ArrayMemoryError: Unable to allocate 35.5 GiB for an array with shape (67, 71074338) and data type float64

sethltaylor commented 8 months ago

@aemnathanclinton Thanks for adding. It looks like something is going wrong in a merge in cross_suppression, or the step before it: we end up with a dataframe of roughly 71 million rows. What were the inputs to DataAnonymizer?

@VT-AOE-DMAD-Drew-Bennett I'm thinking this has something to do with list_combination in the merge keys. Maybe a many-to-many join running wild? See the sketch below.
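For anyone debugging this later, here is a minimal sketch (hypothetical data, not the actual DAR-T inputs) of how duplicate keys on both sides of a left merge multiply the row count, and how pandas' `validate` argument can catch the problem before the allocation blows up:

```python
import pandas as pd

# Hypothetical key column standing in for organization_columns + list_combination.
# 1,000 rows on each side, all sharing the same key value.
left = pd.DataFrame({"key": ["a"] * 1_000, "x": range(1_000)})
right = pd.DataFrame({"key": ["a"] * 1_000, "y": range(1_000)})

# Each of the 1,000 left rows matches all 1,000 right rows,
# so the result has 1,000,000 rows: a many-to-many explosion.
merged = left.merge(right, on="key", how="left")
print(len(merged))  # 1000000

# validate="many_to_one" asserts the right side is unique on the key,
# raising a MergeError instead of silently building a huge frame.
try:
    left.merge(right, on="key", how="left", validate="many_to_one")
except pd.errors.MergeError as e:
    print(e)
```

If df_parent_list is supposed to be unique on organization_columns + list_combination, adding validate="many_to_one" to the merge at suppression_check.py line 429 would turn this out-of-memory crash into an immediate, readable error.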

aemnathanclinton commented 8 months ago

Here are the parameters for the run.

[screenshot: DataAnonymizer parameters used for the run]