MLD3 / FIDDLE

FlexIble Data-Driven pipeLinE – a preprocessing pipeline that transforms structured EHR data into feature vectors to be used with ML algorithms. https://doi.org/10.1093/jamia/ocaa139
http://tiny.cc/get_FIDDLE
MIT License

Error when not discretizing MIMIC-III time-series data - TypeError: bad operand type for unary ~: 'float' #11

Open MattHodgman opened 1 year ago

MattHodgman commented 1 year ago

I am running FIDDLE on data extracted from MIMIC-III using the pipeline outlined in FIDDLE-experiments. I have my population of ICU stays and am running FIDDLE with these parameters:

--T=240.0 --dt=1.0 --theta_1=0.003 --theta_2=0.003 --theta_freq=1 --stats_functions 'mean'

and other default ones found in run_make_all.sh.

I get the following error:

Traceback (most recent call last):  
  File "/home/hodgman/miniconda3/envs/FIDDLE-env/lib/python3.7/runpy.py", line 193, in _run_module_as_main  
    "__main__", mod_spec)  
  File "/home/hodgman/miniconda3/envs/FIDDLE-env/lib/python3.7/runpy.py", line 85, in _run_code  
    exec(code, run_globals)  
  File "/home/hodgman/FIDDLE-experiments/FIDDLE/FIDDLE/run.py", line 141, in <module>  
    main()  
  File "/home/hodgman/FIDDLE-experiments/FIDDLE/FIDDLE/run.py", line 138, in main  
    X, X_feature_names, X_feature_aliases = FIDDLE_steps.process_time_dependent(df_time_series, args)  
  File "/home/hodgman/FIDDLE-experiments/FIDDLE/FIDDLE/steps.py", line 244, in process_time_dependent  
    X_all, X_all_feature_names, X_discretization_bins = map_time_series_features(df_time_series, dtypes_time_series, args)  
  File "/home/hodgman/FIDDLE-experiments/FIDDLE/FIDDLE/steps.py", line 604, in map_time_series_features  
    df.loc[~numeric_mask, col] = np.nan  
  File "/home/hodgman/miniconda3/envs/FIDDLE-env/lib/python3.7/site-packages/pandas/core/generic.py", line 1532, in __invert__  
    new_data = self._mgr.apply(operator.invert)  
  File "/home/hodgman/miniconda3/envs/FIDDLE-env/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 325, in apply  
    applied = b.apply(f, **kwargs)  
  File "/home/hodgman/miniconda3/envs/FIDDLE-env/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 381, in apply  
    result = func(self.values, **kwargs)  
TypeError: bad operand type for unary ~: 'float'

Do you know what could be causing this error? I was able to determine that it first occurs in column 225958, and that numeric_mask contains at least one NaN value for that column. That should mean column 225958 contains None values. However, in my input_data.p file there are no None or NaN variable_values for variable_name == '225958'.

shengpu-tang commented 1 year ago

Hello, numeric_mask is generated by applying the is_numeric function from helpers.py (https://github.com/MLD3/FIDDLE/blob/master/FIDDLE/helpers.py#L191) on this line: https://github.com/MLD3/FIDDLE/blob/master/FIDDLE/steps.py#L601

I agree with your logic, so it is indeed surprising that input_data.p does not contain None/NaN while numeric_mask contains NaN. Perhaps you could try a small example, with and without NaNs, and apply the is_numeric function to that column?
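A small check along those lines might look like the sketch below. Note that is_numeric_standin is a hypothetical stand-in written for illustration, not the actual is_numeric from helpers.py, and the two toy columns are made-up values rather than MIMIC-III data.

```python
import numpy as np
import pandas as pd


def is_numeric_standin(value):
    """Hypothetical stand-in for FIDDLE's is_numeric (an assumption, not the real code)."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


# Toy columns (made-up values): one without missing entries, one with NaN and None.
col_without_nan = pd.Series(["1.5", "abc", 3], dtype=object)
col_with_nan = pd.Series(["1.5", "abc", np.nan, None], dtype=object)

mask_without_nan = col_without_nan.apply(is_numeric_standin)
mask_with_nan = col_with_nan.apply(is_numeric_standin)

# Series.apply calls the function on every element, missing or not,
# so both masks should come back as plain boolean Series.
print(mask_without_nan.dtype, mask_with_nan.dtype)

# Inverting a purely boolean mask works without error.
inverted = ~mask_with_nan
print(inverted.tolist())
```

If the mask built this way is purely boolean even when the column has missing values, then the NaN in numeric_mask is probably introduced somewhere other than the apply call itself.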

MattHodgman commented 1 year ago

is_numeric works when I extract the 225958 feature column from input_data.p to col_data and run

numeric_mask = col_data.apply(is_numeric)

numeric_mask only contains True and False values. When I switch one of these booleans to np.nan or a float I can reproduce the error. I'm going to see if I can extract the ts_mixed dataframe from https://github.com/MLD3/FIDDLE/blob/master/FIDDLE/steps.py#L594 and look at feature 225958.
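For reference, the reproduction described above can be sketched roughly as follows. The mask values are made up for illustration. The only assumption is that the mask ends up as an object-dtype Series mixing booleans with a float NaN.

```python
import numpy as np
import pandas as pd

# Made-up mask: object dtype, with one float NaN among the booleans.
mask = pd.Series([np.nan, True, False], dtype=object)

# `~` on an object-dtype Series is applied element by element,
# and Python floats (including NaN) do not support unary `~`.
try:
    inverted = ~mask
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# Locate the offending entries so they can be traced back to rows of the source frame.
bad_rows = mask[mask.isna()]
print(bad_rows.index.tolist())
```

Running the same isna() check on the mask built inside map_time_series_features should show which rows of ts_mixed carry the unexpected NaN.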