MLD3 / FIDDLE

FlexIble Data-Driven pipeLinE – a preprocessing pipeline that transforms structured EHR data into feature vectors to be used with ML algorithms. https://doi.org/10.1093/jamia/ocaa139
http://tiny.cc/get_FIDDLE
MIT License

Having trouble processing ICD codes #5

Open bwang482 opened 2 years ago

bwang482 commented 2 years ago

I am not using MIMIC-III or eICU data, and since this pipeline should be applicable to other EHR data sets, I am using it for in-house EHR data. No matter how I preprocess ICD codes (e.g. ICD9:V50.2 vs V50.2 vs V502), I always encounter the error below:

--------------------------------------------------------------------------------
2-B) Transform time-dependent data
--------------------------------------------------------------------------------
Total variables    : 31734
Traceback (most recent call last):
  File "D:\bo\envs\bd\lib\site-packages\pandas\core\indexes\base.py", line 3361, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas\_libs\index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'icd_code:0'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "D:\bo\envs\bd\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "D:\bo\envs\bd\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "D:\bo\EOBD_prediction\FIDDLE\run.py", line 141, in <module>
    main()
  File "D:\bo\EOBD_prediction\FIDDLE\run.py", line 138, in main
    X, X_feature_names, X_feature_aliases = FIDDLE_steps.process_time_dependent(df_time_series, args)
  File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 235, in process_time_dependent
    df_time_series, dtypes_time_series = transform_time_series_table(df_data_time_series, args)
  File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 430, in transform_time_series_table
    variables_num_freq = get_frequent_numeric_variables(df_in, variables, theta_freq, args)
  File "D:\bo\EOBD_prediction\FIDDLE\helpers.py", line 93, in get_frequent_numeric_variables
    numeric_vars = [col for col in variables if df_types[col] == 'Numeric']
  File "D:\bo\EOBD_prediction\FIDDLE\helpers.py", line 93, in <listcomp>
    numeric_vars = [col for col in variables if df_types[col] == 'Numeric']
  File "D:\bo\envs\bd\lib\site-packages\pandas\core\series.py", line 942, in __getitem__
    return self._get_value(key)
  File "D:\bo\envs\bd\lib\site-packages\pandas\core\series.py", line 1051, in _get_value
    loc = self.index.get_loc(label)
  File "D:\bo\envs\bd\lib\site-packages\pandas\core\indexes\base.py", line 3363, in get_loc
    raise KeyError(key) from err
KeyError: 'icd_code:0'

So my df_types contains only one ICD-related variable name, icd_code, which is correct. However, parse_variable_data_type has created a whole new list of variable names prefixed with icd_code, which is why variables has a long list of "icd_code:*" elements. The whole process is confusing and the details are vague. Could you please enlighten me on the source of the error? Many thanks.

bwang482 commented 2 years ago

Or does it mean ICD codes cannot be time-dependent variables? Surely they should be allowed?

shengpu-tang commented 2 years ago

Hello, I have just updated the code to fix this error. Please download the latest code from GitHub.

You may check out an example with data containing time-dependent ICD codes here. Please try to format your data according to this example.

Additionally, if the variable_name for your ICD code data is not "ICD9_CODE", you will need to change the following lines in your config file: https://github.com/MLD3/FIDDLE/blob/86b197fc7ac3e6e90851e4bf01279156539aaee2/tests/icd_time_test/input/config-0.yaml#L4-L5

bwang482 commented 2 years ago

Many thanks @shengpu1126 ! Can I please confirm with you:

  1. I have noticed in your icd_time_test example, your icd code is a series of letters and numbers e.g. V502 but your hierarchical_sep: ':'. My diagnosis code contains one or more dots e.g. V50.2. Should I get rid of the dots or should I set hierarchical_sep: '.' if hierarchical_sep is indeed used for this purpose?
  2. My diagnosis codes contain some non-icd codes e.g. DRG:389. Do you recommend I use hierarchical or Categorical as value_types?

bwang482 commented 2 years ago

Also, I can currently only use hierarchical_levels: [0]. If I set hierarchical_levels: [0, 1], the error below occurs even though my diagnosis codes have multiple levels.

================================================================================
2) Transform; 3) Post-filter
================================================================================

--------------------------------------------------------------------------------
*) Detecting and parsing value types
--------------------------------------------------------------------------------
Parsing hierarchical values
Traceback (most recent call last):
  File "D:\bo\envs\bd\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "D:\bo\envs\bd\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "D:\bo\EOBD_prediction\FIDDLE\run.py", line 141, in <module>
    main()
  File "D:\bo\EOBD_prediction\FIDDLE\run.py", line 131, in main
    df_data, df_types = FIDDLE_steps.parse_variable_data_type(df_data, args)
  File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 99, in parse_variable_data_type
    df_hier_level[val_col] = df_hier_level[val_col].apply(lambda h: h[min(hier_level, len(h))])
  File "D:\bo\envs\bd\lib\site-packages\pandas\core\series.py", line 4357, in apply
    return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
  File "D:\bo\envs\bd\lib\site-packages\pandas\core\apply.py", line 1043, in apply
    return self.apply_standard()
  File "D:\bo\envs\bd\lib\site-packages\pandas\core\apply.py", line 1101, in apply_standard
    convert=self.convert_dtype,
  File "pandas\_libs\lib.pyx", line 2859, in pandas._libs.lib.map_infer
  File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 99, in <lambda>
    df_hier_level[val_col] = df_hier_level[val_col].apply(lambda h: h[min(hier_level, len(h))])
IndexError: list index out of range
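
The traceback above points at the expression h[min(hier_level, len(h))]. The largest valid index of a list is len(h) - 1, so this expression raises IndexError whenever hier_level is at least len(h), that is, whenever a value has fewer levels than the level requested. The sketch below reproduces the failure and shows a clamped lookup. It assumes each value has already been split into a list of levels; pick_level is a hypothetical helper for illustration, not a FIDDLE function, and the thread does not say how the maintainers handle this case.

```python
def pick_level(h, hier_level):
    """Return the entry of h at hier_level, clamped to the deepest level available."""
    # The largest valid index of a list is len(h) - 1, not len(h).
    return h[min(hier_level, len(h) - 1)]


one_level = ["DRG"]            # a value that has only a single level
two_levels = ["V50", "V50.2"]  # a value that has two levels

# The expression from the traceback fails for the shallow value.
try:
    one_level[min(1, len(one_level))]
    failed = False
except IndexError:
    failed = True

print(failed)                     # True
print(pick_level(one_level, 1))   # DRG
print(pick_level(two_levels, 1))  # V50.2
```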

shengpu-tang commented 2 years ago

  1. I have noticed in your icd_time_test example, your icd code is a series of letters and numbers e.g. V502 but your hierarchical_sep: ':'. My diagnosis code contains one or more dots e.g. V50.2. Should I get rid of the dots or should I set hierarchical_sep: '.' if hierarchical_sep is indeed used for this purpose?

There's built-in support for ICD9/ICD10 codes through the icd9cms and icd10-cm packages, so I believe both V50.2 and V502 should work. The : separator is for other types of hierarchical values that need to be preprocessed.

My diagnosis codes contain some non-icd codes e.g. DRG:389. Do you recommend I use hierarchical or Categorical as value_types?

I am less familiar with DRG codes. Does the DRG code 389 have multiple levels, similar to how ICD9 code V502 has two levels, V50 and V50.2? If not, I think you can just treat it as a Categorical variable, for example:

ID   t  variable_name  variable_value
XXX  4  DRG:1234       1
XXX  5  ICD9_CODE      V502

Otherwise you should preprocess it and include the separator:

ID   t  variable_name  variable_value
XXX  4  DRG_CODE       12:34
XXX  5  ICD9_CODE      V502

bwang482 commented 2 years ago

Many thanks, @shengpu1126 ! I have separated ICD 9 and 10 codes from the rest, and named each coding scheme uniquely, e.g.:

    ICD9_CODE: hierarchical_ICD9
    ICD10_CODE: hierarchical_ICD10
    DRG_CODE: Categorical
    DSM4_CODE: hierarchical

I then got the error below, which was strange since '645.03' is a legitimate ICD9 code for "Prolonged pregnancy, antepartum condition or complication".

--------------------------------------------------------------------------------
*) Detecting and parsing value types
--------------------------------------------------------------------------------
Parsing hierarchical values
Traceback (most recent call last):
  File "D:\bo\envs\bd\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "D:\bo\envs\bd\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "D:\bo\EOBD_prediction\FIDDLE\run.py", line 141, in <module>
    main()
  File "D:\bo\EOBD_prediction\FIDDLE\run.py", line 131, in main
    df_data, df_types = FIDDLE_steps.parse_variable_data_type(df_data, args)
  File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 83, in parse_variable_data_type
    df_var = df_var.apply(lambda s: map_icd_hierarchy(s, version=9))
  File "D:\bo\envs\bd\lib\site-packages\pandas\core\series.py", line 4357, in apply
    return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
  File "D:\bo\envs\bd\lib\site-packages\pandas\core\apply.py", line 1043, in apply
    return self.apply_standard()
  File "D:\bo\envs\bd\lib\site-packages\pandas\core\apply.py", line 1101, in apply_standard
    convert=self.convert_dtype,
  File "pandas\_libs\lib.pyx", line 2859, in pandas._libs.lib.map_infer
  File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 83, in <lambda>
    df_var = df_var.apply(lambda s: map_icd_hierarchy(s, version=9))
  File "D:\bo\EOBD_prediction\FIDDLE\helpers.py", line 39, in map_icd_hierarchy
    raise Exception("Invalid ICD code", s)
Exception: ('Invalid ICD code', '645.03')

I then removed the dots as mentioned earlier, but the error persisted: Exception: ('Invalid ICD code', '64503').

However changing

    ICD9_CODE: hierarchical_ICD9
    ICD10_CODE: hierarchical_ICD10

to

    ICD9_CODE: hierarchical
    ICD10_CODE: hierarchical

and switching back to codes that have the separator in them (hierarchical_sep: ".") worked.
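
For codes that were stored without the dot, the separator can be restored before handing them to the generic hierarchical type. The sketch below relies on the ICD-9-CM convention that the decimal point follows the third character, except for E codes, where it follows the fourth. add_icd9_dot is a hypothetical helper, not part of FIDDLE, and it applies to ICD-9 only (ICD-10 categories are always three characters long).

```python
def add_icd9_dot(code):
    """Insert the ICD-9-CM decimal point into a code stored without one."""
    code = str(code).strip().upper()
    if "." in code:
        return code  # already has a separator
    # E codes have a four-character category (e.g. E812); all others have three.
    n = 4 if code.startswith("E") else 3
    if len(code) <= n:
        return code  # category-level code, nothing to split off
    return code[:n] + "." + code[n:]


print(add_icd9_dot("64503"))  # 645.03
print(add_icd9_dot("V502"))   # V50.2
print(add_icd9_dot("E8120"))  # E812.0
print(add_icd9_dot("250"))    # 250
```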

However, I have now encountered a new error:

--------------------------------------------------------------------------------
2-B) Transform time-dependent data
--------------------------------------------------------------------------------
Total variables    : 771
Frequent variables : []
M₁ = 0
M₂ = 771
k  = 3 ['min', 'max', 'mean']

Transforming each example...
  0%|                                                                                       | 0/200 [00:00<?, ?it/s]10000377
Traceback (most recent call last):
  File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 371, in func_encode_single_time_series
    df_j = pivot_event_table(g).reindex(columns=variables_non).sort_index()
  File "D:\bo\EOBD_prediction\FIDDLE\helpers.py", line 223, in pivot_event_table
    df_dups.loc[df_v.index, t_col] += eps * np.arange(len(df_v))
  File "D:\bo\envs\bd\lib\site-packages\pandas\core\generic.py", line 10964, in __iadd__
    return self._inplace_method(other, type(self).__add__)  # type: ignore[operator]
  File "D:\bo\envs\bd\lib\site-packages\pandas\core\generic.py", line 10941, in _inplace_method
    result = op(self, other)
  File "D:\bo\envs\bd\lib\site-packages\pandas\core\ops\common.py", line 69, in new_method
    return method(self, other)
  File "D:\bo\envs\bd\lib\site-packages\pandas\core\arraylike.py", line 92, in __add__
    return self._arith_method(other, operator.add)
  File "D:\bo\envs\bd\lib\site-packages\pandas\core\series.py", line 5526, in _arith_method
    result = ops.arithmetic_op(lvalues, rvalues, op)
  File "D:\bo\envs\bd\lib\site-packages\pandas\core\ops\array_ops.py", line 224, in arithmetic_op
    res_values = _na_arithmetic_op(left, right, op)
  File "D:\bo\envs\bd\lib\site-packages\pandas\core\ops\array_ops.py", line 166, in _na_arithmetic_op
    result = func(left, right)
  File "D:\bo\envs\bd\lib\site-packages\pandas\core\computation\expressions.py", line 239, in evaluate
    return _evaluate(op, op_str, a, b)  # type: ignore[misc]
  File "D:\bo\envs\bd\lib\site-packages\pandas\core\computation\expressions.py", line 69, in _evaluate_standard
    return op(a, b)
ValueError: operands could not be broadcast together with shapes (121,) (11,)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\bo\envs\bd\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "D:\bo\envs\bd\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "D:\bo\EOBD_prediction\FIDDLE\run.py", line 141, in <module>
    main()
  File "D:\bo\EOBD_prediction\FIDDLE\run.py", line 138, in main
    X, X_feature_names, X_feature_aliases = FIDDLE_steps.process_time_dependent(df_time_series, args)
  File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 236, in process_time_dependent
    df_time_series, dtypes_time_series = transform_time_series_table(df_data_time_series, args)
  File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 462, in transform_time_series_table
    for i, g in tqdm(grouped[:N])
  File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 462, in <genexpr>
    for i, g in tqdm(grouped[:N])
  File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 391, in func_encode_single_time_series
    raise Exception(i)
Exception: 10000377
  2%|█▌                                                                             | 4/200 [00:00<00:15, 12.80it/s]

After some serious digging, I traced the error to line 223 of the pivot_event_table function in helpers.py, which is called from line 371 of the func_encode_single_time_series function in steps.py. It occurs because eps * np.arange(len(df_v)) has fewer elements than df_dups.loc[df_v.index, t_col]. I found that the data instance throwing this exception has the same var_name and var_value multiple times at the same t:

48989  10000377  2.476712                     ERFV_CODE         160431
48990  10000377  2.515068                     ERFV_CODE            122
48991  10000377  2.701370                     ERFV_CODE            751
48992  10000377  2.701370                     ERFV_CODE            751
48993  10000377  2.701370                     ERFV_CODE            751
48994  10000377  2.701370                     ERFV_CODE            751
48995  10000377  2.706849                     ERFV_CODE            751

and in g (line 371 in the func_encode_single_time_series function) this looks like:

48989  10000377  2.476712          ERFV_CODE        _160431
48990  10000377  2.515068          ERFV_CODE           _122
48991  10000377  2.701370     ERFV_CODE:_751              1
48992  10000377  2.701370     ERFV_CODE:_751              1
48993  10000377  2.701370     ERFV_CODE:_751              1
48994  10000377  2.701370     ERFV_CODE:_751              1

Do you have any suggestions on how to deal with this situation, please? I am not sure what the 1s in val_col represent. Are they a number of occurrences? And why do we have ERFV_CODE _122 in some cases but ERFV_CODE:_751 1 in others?
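
The shapes in the ValueError above, (121,) and (11,), are consistent with 11 rows sharing one index label: selecting with .loc by a list of labels on a non-unique index returns every matching row for every requested label, giving 11 * 11 = 121 rows. This is only one possible explanation; it assumes the frame inside pivot_event_table is indexed by a non-unique label such as the time t, which the thread does not confirm. A three-row sketch of the same effect:

```python
import numpy as np
import pandas as pd

# Three rows that share one index label (for example, the same time t).
s = pd.Series([1.0, 1.0, 1.0], index=[2.70137, 2.70137, 2.70137])

# Each of the 3 requested labels matches all 3 rows, so 3 * 3 = 9 rows come back.
selected = s.loc[list(s.index)]

# One offset per original row, as in eps * np.arange(len(df_v)).
offsets = 1e-5 * np.arange(len(s))

print(len(selected), len(offsets))  # 9 3
```

Adding offsets to selected would then fail, because the two lengths differ.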

shengpu-tang commented 2 years ago

Hi,

The parser for ICD9/ICD10 relies on third-party packages that I do not have control of, so it is possible the dictionary they use is outdated and may be missing some of the codes. In that case, I agree with what you did which is to preprocess them by adding the separators.

As for the issue of duplicates, the pipeline was not designed to handle duplicates. This is because for most types of EHR data like vital signs, there should not be two different values for the same patient at one point in time. There are several things you could try that may help address the error you saw:

  1. Use the pandas drop_duplicates function to remove duplicated rows that have the same [ID, t, variable_name] (the rows may have different variable_values)
  2. Add a small constant to the timestamps (e.g. 0.00001) of the duplicated rows so every row has a different timestamp.
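
Both suggestions can be sketched with pandas as follows. The column names follow the ID / t / variable_name / variable_value layout used earlier in this thread; the sample rows and the eps value are illustrative assumptions, not values prescribed by FIDDLE.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [10000377] * 4,
    "t": [2.515068, 2.701370, 2.701370, 2.701370],
    "variable_name": ["ERFV_CODE"] * 4,
    "variable_value": ["122", "751", "751", "751"],
})
key = ["ID", "t", "variable_name"]

# Option 1: keep only the first row for each (ID, t, variable_name).
deduped = df.drop_duplicates(subset=key, keep="first")

# Option 2: keep every row, but shift the k-th repeat within a group by k * eps.
eps = 1e-5
jittered = df.copy()
jittered["t"] = jittered["t"] + eps * jittered.groupby(key).cumcount()

print(len(deduped))                           # 2
print(jittered.duplicated(subset=key).any())  # False
```

Option 1 discards rows whose values differ at the same time point, while option 2 keeps them. For option 2, eps should be small compared with the gaps between real timestamps so that the shifted rows do not collide with later events.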

bwang482 commented 2 years ago

Many thanks @shengpu1126 !

Looking at the last example in my previous comment, can I please ask why you have different formats for var_name and var_value? e.g.

48989  10000377  2.476712  ERFV_CODE       _160431
vs.
48991  10000377  2.701370  ERFV_CODE:_751  1

Or, e.g. reading the final df_X I have noticed two different ways of representing ERFV_CODE:160431:

ERFV_CODE_value__160431 vs ERFV_CODE:_160431_value_1

Are they different in terms of how one should interpret them?

bwang482 commented 2 years ago

Also, what would ICD9_CODE_value_(1.999, 314.0] possibly represent?

shengpu-tang commented 2 years ago

Also, what would ICD9_CODE_value_(1.999, 314.0] possibly represent?

This is likely because some ICD codes look like numbers, and Python would interpret them as numbers unless we explicitly tell it these are strings. One workaround I usually use is to prepend an underscore ("123" -> "_123") so they cannot be interpreted as numbers.
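
A minimal sketch of that workaround, assuming the input is a CSV in the ID / t / variable_name / variable_value layout shown earlier in this thread (the sample rows are made up for illustration):

```python
import io

import pandas as pd

csv_text = (
    "ID,t,variable_name,variable_value\n"
    "1,0.5,ICD9_CODE,314\n"
    "1,0.7,ICD9_CODE,V502\n"
)

# Read values as strings so that a code such as 314 is not parsed as an integer.
df = pd.read_csv(io.StringIO(csv_text), dtype={"variable_value": str})

# Prepend an underscore to the ICD rows only, leaving other variables untouched.
is_icd = df["variable_name"] == "ICD9_CODE"
df.loc[is_icd, "variable_value"] = "_" + df.loc[is_icd, "variable_value"]

print(df["variable_value"].tolist())  # ['_314', '_V502']
```

Whether the underscore prefix is compatible with the built-in ICD9/ICD10 hierarchy parser is not stated in this thread, so it is safest to treat this as a workaround for columns handled as Categorical or generic hierarchical values.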