Open bwang482 opened 2 years ago
Or does it mean ICD codes cannot be time-dependent variables? Surely they should be allowed?
Hello, I have just updated the code to fix this error. Please download the latest code from GitHub.
You may check out an example with data containing time-dependent ICD codes here. Please try to format your data according to this example.
Additionally, if the `variable_name` for your ICD code data is not "ICD9_CODE", you will need to change the following in your config file:
https://github.com/MLD3/FIDDLE/blob/86b197fc7ac3e6e90851e4bf01279156539aaee2/tests/icd_time_test/input/config-0.yaml#L4-L5
Many thanks @shengpu1126! Can I please confirm with you:
- I have noticed in your `icd_time_test` example, your ICD code is a series of letters and numbers, e.g. `V502`, but your `hierarchical_sep: ':'`. My diagnosis codes contain one or more dots, e.g. `V50.2`. Should I get rid of the dots, or should I set `hierarchical_sep: '.'`, if `hierarchical_sep` is indeed used for this purpose?
- My diagnosis codes contain some non-ICD codes, e.g. `DRG:389`. Do you recommend I use `hierarchical` or `Categorical` as `value_types`?
- Currently I can only use `hierarchical_levels: [0]`. If I set `hierarchical_levels: [0, 1]`, the error below occurs even though I have different levels in my diagnosis codes.
================================================================================
2) Transform; 3) Post-filter
================================================================================
--------------------------------------------------------------------------------
*) Detecting and parsing value types
--------------------------------------------------------------------------------
Parsing hierarchical values
Traceback (most recent call last):
File "D:\bo\envs\bd\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "D:\bo\envs\bd\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "D:\bo\EOBD_prediction\FIDDLE\run.py", line 141, in <module>
main()
File "D:\bo\EOBD_prediction\FIDDLE\run.py", line 131, in main
df_data, df_types = FIDDLE_steps.parse_variable_data_type(df_data, args)
File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 99, in parse_variable_data_type
df_hier_level[val_col] = df_hier_level[val_col].apply(lambda h: h[min(hier_level, len(h))])
File "D:\bo\envs\bd\lib\site-packages\pandas\core\series.py", line 4357, in apply
return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
File "D:\bo\envs\bd\lib\site-packages\pandas\core\apply.py", line 1043, in apply
return self.apply_standard()
File "D:\bo\envs\bd\lib\site-packages\pandas\core\apply.py", line 1101, in apply_standard
convert=self.convert_dtype,
File "pandas\_libs\lib.pyx", line 2859, in pandas._libs.lib.map_infer
File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 99, in <lambda>
df_hier_level[val_col] = df_hier_level[val_col].apply(lambda h: h[min(hier_level, len(h))])
IndexError: list index out of range
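The `IndexError` is consistent with the expression shown in the traceback, `h[min(hier_level, len(h))]`: when a value has fewer parts than the requested level, `min` returns `len(h)`, which is one past the last valid index. The following is a minimal sketch of that behavior, assuming `h` is the list obtained by splitting a value on `hierarchical_sep` (an assumption drawn from the traceback, not confirmed against the FIDDLE source). Both function names are hypothetical.

```python
def pick_level_as_in_traceback(h, hier_level):
    # Mirrors the indexing expression quoted in the traceback.
    return h[min(hier_level, len(h))]


def pick_level_clamped(h, hier_level):
    # Hypothetical alternative: clamp to the last valid index, len(h) - 1.
    return h[min(hier_level, len(h) - 1)]


two_parts = "V50.2".split(".")  # ['V50', '2']
one_part = "V50".split(".")     # ['V50'], a code with no second level

print(pick_level_as_in_traceback(two_parts, 1))  # 2

try:
    pick_level_as_in_traceback(one_part, 1)
except IndexError as err:
    print("IndexError:", err)  # IndexError: list index out of range

print(pick_level_clamped(one_part, 1))  # V50
```

Under this assumption, level 0 always succeeds for a non-empty list, which would match the observation that `hierarchical_levels: [0]` works while `[0, 1]` fails as soon as one value has only a single part.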
> I have noticed in your `icd_time_test` example, your ICD code is a series of letters and numbers, e.g. `V502`, but your `hierarchical_sep: ':'`. My diagnosis codes contain one or more dots, e.g. `V50.2`. Should I get rid of the dots, or should I set `hierarchical_sep: '.'`, if `hierarchical_sep` is indeed used for this purpose?
There's built-in support for ICD9/ICD10 codes through the icd9cms and icd10-cm packages, so I believe both `V50.2` and `V502` should work. The `:` separator is for other types of hierarchical values that need to be preprocessed.
> My diagnosis codes contain some non-ICD codes, e.g. `DRG:389`. Do you recommend I use `hierarchical` or `Categorical` as `value_types`?
I am less familiar with DRG codes. Does the DRG code `389` have multiple levels, similar to the ICD9 code `V502` having two levels, `V50` and `V50.2`? If not, I think you may just treat it as a Categorical variable, for example:
ID | t | variable_name | variable_value |
---|---|---|---|
XXX | 4 | DRG:1234 | 1 |
XXX | 5 | ICD9_CODE | V502 |
Otherwise you should preprocess it and include the separator:

ID | t | variable_name | variable_value |
---|---|---|---|
XXX | 4 | DRG_CODE | 12:34 |
XXX | 5 | ICD9_CODE | V502 |
Many thanks, @shengpu1126 ! I have separated ICD 9 and 10 codes from the rest, and named each coding scheme uniquely, e.g.:
ICD9_CODE: hierarchical_ICD9
ICD10_CODE: hierarchical_ICD10
DRG_CODE: Categorical
DSM4_CODE: hierarchical
I then got the error below, which was strange since '645.03' is a legitimate ICD9 code that indicates "Prolonged pregnancy, antepartum condition or complication".
--------------------------------------------------------------------------------
*) Detecting and parsing value types
--------------------------------------------------------------------------------
Parsing hierarchical values
Traceback (most recent call last):
File "D:\bo\envs\bd\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "D:\bo\envs\bd\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "D:\bo\EOBD_prediction\FIDDLE\run.py", line 141, in <module>
main()
File "D:\bo\EOBD_prediction\FIDDLE\run.py", line 131, in main
df_data, df_types = FIDDLE_steps.parse_variable_data_type(df_data, args)
File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 83, in parse_variable_data_type
df_var = df_var.apply(lambda s: map_icd_hierarchy(s, version=9))
File "D:\bo\envs\bd\lib\site-packages\pandas\core\series.py", line 4357, in apply
return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
File "D:\bo\envs\bd\lib\site-packages\pandas\core\apply.py", line 1043, in apply
return self.apply_standard()
File "D:\bo\envs\bd\lib\site-packages\pandas\core\apply.py", line 1101, in apply_standard
convert=self.convert_dtype,
File "pandas\_libs\lib.pyx", line 2859, in pandas._libs.lib.map_infer
File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 83, in <lambda>
df_var = df_var.apply(lambda s: map_icd_hierarchy(s, version=9))
File "D:\bo\EOBD_prediction\FIDDLE\helpers.py", line 39, in map_icd_hierarchy
raise Exception("Invalid ICD code", s)
Exception: ('Invalid ICD code', '645.03')
I then removed the dots as mentioned earlier, but the error stayed: `Exception: ('Invalid ICD code', '64503')`.
However changing
ICD9_CODE: hierarchical_ICD9
ICD10_CODE: hierarchical_ICD10
to
ICD9_CODE: hierarchical
ICD10_CODE: hierarchical
and switching back to codes that have the separator in them (`hierarchical_sep: "."`) worked.
However, I have now encountered a new error:
--------------------------------------------------------------------------------
2-B) Transform time-dependent data
--------------------------------------------------------------------------------
Total variables : 771
Frequent variables : []
M₁ = 0
M₂ = 771
k = 3 ['min', 'max', 'mean']
Transforming each example...
0%| | 0/200 [00:00<?, ?it/s]10000377
Traceback (most recent call last):
File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 371, in func_encode_single_time_series
df_j = pivot_event_table(g).reindex(columns=variables_non).sort_index()
File "D:\bo\EOBD_prediction\FIDDLE\helpers.py", line 223, in pivot_event_table
df_dups.loc[df_v.index, t_col] += eps * np.arange(len(df_v))
File "D:\bo\envs\bd\lib\site-packages\pandas\core\generic.py", line 10964, in __iadd__
return self._inplace_method(other, type(self).__add__) # type: ignore[operator]
File "D:\bo\envs\bd\lib\site-packages\pandas\core\generic.py", line 10941, in _inplace_method
result = op(self, other)
File "D:\bo\envs\bd\lib\site-packages\pandas\core\ops\common.py", line 69, in new_method
return method(self, other)
File "D:\bo\envs\bd\lib\site-packages\pandas\core\arraylike.py", line 92, in __add__
return self._arith_method(other, operator.add)
File "D:\bo\envs\bd\lib\site-packages\pandas\core\series.py", line 5526, in _arith_method
result = ops.arithmetic_op(lvalues, rvalues, op)
File "D:\bo\envs\bd\lib\site-packages\pandas\core\ops\array_ops.py", line 224, in arithmetic_op
res_values = _na_arithmetic_op(left, right, op)
File "D:\bo\envs\bd\lib\site-packages\pandas\core\ops\array_ops.py", line 166, in _na_arithmetic_op
result = func(left, right)
File "D:\bo\envs\bd\lib\site-packages\pandas\core\computation\expressions.py", line 239, in evaluate
return _evaluate(op, op_str, a, b) # type: ignore[misc]
File "D:\bo\envs\bd\lib\site-packages\pandas\core\computation\expressions.py", line 69, in _evaluate_standard
return op(a, b)
ValueError: operands could not be broadcast together with shapes (121,) (11,)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:\bo\envs\bd\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "D:\bo\envs\bd\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "D:\bo\EOBD_prediction\FIDDLE\run.py", line 141, in <module>
main()
File "D:\bo\EOBD_prediction\FIDDLE\run.py", line 138, in main
X, X_feature_names, X_feature_aliases = FIDDLE_steps.process_time_dependent(df_time_series, args)
File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 236, in process_time_dependent
df_time_series, dtypes_time_series = transform_time_series_table(df_data_time_series, args)
File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 462, in transform_time_series_table
for i, g in tqdm(grouped[:N])
File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 462, in <genexpr>
for i, g in tqdm(grouped[:N])
File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 391, in func_encode_single_time_series
raise Exception(i)
Exception: 10000377
2%|█▌ | 4/200 [00:00<00:15, 12.80it/s]
After some serious digging, I have traced the error to line 223 of the `pivot_event_table` function in helpers.py, which is called on line 371 of the `func_encode_single_time_series` function in steps.py. It occurs because `eps * np.arange(len(df_v))` has fewer elements (11) than `df_dups.loc[df_v.index, t_col]` (121), as the shapes in the `ValueError` show. I have discovered that the data instance throwing this exception has the same `var_name` and `var_value` multiple times at the same `t`:
48989 10000377 2.476712 ERFV_CODE 160431
48990 10000377 2.515068 ERFV_CODE 122
48991 10000377 2.701370 ERFV_CODE 751
48992 10000377 2.701370 ERFV_CODE 751
48993 10000377 2.701370 ERFV_CODE 751
48994 10000377 2.701370 ERFV_CODE 751
48995 10000377 2.706849 ERFV_CODE 751
and in `g` (line 371 in the `func_encode_single_time_series` function) this looks like:
48989 10000377 2.476712 ERFV_CODE _160431
48990 10000377 2.515068 ERFV_CODE _122
48991 10000377 2.701370 ERFV_CODE:_751 1
48992 10000377 2.701370 ERFV_CODE:_751 1
48993 10000377 2.701370 ERFV_CODE:_751 1
48994 10000377 2.701370 ERFV_CODE:_751 1
Do you have any suggestions on how to deal with this situation, please? I am not sure what the 1s represent in `val_col`. Do they mean a number of occurrences? Why do we have `ERFV_CODE _122` in some cases but `ERFV_CODE:_751 1` in others?
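The shapes in the `ValueError`, (121,) and (11,), are consistent with label-based selection on a non-unique index, since 121 = 11 × 11: in pandas, `.loc` with a list of labels returns every row matching each label, so 11 repeated labels that each match 11 rows yield 121 rows. The sketch below reproduces this on a smaller scale and then shows one way to drop exact duplicate events before running the pipeline. It assumes the long-format column names `ID`, `t`, `variable_name`, `variable_value`, and it is not taken from the FIDDLE source, so whether `pivot_event_table` hits exactly this case is an assumption.

```python
import numpy as np
import pandas as pd

# Three time stamps that share one index label, standing in for repeated events.
t = pd.Series([2.701370, 2.701370, 2.701370], index=["751", "751", "751"])

selected = t.loc[t.index]            # each of the 3 labels matches all 3 rows
offsets = 1e-6 * np.arange(len(t))   # one offset per original row

print(len(selected), len(offsets))   # 9 3

try:
    selected + offsets
except ValueError:
    print("length mismatch, as in the traceback")

# Dropping exact duplicates in the long-format input avoids repeated labels.
events = pd.DataFrame({
    "ID": [10000377] * 4,
    "t": [2.515068, 2.701370, 2.701370, 2.701370],
    "variable_name": ["ERFV_CODE"] * 4,
    "variable_value": ["122", "751", "751", "751"],
})
deduped = events.drop_duplicates(
    subset=["ID", "t", "variable_name", "variable_value"]
)
print(len(deduped))  # 2
```

Dropping duplicates discards the number of repetitions, so this only fits cases where a repeated code at the same time carries no extra meaning.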
Hi,
The parser for ICD9/ICD10 relies on third-party packages that I do not control, so it is possible the dictionary they use is outdated and missing some of the codes. In that case, I agree with what you did, which is to preprocess them by adding the separators.
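As an illustration of adding the separators, here is a minimal sketch that inserts a dot into an undotted ICD-9-CM code. It relies on the ICD-9-CM convention that the category is the first three characters (the first four for E codes). The function name `add_icd9_separator` is hypothetical and not part of FIDDLE.

```python
def add_icd9_separator(code, sep="."):
    """Insert `sep` after the category part of an ICD-9-CM code."""
    # Normalize: trim whitespace, upper-case, and drop any existing dot.
    code = code.strip().upper().replace(".", "")
    # E codes have a four-character category (E + 3 digits); others have three.
    head = 4 if code.startswith("E") else 3
    if len(code) <= head:
        return code  # category only, nothing to separate
    return code[:head] + sep + code[head:]


print(add_icd9_separator("64503"))  # 645.03
print(add_icd9_separator("V502"))   # V50.2
print(add_icd9_separator("E8490"))  # E849.0
print(add_icd9_separator("645"))    # 645
```

With codes in this form, setting `hierarchical_sep: "."` and the generic `hierarchical` value type matches the configuration reported to work earlier in the thread.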
As for the issue of duplicates, the pipeline was not designed to handle duplicates. This is because for most types of EHR data like vital signs, there should not be two different values for the same patient at one point in time. There are several things you could try that may help address the error you saw:
Many thanks @shengpu1126!
Looking at the last example in my previous comment, can I please ask why you have different formats for `var_name` and `var_value`? e.g.
48989 10000377 2.476712 ERFV_CODE _160431
vs.
48991 10000377 2.701370 ERFV_CODE:_751 1
Or, e.g. reading the final `df_X`, I have noticed two different ways of representing `ERFV_CODE:160431`:
ERFV_CODE_value__160431
vs
ERFV_CODE:_160431_value_1
Are they different in terms of how one should interpret them?
Also, what would `ICD9_CODE_value_(1.999, 314.0]` possibly represent?
> Also, what would `ICD9_CODE_value_(1.999, 314.0]` possibly represent?
This is likely because some ICD codes look like numbers, and Python would interpret them as numbers unless we explicitly tell it these are strings. One workaround I usually use is to prepend an underscore ("123" -> "_123") so they cannot be interpreted as numbers.
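The underscore workaround can be sketched as follows, using `pd.to_numeric` as a stand-in for whatever numeric detection the pipeline performs (that correspondence is an assumption). The example codes are taken from earlier in this thread.

```python
import pandas as pd

codes = pd.Series(["123", "V502", "64503"])

# Without a prefix, the purely numeric-looking codes parse as numbers.
parsed = pd.to_numeric(codes, errors="coerce")
print(parsed.notna().tolist())  # [True, False, True]

# With an underscore prefix, none of the values can be read as a number.
prefixed = "_" + codes.astype(str)
print(prefixed.tolist())  # ['_123', '_V502', '_64503']
print(pd.to_numeric(prefixed, errors="coerce").isna().all())  # True
```

Applying the prefix to every value of the code column, rather than only the numeric-looking ones, keeps the column uniformly string-typed.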
I am not using MIMIC-III or eICU data, and since this pipeline should be applicable to other EHR data sets, I am using it for in-house EHR data. No matter how I preprocess ICD codes, e.g. `ICD9:V50.2` vs `V50.2` vs `V502`, I always encounter the error below:

So my `df_types` has only one ICD-related variable name, `icd_code`, which is correct. However, the `parse_variable_data_type` process has made a whole new list of variable names with icd at the beginning, which is why `variables` has a long list of "icd_code:*" elements. The whole process is very confusing and vague in details. Would you please enlighten me on the source of the error? Many thanks.