EHDEN / ETL-UK-Biobank

ETL UK-Biobank
https://ehden.github.io/ETL-UK-Biobank/

ValueError columns expected but not found: ['40002-0.0'] in baseline_to_condition_occurrence #358

Closed MaximMoinat closed 2 years ago

MaximMoinat commented 2 years ago

In the 2022-03 run on real data, the baseline_to_condition_occurrence mapping fails. See the error message below.

A possible cause is the array index, which starts at 1 (!) for field 40002. Column 40002-0.0 will therefore not exist, but 40002-0.1 may. To be confirmed by Vaclav.

--edit--: This mapping may be unnecessary once the separate death tables are mapped, in which case we can ignore all death register fields from baseline (#356). We will quickly fix this issue for future use and then disable the mapping for UKB.
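
A minimal sketch of how the missing column could be confirmed from the file header alone, assuming baseline columns are named `<field>-<instance>.<array>` as in the error message. The inline sample data and the helper name `columns_for_field` are hypothetical and not part of the ETL code.

```python
import csv
import io
import re

# Hypothetical stand-in for the first lines of baseline.csv:
# field 40002 has array indexes 1 and 2, but no index 0.
SAMPLE_CSV = "eid,40000-0.0,40002-0.1,40002-0.2\n1,2015-01-01,I251,E119\n"


def columns_for_field(header, field_id):
    """Return all header columns of `field_id`, regardless of instance or array index."""
    pattern = re.compile(rf"^{field_id}-\d+\.\d+$")
    return [col for col in header if pattern.match(col)]


# Only the header row is read, so this is cheap even for a very wide file.
header = next(csv.reader(io.StringIO(SAMPLE_CSV)))

# The columns the failing mapping asks for.
wanted = ["eid", "40000-0.0", "40002-0.0"]
missing = [col for col in wanted if col not in header]

print(missing)                           # ['40002-0.0']
print(columns_for_field(header, 40002))  # ['40002-0.1', '40002-0.2']
```

The list returned by `columns_for_field` could be passed as `usecols` in place of the hard-coded `'40002-0.0'`. pandas also accepts a callable for `usecols`, which selects matching column names without raising when an expected one is absent.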

2022-03-27 08:34:57,415 - INFO - Executing batched transformation: baseline_to_condition_occurrence 
2022-03-27 08:34:57,416 - INFO - Reading baseline.csv as DataFrame
Traceback (most recent call last):
  File "main.py", line 43, in <module>
    sys.exit(main(auto_envvar_prefix='ETL'))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "main.py", line 39, in main
    etl.run()
  File "/data/user/vaclav/ukb_etl/ETL-UK-Biobank/src/main/python/wrapper.py", line 66, in run
    self.transform()
  File "/data/user/vaclav/ukb_etl/ETL-UK-Biobank/src/main/python/wrapper.py", line 91, in transform
    self.execute_batch_transformation(baseline_to_condition_occurrence, bulk=True, batch_size=100000)
  File "/usr/local/lib/python3.7/site-packages/delphyne/model/orm_wrapper.py", line 126, in execute_batch_transformation
    for record in records_generator:
  File "/data/user/vaclav/ukb_etl/ETL-UK-Biobank/src/main/python/transformation/baseline_to_condition_occurrence.py", line 14, in baseline_to_condition_occurrence
    df = source.get_csv_as_df(apply_dtypes=False, usecols=['eid', '40000-0.0', '40002-0.0'])
  File "/usr/local/lib/python3.7/site-packages/delphyne/model/source_data/source_file.py", line 119, in get_csv_as_df
    force_reload=force_reload, cache=cache, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/delphyne/model/source_data/source_file.py", line 187, in _get_df
    df = read_func(apply_dtypes, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/delphyne/model/source_data/source_file.py", line 201, in _read_csv_as_df
    df = pd.read_csv(self._path, dtype='object', **full_kwargs)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 448, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 880, in __init__
    self._make_engine(self.engine)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1114, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1937, in __init__
    _validate_usecols_names(usecols, self.orig_names)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1233, in _validate_usecols_names
    "Usecols do not match columns, "
ValueError: Usecols do not match columns, columns expected but not found: ['40002-0.0']

MaximMoinat commented 2 years ago

There are more fields whose array indexing starts at 1. We need to check whether this can also cause issues in stem_to_baseline.

None of these fields appear to be prioritised fields.

field_id title main_category
20147 Errors before selecting correct item in numeric path (trail # 1) 121
20148 Errors before selecting correct item in alphanumeric path (trail # 2) 121
20149 Interval between previous point and current one in numeric path (trail # 1) 121
20155 Interval between previous point and current one in alphanumeric path (trail # 2) 121
20544 Mental health problems ever diagnosed by a professional 137
20546 Substances taken for depression 138
20547 Activities undertaken to treat depression 138
20548 Manifestations of mania or irritability 139
20549 Substances taken for anxiety 140
20550 Activities undertaken to treat anxiety 140
20551 Substance of prescription or over-the-counter medication addiction 141
20552 Behavioural and miscellaneous addictions 141
20553 Methods of self-harm used 146
20554 Actions taken following self-harm 146
10691 Result ranking (pilot) 100020
10693 Acceptability of each blow result (pilot) 100020
10694 Forced vital capacity (FVC) (pilot) 100020
10695 Forced expiratory volume in 1-second (FEV1) (pilot) 100020
10696 Peak expiratory flow (PEF) (pilot) 100020
10697 Data points for blow (pilot) 100020
20032 Acceptability of each blow result (text) (pilot) 100020
10142 Number of columns displayed (pilot) 100028
10143 Number of rows displayed (pilot) 100028
10144 Time taken to complete lights test (pilot) 100028
10145 Pattern of lights displayed (pilot) 100028
10146 Pattern of lights as remembered (pilot) 100028
396 Number of columns displayed in round 100030
397 Number of rows displayed in round 100030
398 Number of correct matches in round 100030
399 Number of incorrect matches in round 100030
400 Time to complete round 100030
6334 Screen layout 100030
10133 Number of columns displayed in round (pilot) 100030
10134 Number of rows displayed in round (pilot) 100030
10136 Number of correct matches in round (pilot) 100030
10137 Number of incorrect matches in round (pilot) 100030
10138 Time to complete round (pilot) 100030
4229 Triplet played (left) 100049
4230 Signal-to-noise-ratio (SNR) of triplet (left) 100049
4232 Triplet correct (left) 100049
4233 Mean signal-to-noise ratio (SNR), (left) 100049
4234 Time to press first digit (left) 100049
4235 Time to press last digit (left) 100049
4236 Triplet entered (left) 100049
4237 Time to press 'next' (left) 100049
4238 Keystroke history (left) 100049
4239 Number of times 'clear' was pressed (left) 100049
4240 Triplet played (right) 100049
4241 Signal-to-noise-ratio (SNR) of triplet (right) 100049
4242 Triplet entered (right) 100049
4243 Triplet correct (right) 100049
4244 Mean signal-to-noise ratio (SNR), (right) 100049
4245 Time to press first digit (right) 100049
4246 Time to press last digit (right) 100049
4247 Time to press 'next' (right) 100049
4248 Keystroke history (right) 100049
4249 Number of times 'clear' was pressed (right) 100049
40002 Contributory (secondary) causes of death: ICD10 100093
22009 Genetic principal components 100313
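
A list like the one above could also be checked against a concrete baseline header. The following is a sketch only, assuming the `<field>-<instance>.<array>` column naming; the function name `fields_without_array_zero` and the sample header are hypothetical.

```python
import re

# Matches column names such as '40002-0.1': field id, instance, array index.
COLUMN_RE = re.compile(r"^(\d+)-(\d+)\.(\d+)$")


def fields_without_array_zero(header):
    """Map field_id -> smallest array index, for fields that have no array index 0."""
    min_index = {}
    for col in header:
        match = COLUMN_RE.match(col)
        if match is None:
            continue  # skip non-field columns such as 'eid'
        field_id, array_idx = int(match.group(1)), int(match.group(3))
        min_index[field_id] = min(array_idx, min_index.get(field_id, array_idx))
    return {field: idx for field, idx in min_index.items() if idx > 0}


# Hypothetical header; a real one would be read from baseline.csv.
header = ["eid", "40000-0.0", "40002-0.1", "40002-0.2", "20544-0.1", "31-0.0"]
print(fields_without_array_zero(header))  # {40002: 1, 20544: 1}
```

Any mapping that selects a hard-coded `<field>-0.0` column for one of the reported fields would fail in the same way as baseline_to_condition_occurrence.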

MaximMoinat commented 2 years ago

In stem_to_baseline, the array indexes are ignored entirely. So this is not an issue.