ValueError columns expected but not found: ['40002-0.0'] in baseline_to_condition_occurrence

MaximMoinat commented 2 years ago

In the 2022-03 run on real data, the baseline_to_condition_occurrence mapping fails. See the error message below.

A possible cause is the array indexing, which starts at 1 (!) for field 40002. The column 40002-0.0 will therefore not exist, but 40002-0.1 might. To be confirmed by Vaclav.

--edit--: this mapping might be unnecessary once the separate death tables are mapped, in which case we can ignore all death register fields from baseline (#356). We will quickly fix this issue for future use and then disable the mapping for UKB.

2022-03-27 08:34:57,415 - INFO - Executing batched transformation: baseline_to_condition_occurrence 
2022-03-27 08:34:57,416 - INFO - Reading baseline.csv as DataFrame
Traceback (most recent call last):
  File "main.py", line 43, in <module>
    sys.exit(main(auto_envvar_prefix='ETL'))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "main.py", line 39, in main
    etl.run()
  File "/data/user/vaclav/ukb_etl/ETL-UK-Biobank/src/main/python/wrapper.py", line 66, in run
    self.transform()
  File "/data/user/vaclav/ukb_etl/ETL-UK-Biobank/src/main/python/wrapper.py", line 91, in transform
    self.execute_batch_transformation(baseline_to_condition_occurrence, bulk=True, batch_size=100000)
  File "/usr/local/lib/python3.7/site-packages/delphyne/model/orm_wrapper.py", line 126, in execute_batch_transformation
    for record in records_generator:
  File "/data/user/vaclav/ukb_etl/ETL-UK-Biobank/src/main/python/transformation/baseline_to_condition_occurrence.py", line 14, in baseline_to_condition_occurrence
    df = source.get_csv_as_df(apply_dtypes=False, usecols=['eid', '40000-0.0', '40002-0.0'])
  File "/usr/local/lib/python3.7/site-packages/delphyne/model/source_data/source_file.py", line 119, in get_csv_as_df
    force_reload=force_reload, cache=cache, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/delphyne/model/source_data/source_file.py", line 187, in _get_df
    df = read_func(apply_dtypes, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/delphyne/model/source_data/source_file.py", line 201, in _read_csv_as_df
    df = pd.read_csv(self._path, dtype='object', **full_kwargs)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 448, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 880, in __init__
    self._make_engine(self.engine)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1114, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1937, in __init__
    _validate_usecols_names(usecols, self.orig_names)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1233, in _validate_usecols_names
    "Usecols do not match columns, "
ValueError: Usecols do not match columns, columns expected but not found: ['40002-0.0']
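
The traceback shows pandas rejecting the hard-coded usecols list because 40002-0.0 is not in the header. A minimal sketch of one way to avoid hard-coding the array index is to read the header first and pick up whatever columns exist for the field. The helper name resolve_field_columns and the sample header are hypothetical, and the column naming '<field_id>-<instance>.<array_index>' is assumed from the column names in this issue; this is not the fix that was actually applied.

```python
import csv
import io
import re


def resolve_field_columns(header, field_id, instance=0):
    """Return the header columns for one field and instance, sorted by array index.

    Column names are assumed to look like '<field_id>-<instance>.<array_index>'.
    """
    pattern = re.compile(rf"^{field_id}-{instance}\.(\d+)$")
    found = []
    for col in header:
        match = pattern.match(col)
        if match:
            found.append((int(match.group(1)), col))
    return [col for _, col in sorted(found)]


# Made-up header in which field 40002 starts its array at 1 (assumption).
sample = io.StringIO("eid,40000-0.0,40002-0.1,40002-0.2\n")
header = next(csv.reader(sample))

usecols = ["eid"] + resolve_field_columns(header, 40000) + resolve_field_columns(header, 40002)
print(usecols)
```

The resulting list could then be passed as usecols in place of the fixed ['eid', '40000-0.0', '40002-0.0']. If a field is absent entirely, the helper returns an empty list, so the caller would still need to decide whether that is an error.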

MaximMoinat commented 2 years ago

More fields start their array indexing at 1, so we need to check whether this also causes issues in stem_to_baseline.

None of these fields appear to be prioritised fields.

field_id	title	main_category
20147	Errors before selecting correct item in numeric path (trail # 1)	121
20148	Errors before selecting correct item in alphanumeric path (trail # 2)	121
20149	Interval between previous point and current one in numeric path (trail # 1)	121
20155	Interval between previous point and current one in alphanumeric path (trail # 2)	121
20544	Mental health problems ever diagnosed by a professional	137
20546	Substances taken for depression	138
20547	Activities undertaken to treat depression	138
20548	Manifestations of mania or irritability	139
20549	Substances taken for anxiety	140
20550	Activities undertaken to treat anxiety	140
20551	Substance of prescription or over-the-counter medication addiction	141
20552	Behavioural and miscellaneous addictions	141
20553	Methods of self-harm used	146
20554	Actions taken following self-harm	146
10691	Result ranking (pilot)	100020
10693	Acceptability of each blow result (pilot)	100020
10694	Forced vital capacity (FVC) (pilot)	100020
10695	Forced expiratory volume in 1-second (FEV1) (pilot)	100020
10696	Peak expiratory flow (PEF) (pilot)	100020
10697	Data points for blow (pilot)	100020
20032	Acceptability of each blow result (text) (pilot)	100020
10142	Number of columns displayed (pilot)	100028
10143	Number of rows displayed (pilot)	100028
10144	Time taken to complete lights test (pilot)	100028
10145	Pattern of lights displayed (pilot)	100028
10146	Pattern of lights as remembered (pilot)	100028
396	Number of columns displayed in round	100030
397	Number of rows displayed in round	100030
398	Number of correct matches in round	100030
399	Number of incorrect matches in round	100030
400	Time to complete round	100030
6334	Screen layout	100030
10133	Number of columns displayed in round (pilot)	100030
10134	Number of rows displayed in round (pilot)	100030
10136	Number of correct matches in round (pilot)	100030
10137	Number of incorrect matches in round (pilot)	100030
10138	Time to complete round (pilot)	100030
4229	Triplet played (left)	100049
4230	Signal-to-noise-ratio (SNR) of triplet (left)	100049
4232	Triplet correct (left)	100049
4233	Mean signal-to-noise ratio (SNR), (left)	100049
4234	Time to press first digit (left)	100049
4235	Time to press last digit (left)	100049
4236	Triplet entered (left)	100049
4237	Time to press 'next' (left)	100049
4238	Keystroke history (left)	100049
4239	Number of times 'clear' was pressed (left)	100049
4240	Triplet played (right)	100049
4241	Signal-to-noise-ratio (SNR) of triplet (right)	100049
4242	Triplet entered (right)	100049
4243	Triplet correct (right)	100049
4244	Mean signal-to-noise ratio (SNR), (right)	100049
4245	Time to press first digit (right)	100049
4246	Time to press last digit (right)	100049
4247	Time to press 'next' (right)	100049
4248	Keystroke history (right)	100049
4249	Number of times 'clear' was pressed (right)	100049
40002	Contributory (secondary) causes of death: ICD10	100093
22009	Genetic principal components	100313
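
A list like the one above could be cross-checked against the column names actually present in a baseline export. The sketch below is only an illustration under the assumption that columns are named '<field_id>-<instance>.<array_index>'; the function name and the sample column list are made up.

```python
import re

COLUMN_PATTERN = re.compile(r"^(\d+)-(\d+)\.(\d+)$")


def fields_not_starting_at_zero(columns):
    """Map field_id -> lowest array index, for fields whose arrays do not start at 0."""
    lowest = {}
    for col in columns:
        match = COLUMN_PATTERN.match(col)
        if not match:
            continue  # skip non-field columns such as 'eid'
        field_id, array_index = int(match.group(1)), int(match.group(3))
        lowest[field_id] = min(array_index, lowest.get(field_id, array_index))
    return {field_id: idx for field_id, idx in lowest.items() if idx != 0}


# Made-up column names for illustration only.
columns = ["eid", "40000-0.0", "40002-0.1", "40002-0.2", "22009-0.1", "22009-0.2"]
flagged = fields_not_starting_at_zero(columns)
print(flagged)
```

Any field reported here would fail in a mapping that requests the '-0.0' column by name.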

MaximMoinat commented 2 years ago

In stem_to_baseline, the array indexes are ignored entirely, so this is not an issue there.
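
To illustrate why ignoring the array index sidesteps the problem, here is a minimal sketch of turning one wide row into long records keyed only by field and instance. It is not the actual stem_to_baseline code; the function name, the record layout, and the sample row are assumptions.

```python
import re

COLUMN_PATTERN = re.compile(r"^(\d+)-(\d+)\.(\d+)$")


def row_to_records(row):
    """Yield (eid, field_id, instance, value) for every non-empty field column.

    The array index is matched by the pattern but deliberately not used,
    so it does not matter whether a field's array starts at 0 or 1.
    """
    for col, value in row.items():
        match = COLUMN_PATTERN.match(col)
        if match is None or value in ("", None):
            continue
        yield row["eid"], int(match.group(1)), int(match.group(2)), value


# Made-up row for illustration only.
row = {"eid": "1000001", "40000-0.0": "", "40002-0.1": "I10", "40002-0.2": "E11"}
records = list(row_to_records(row))
print(records)
```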

EHDEN / ETL-UK-Biobank

ValueError columns expected but not found: ['40002-0.0'] in baseline_to_condition_occurrence #358