EHDEN / ETL-UK-Biobank

ETL UK-Biobank
https://ehden.github.io/ETL-UK-Biobank/

Entire ETL crashes when one transformation throws error #357

Closed MaximMoinat closed 2 years ago

MaximMoinat commented 2 years ago

In the 2022-03 run, this ValueError in baseline_to_condition_occurrence stops the entire ETL. Expected behaviour is that only the failing transformation is skipped, while the rest of the ETL continues.

2022-03-27 08:34:57,415 - INFO - Executing batched transformation: baseline_to_condition_occurrence 
2022-03-27 08:34:57,416 - INFO - Reading baseline.csv as DataFrame
Traceback (most recent call last):
  File "main.py", line 43, in <module>
    sys.exit(main(auto_envvar_prefix='ETL'))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "main.py", line 39, in main
    etl.run()
  File "/data/user/vaclav/ukb_etl/ETL-UK-Biobank/src/main/python/wrapper.py", line 66, in run
    self.transform()
  File "/data/user/vaclav/ukb_etl/ETL-UK-Biobank/src/main/python/wrapper.py", line 91, in transform
    self.execute_batch_transformation(baseline_to_condition_occurrence, bulk=True, batch_size=100000)
  File "/usr/local/lib/python3.7/site-packages/delphyne/model/orm_wrapper.py", line 126, in execute_batch_transformation
    for record in records_generator:
  File "/data/user/vaclav/ukb_etl/ETL-UK-Biobank/src/main/python/transformation/baseline_to_condition_occurrence.py", line 14, in baseline_to_condition_occurrence
    df = source.get_csv_as_df(apply_dtypes=False, usecols=['eid', '40000-0.0', '40002-0.0'])
  File "/usr/local/lib/python3.7/site-packages/delphyne/model/source_data/source_file.py", line 119, in get_csv_as_df
    force_reload=force_reload, cache=cache, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/delphyne/model/source_data/source_file.py", line 187, in _get_df
    df = read_func(apply_dtypes, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/delphyne/model/source_data/source_file.py", line 201, in _read_csv_as_df
    df = pd.read_csv(self._path, dtype='object', **full_kwargs)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 448, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 880, in __init__
    self._make_engine(self.engine)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1114, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1937, in __init__
    _validate_usecols_names(usecols, self.orig_names)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1233, in _validate_usecols_names
    "Usecols do not match columns, "
ValueError: Usecols do not match columns, columns expected but not found: ['40002-0.0']
Anne0507 commented 2 years ago

This error occurs when running the baseline_to_condition_occurrence transformation, specifically in df = source.get_csv_as_df(apply_dtypes=False, usecols=['eid', '40000-0.0', '40002-0.0']). The field 40002-0.0 is missing from the source data; this is fixed in #358.
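For context, pandas raises this ValueError whenever usecols names a column that is not present in the CSV header. A minimal, self-contained sketch (illustrative only, not code from this repository) of the behaviour and one defensive workaround:

import io
import pandas as pd

csv_text = "eid,40000-0.0\n1,A\n2,B\n"

# Requesting a column missing from the header raises
# "ValueError: Usecols do not match columns, columns expected but not found".
try:
    pd.read_csv(io.StringIO(csv_text), usecols=['eid', '40000-0.0', '40002-0.0'])
except ValueError as e:
    print(e)

# Defensive alternative: only request the columns that actually exist.
available = pd.read_csv(io.StringIO(csv_text), nrows=0).columns
wanted = ['eid', '40000-0.0', '40002-0.0']
df = pd.read_csv(io.StringIO(csv_text), usecols=[c for c in wanted if c in available])
print(df)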

MaximMoinat commented 2 years ago

@Anne0507 But why does the ETL stop? It should log the error and continue with the next transformation.

Anne0507 commented 2 years ago

@MaximMoinat @SofiaMp I asked @Spayralbe how this works. Delphyne has a try/except clause around committing a session (i.e. inserting data), but not around the data processing steps beforehand (such as selecting columns that are not present in the source file). For this case the error itself is fixed, but if we don't want the pipeline to stop in general, we can add a try/except around the transformation itself, like this existing example for loading custom vocabularies:

try:
    self.vocab_manager.load_custom_vocab_and_stcm_tables()
except Exception as e:
    # Exception objects have no .message attribute in Python 3; log the exception itself.
    logger.warning(f'Custom vocabulary and STCM loading failed: {e}')
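Applied to the failing transformation, the same pattern would look roughly like the sketch below (reusing the call shown in the traceback above; the logger name and log message are assumptions, not existing code):

try:
    self.execute_batch_transformation(baseline_to_condition_occurrence, bulk=True, batch_size=100000)
except Exception as e:
    # Assumes a module-level 'logger' is available in wrapper.py, as in the custom vocab example.
    logger.error(f'baseline_to_condition_occurrence failed, continuing with next transformation: {e}')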
MaximMoinat commented 2 years ago

I see, thanks for looking into this. We can close this issue.