EBISPOT / scxa_2_cxg


Error when processing E-MTAB-9444: “KeyError: 'MAGE-TAB Version'” #36

Closed gouttegd closed 2 months ago

gouttegd commented 3 months ago

The following error happens when trying to process the E-MTAB-9444 dataset:

$ poetry run python src/bulk_experiments.py --study_filter E-MTAB-9444 --chunk_size 10 --download
INFO:root:Processing study: E-MTAB-9444
INFO:root:Downloading files...
[… OUTPUT TRUNCATED FOR BREVITY …]
Traceback (most recent call last):
  File "/Users/dpg44/Library/Caches/pypoetry/virtualenvs/scxa-kg-yaacZmVL-py3.12/lib/python3.12/site-packages/pandas/core/indexes/base.py", line 3805, in get_loc
    return self._engine.get_loc(casted_key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7081, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7089, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'MAGE-TAB Version'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/dpg44/Development/Python/scxa_2_cxg/src/bulk_experiments.py", line 90, in <module>
    bulk_process(args.study_filter, args.chunk_size, args.download, args.modified)
  File "/Users/dpg44/Development/Python/scxa_2_cxg/src/bulk_experiments.py", line 73, in bulk_process
    convert_and_save(study)
  File "/Users/dpg44/Development/Python/scxa_2_cxg/src/scxa2cxg.py", line 171, in convert_and_save
    metadata.set_index(metadata["MAGE-TAB Version"], inplace=True)
                       ~~~~~~~~^^^^^^^^^^^^^^^^^^^^
  File "/Users/dpg44/Library/Caches/pypoetry/virtualenvs/scxa-kg-yaacZmVL-py3.12/lib/python3.12/site-packages/pandas/core/frame.py", line 4102, in __getitem__
    indexer = self.columns.get_loc(key)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dpg44/Library/Caches/pypoetry/virtualenvs/scxa-kg-yaacZmVL-py3.12/lib/python3.12/site-packages/pandas/core/indexes/base.py", line 3812, in get_loc
    raise KeyError(key) from err
KeyError: 'MAGE-TAB Version'

This is because the IDF file in E-MTAB-9444 starts like this:

Comment[ArrayExpressAccession]  E-MTAB-9444
MAGE-TAB Version        1.1

The MAGE-TAB Version key only appears on the second row, whereas this line in the convert_and_save function

metadata.set_index(metadata["MAGE-TAB Version"], inplace=True)

looks up MAGE-TAB Version as a column label of the metadata data frame, which seemingly assumes that MAGE-TAB Version always appears first in the IDF file.

Based on the specification of the MAGE-TAB format, I don’t think that assumption is warranted. Unless I missed something, the specification does not mandate any particular order for the fields that make up the IDF file, and in particular does not require MAGE-TAB Version to be in the first position.
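
For illustration, here is a minimal sketch of one way to read the IDF file without relying on the order of its rows. It is not the project's code: the `read_idf` helper is hypothetical, it uses only the standard library rather than pandas, and it assumes the IDF is tab-delimited with the field name in the first cell of each row. Each row is stored under its field name, so MAGE-TAB Version can be looked up wherever it appears.

```python
import csv
import io


def read_idf(handle):
    """Parse a tab-delimited IDF stream into {field name: [values]}.

    Hypothetical helper: rows are keyed by their first cell, so the
    order in which the fields appear in the file does not matter.
    """
    fields = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and rows without a field name.
        if not row or not row[0].strip():
            continue
        key = row[0].strip()
        # Drop empty trailing cells left by padded rows.
        values = [cell for cell in row[1:] if cell != ""]
        fields.setdefault(key, []).extend(values)
    return fields


# The first two rows of the E-MTAB-9444 IDF file, as quoted above.
sample = (
    "Comment[ArrayExpressAccession]\tE-MTAB-9444\n"
    "MAGE-TAB Version\t1.1\n"
)

idf = read_idf(io.StringIO(sample))
# Fall back to None if the field is absent or has no value.
version = (idf.get("MAGE-TAB Version") or [None])[0]
print(version)
```

If the rest of convert_and_save needs a data frame, the resulting dictionary could presumably be converted afterwards; the point is only that the lookup is done by field name rather than by position.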