KeyError when running on Discovery

asmacdo commented 1 year ago

The job ran a while, but eventually failed.

stderr:

(asmacdo-datalad2) [f006rq8@discovery7 20230706-100912-74dd]$ cat stderr.0 
INFO:    underlay of /usr/share/opfvta_bidsdata required more than 50 (115) bind mounts
Failed to import duecredit due to No module named 'duecredit'
/usr/lib/python3.10/site-packages/bids/grabbids/__init__.py:6: FutureWarning: grabbids has been renamed to layout in version 0.6.5, and will be removed in version 0.8
  warnings.warn("grabbids has been renamed to layout in version 0.6.5, and will be removed in version 0.8", FutureWarning)
/usr/lib/python3.10/site-packages/nilearn/input_data/__init__.py:27: FutureWarning: The import path 'nilearn.input_data' is deprecated in version 0.9. Importing from 'nilearn.input_data' will be possible at least until release 0.13.0. Please import from 'nilearn.maskers' instead.
  warnings.warn(message, FutureWarning)
/usr/lib/python3.10/site-packages/bids/layout/bids_layout.py:116: UserWarning: 'dataset_description.json' file is missing from project root. You may want to set the root path to a valid BIDS project.
  warnings.warn("'dataset_description.json' file is missing from "
/usr/lib/python3.10/site-packages/samri/pipelines/utils.py:283: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  res_df = res_df.append(_df)
/usr/lib/python3.10/site-packages/bids/layout/bids_layout.py:116: UserWarning: 'dataset_description.json' file is missing from project root. You may want to set the root path to a valid BIDS project.
  warnings.warn("'dataset_description.json' file is missing from "
/usr/lib/python3.10/site-packages/samri/pipelines/utils.py:283: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  res_df = res_df.append(_df)
Failed to import duecredit due to No module named 'duecredit'
/usr/lib/python3.10/site-packages/bids/grabbids/__init__.py:6: FutureWarning: grabbids has been renamed to layout in version 0.6.5, and will be removed in version 0.8
  warnings.warn("grabbids has been renamed to layout in version 0.6.5, and will be removed in version 0.8", FutureWarning)
/usr/lib/python3.10/site-packages/nilearn/input_data/__init__.py:27: FutureWarning: The import path 'nilearn.input_data' is deprecated in version 0.9. Importing from 'nilearn.input_data' will be possible at least until release 0.13.0. Please import from 'nilearn.maskers' instead.
  warnings.warn(message, FutureWarning)
/usr/lib/python3.10/site-packages/bids/layout/bids_layout.py:116: UserWarning: 'dataset_description.json' file is missing from project root. You may want to set the root path to a valid BIDS project.
  warnings.warn("'dataset_description.json' file is missing from "
/usr/lib/python3.10/site-packages/nilearn/input_data/__init__.py:27: FutureWarning: The import path 'nilearn.input_data' is deprecated in version 0.9. Importing from 'nilearn.input_data' will be possible at least until release 0.13.0. Please import from 'nilearn.maskers' instead.
  warnings.warn(message, FutureWarning)
Failed to import duecredit due to No module named 'duecredit'
/usr/lib/python3.10/site-packages/bids/grabbids/__init__.py:6: FutureWarning: grabbids has been renamed to layout in version 0.6.5, and will be removed in version 0.8
  warnings.warn("grabbids has been renamed to layout in version 0.6.5, and will be removed in version 0.8", FutureWarning)
Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3802, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 165, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5745, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5753, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'VTA t'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/src/opfvta/prepare/implant_coordinates.py", line 69, in <module>
    make_summary(df, i)
  File "/opt/src/opfvta/prepare/implant_coordinates.py", line 30, in make_summary
    values = coordinates['VTA t'].values
  File "/usr/lib/python3.10/site-packages/pandas/core/frame.py", line 3807, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/usr/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3804, in get_loc
    raise KeyError(key) from err
KeyError: 'VTA t'
make[2]: *** [Makefile:69: data] Error 1
running: pythontex article
execution of pythontex failed
There were errors compiling article.tex: Recipe for pythontex-files-article/article.pytxmcr failed.
error: Stopping because of compilation errors.
make[2]: *** [Makefile:24: article.pdf] Error 2

TheChymera commented 1 year ago

Going by our debugging today, it seems discovery was killing our job. Pls attempt re-run after https://github.com/con/opfvta-replication-2023/commit/74e2d07a168089c4fda613abfd1ca82f1ed935f7 , which includes an upstream opfvta fix, that might make the jobs load discovery-amenable.

yarikoptic commented 1 year ago

it seems discovery was killing our job.

out of time? out of memory? out of ... ? anything in the logs

asmacdo commented 1 year ago

Out of memory we believe.

The logs indicate that there was no data for one of the steps. This indicates that the step generating that data failed, which is the high memory bottleneck. Nothing in the logs explicitly said OOM, but its a reasonable guess.

To address this, we are dropping down a limiting constant from 0.75 -> 0.5 in opfvta that is passed to SAMRI. (We are also filing an issue to make this configurable). Last time it was running 32 threads IIRC. Hopefully this will drop down.

[X] Update opfvta https://bitbucket.org/TheChymera/opfvta/commits/656cd87d6efb4782f149a858f7f2a65dadc510be
[X] update opfvta commit for top level repo https://github.com/con/opfvta-replication-2023/commit/74e2d07a168089c4fda613abfd1ca82f1ed935f7
[X] merge temp discovery changes upstream
[x] Quick tidy of makefile for sanity
[x] rebuild oci container
[X] dockerhub push
[X] rebuild apptainer image from docker:// (Note: typhon's /tmp is not big enough, export SINGULARITY_TMPDIR="/home/asmacdo/tmp" fixes)
[x] push new singularity image to gin
[ ] Orchestrate(update to discovery) dataset
[ ] Edit reproman job to put a hard cap at 400Gb RAM
[ ] deploy reproman job

asmacdo commented 1 year ago

I have created https://github.com/con/opfvta-replication-2023/issues/18 to track how we might fix the underlying issue. Its obviously not ideal to have to do each of these steps and move around all those gigabytes just to change a config value.

yarikoptic commented 1 year ago

@TheChymera what those KeyError's could be due to?

asmacdo commented 1 year ago

The KeyError occurs in the step after the actual failure. It is attempting to read in data that wasn't produced.

yarikoptic commented 1 year ago

if error happens - why script continues? is it a shell script ? then add set -eu on top please. May be even set -x so we could see what commands are ran

asmacdo commented 1 year ago

I think this will do it? https://bitbucket.org/TheChymera/opfvta/pull-requests/5

TheChymera commented 1 year ago

@asmacdo why did you close the PR, it says “@yarikoptic did better”, where?

asmacdo commented 1 year ago

yarik left another pr that did the same change and more

TheChymera commented 1 year ago

@asmacdo could you link me to it?

asmacdo commented 1 year ago

https://bitbucket.org/TheChymera/opfvta/pull-requests/6

TheChymera commented 1 year ago

@asmacdo merged since it's just good coding style. But are you sure this fixes it, though? Has nothing to do with the number of cores used. It just makes sure that it fails more robustly.

asmacdo commented 1 year ago

Unfortunately, we have not fixed the issue. Here's the full stderr and stdout, but it looks like its the same problem.

stdout.0.txt

stderr.0.txt

TheChymera commented 1 year ago

Actually, I'm pretty sure we fixed this issue. It's just that we ran into a brand new one which may or may not be similar. Feel free to reopen if this turns out to be the same thing.

con / opfvta-reexecution

KeyError when running on Discovery #16