Closed asmacdo closed 1 year ago
Going by our debugging today, it seems discovery was killing our job. Please attempt a re-run after https://github.com/con/opfvta-replication-2023/commit/74e2d07a168089c4fda613abfd1ca82f1ed935f7, which includes an upstream opfvta fix that might make the job's load discovery-amenable.
> it seems discovery was killing our job.
Out of time? Out of memory? Out of ... ? Anything in the logs?
Out of memory we believe.
The logs indicate that there was no data for one of the steps. This suggests that the step generating that data failed, and that step is the high-memory bottleneck. Nothing in the logs explicitly said OOM, but it's a reasonable guess.
To address this, we are dropping a limiting constant in opfvta that is passed to SAMRI from 0.75 to 0.5. (We are also filing an issue to make this configurable.) Last time it was running 32 threads IIRC. Hopefully this will bring that number down.
[X] Update opfvta https://bitbucket.org/TheChymera/opfvta/commits/656cd87d6efb4782f149a858f7f2a65dadc510be
[X] update opfvta commit for top level repo https://github.com/con/opfvta-replication-2023/commit/74e2d07a168089c4fda613abfd1ca82f1ed935f7
[X] merge temp discovery changes upstream
[x] Quick tidy of makefile for sanity
[x] rebuild oci container
[X] dockerhub push
[X] rebuild apptainer image from docker:// (Note: typhon's /tmp is not big enough; export SINGULARITY_TMPDIR="/home/asmacdo/tmp" fixes it)
[x] push new singularity image to gin
[ ] Orchestrate (update to discovery) dataset
[ ] Edit reproman job to put a hard cap at 400 GB RAM
[ ] deploy reproman job
I have created https://github.com/con/opfvta-replication-2023/issues/18 to track how we might fix the underlying issue. It's obviously not ideal to have to do each of these steps and move around all those gigabytes just to change a config value.
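For readers following along, here is a minimal sketch of how a limiting constant like the one above could translate into a worker count. It assumes the constant is a fraction of the available cores, which the thread suggests but does not confirm; `worker_count` is a hypothetical helper, not a function from opfvta or SAMRI, and the 40-core figure is purely illustrative.

```python
import os


def worker_count(fraction, total_cores=None):
    """Return how many workers a given fraction of the cores allows.

    Hypothetical helper for illustration; not part of opfvta or SAMRI.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if total_cores is None:
        # os.cpu_count() can return None, so fall back to a single core.
        total_cores = os.cpu_count() or 1
    # Round down, but always keep at least one worker.
    return max(1, int(total_cores * fraction))


# Illustrative core count only; the real node size is not stated in the thread.
for fraction in (0.75, 0.5):
    print(fraction, worker_count(fraction, total_cores=40))
```

If each worker holds its own copy of the data, peak memory grows roughly with the worker count, which is why lowering the fraction is a plausible lever here. Exposing it as a parameter, as in the sketch, would avoid rebuilding images to change it.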
@TheChymera what could those KeyErrors be due to?
The KeyError occurs in the step after the actual failure. It is attempting to read in data that wasn't produced.
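One way to make this kind of failure easier to read is to check that a step's outputs exist before the next step tries to load them. The sketch below is an assumption about how such a guard might look; `require_outputs`, the step name, and the file path are hypothetical and do not come from opfvta.

```python
from pathlib import Path


def require_outputs(step_name, paths):
    """Raise a clear error if any expected output of a step is missing.

    Hypothetical guard for illustration; not part of opfvta.
    """
    missing = [str(p) for p in paths if not Path(p).exists()]
    if missing:
        raise FileNotFoundError(
            f"step '{step_name}' produced no output at: {', '.join(missing)}"
        )


# Demonstration with a path that is assumed not to exist.
try:
    require_outputs("example_step", ["/nonexistent/example_output.tsv"])
except FileNotFoundError as error:
    print(error)
```

With a guard like this, the error names the step that failed to produce data instead of surfacing later as a KeyError in the consumer.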
If an error happens, why does the script continue? Is it a shell script? Then please add set -eu at the top. Maybe even set -x so we can see which commands are run.
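To illustrate what the set -eu suggestion changes, here is a small Python sketch that runs the same two-line shell script with and without it. It assumes a POSIX `sh` is available on the PATH; the script content is a stand-in and is not taken from the opfvta scripts.

```python
import subprocess

# Stand-in script: the first command fails, the second represents the next step.
SCRIPT = "false\necho 'next step ran'\n"


def run_script(script, strict):
    """Run a shell snippet, optionally prefixed with 'set -eu'."""
    prefix = "set -eu\n" if strict else ""
    return subprocess.run(
        ["sh", "-c", prefix + script],
        capture_output=True,
        text=True,
    )


lenient = run_script(SCRIPT, strict=False)
strict = run_script(SCRIPT, strict=True)

print("without set -eu:", lenient.returncode, repr(lenient.stdout))
print("with set -eu:   ", strict.returncode, repr(strict.stdout))
```

Without set -eu the shell carries on after the failing command and exits with the status of the last one, so the failure is hidden. With it, the script stops at the first failing command and returns a non-zero status.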
I think this will do it? https://bitbucket.org/TheChymera/opfvta/pull-requests/5
@asmacdo why did you close the PR, it says “@yarikoptic did better”, where?
yarik left another pr that did the same change and more
@asmacdo could you link me to it?
@asmacdo merged, since it's just good coding style. But are you sure this fixes it? It has nothing to do with the number of cores used. It just makes sure that the script fails more robustly.
Unfortunately, we have not fixed the issue. Here's the full stderr and stdout, but it looks like it's the same problem.
Actually, I'm pretty sure we fixed this issue. It's just that we ran into a brand new one which may or may not be similar. Feel free to reopen if this turns out to be the same thing.
The job ran a while, but eventually failed.
stderr: