This will probably be converted to an epic, but I want to lay out the procedural steps required to get to a point where we can determine why processing is failing.
One challenge in debugging this problem is that the errors don't seem to be reproducible locally. At least, I have had trouble creating tests that capture the same errors (I see different failure reasons). Additionally, logs are not retained indefinitely on staging, which makes jumping back into debugging difficult because it requires knowing which experiment triggers the error. This can be resolved with a bit more organization.
For now, we should clean up staging and start from a clean slate. From there, we should break up all (presumably processable) experiments into smaller batches and run them more slowly, so that we can attempt to reproduce the failures locally.
From the errors that are generated (we should expect errors, because there always are some), we need to open issues that capture the logs so we can implement a fix where appropriate.
Solution or next step
[ ] unsurvey all requested experiments on staging
[ ] generate input files to be surveyed manually (5 experiments per file)
[ ] manually queue new smaller batches of experiments
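The batching step could be sketched roughly as follows. This is a minimal sketch, not the project's actual tooling: the input file format (one experiment ID per line), the `batch_NNN.txt` naming, and the function names `chunk` and `write_batches` are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Iterator, List, Sequence


def chunk(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def write_batches(experiment_ids: Sequence[str], out_dir: str, size: int = 5) -> List[Path]:
    """Write one input file per batch and return the file paths.

    Assumption: an input file is plain text with one experiment ID per line.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, batch in enumerate(chunk(experiment_ids, size), start=1):
        # Hypothetical naming scheme: batch_001.txt, batch_002.txt, ...
        path = out / f"batch_{index:03d}.txt"
        path.write_text("\n".join(batch) + "\n")
        paths.append(path)
    return paths
```

Keeping each file to five experiments means that when a batch fails, the experiment that triggered the error is one of at most five candidates, which should make it easier to attach the right logs to an issue.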