Closed by jkbhagatio 2 months ago
May be related to #465
@aspaNeuro mentioned that the workflow logic remained stuck even after restarting the workflow. @jkbhagatio I assume you deleted the temp files to resume experiments? Next time this happens, can you please save those temp files in a zip file and attach them to this issue, so we hopefully have a way to reproduce this?
@aspaNeuro also mentioned that a manual copy of data to CEPH was underway when the crash happened. In general we should not copy data manually to CEPH while experiments are running, but if we really need to, I would recommend the following procedure:
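For reference, a minimal sketch of how the temp files could be zipped before they are deleted. The directory arguments and the `tmp_crash_<timestamp>.zip` naming are assumptions for illustration; where the workflow actually writes its temp files is not stated in this thread, so adjust the paths as needed. The function name `archive_tmp_files` is hypothetical and not part of any existing tooling.

```python
import zipfile
from datetime import datetime
from pathlib import Path


def archive_tmp_files(tmp_dir, out_dir, pattern="*.tmp"):
    """Zip every file in tmp_dir matching pattern.

    Returns the path of the archive, or None if nothing matched.
    The original files are left in place; delete them separately
    once the archive has been checked.
    """
    tmp_dir = Path(tmp_dir)
    out_dir = Path(out_dir)

    # Only top-level regular files are collected (no recursion).
    files = sorted(p for p in tmp_dir.glob(pattern) if p.is_file())
    if not files:
        return None

    out_dir.mkdir(parents=True, exist_ok=True)
    # Timestamped name (assumed convention) so repeated crashes
    # do not overwrite each other.
    stamp = datetime.now().strftime("%Y%m%dT%H%M%S")
    archive = out_dir / f"tmp_crash_{stamp}.zip"

    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for f in files:
            # Store files flat, by name only, to keep the zip easy to inspect.
            zf.write(f, arcname=f.name)
    return archive
```

Calling `archive_tmp_files(<temp dir>, <output dir>)` before clearing the temp files would give a zip that can be attached here directly. It deliberately does not delete anything, so the recovery step stays a separate, explicit action.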
Tests
to ProjectAeon
Robocopy will then automatically copy these files as part of the scheduled task.
Finally, re. hardware or CEPH issues: next time, try to note whether the computer reboot is really required, or whether just deleting the temp files and following the above data copy steps is enough. I am just trying to isolate whether this is an issue exclusively with the task logic, with hardware / network, or an interaction between the two.
Here are the .tmp files from the latest crash: tmp_crash.zip
From a second time
Seems likely that it was triggered by maintenance / environment config loading
During tests today we seem to have found a sequence to somewhat reliably reproduce the issue:
ResetStateRecovery.cmd
Unfortunately the issue is not deterministic, so it must be a race condition. We did however determine that we can replicate the issue without changing the maintenance state, which rules out any causal link with start/stop maintenance. Reloading the environment and adding subjects both seem to be necessary at this point.
Reopening this as @lorycalca noticed this issue again in recent phields test experiments in aeon4
Duplicate of #514
After stopping the workflows on both machines and then restarting them, the workflow launched, but the environment logic was not working: the block timeout was static, and the encoder was not being read while the patch wheels were being spun.