SainsburyWellcomeCentre / aeon_experiments

Experiment workflows for Project Aeon
BSD 3-Clause "New" or "Revised" License

Social0.2 exp workflow logic stopped working on aeon3 and aeon4 around 2024-02-02 00:00:00 #498

Closed jkbhagatio closed 2 months ago

jkbhagatio commented 5 months ago

After stopping and then restarting the workflows on both machines, the workflow launched, but the environment logic was not working: the block timeout was static, and the encoder was not being read while the patch wheels were being spun.

jkbhagatio commented 5 months ago

May be related to #465

glopesdev commented 5 months ago

@aspaNeuro mentioned that the workflow logic remained stuck even after restarting the workflow. @jkbhagatio I am assuming you deleted the temp files to resume experiments? Next time this happens, can you please save those temp files in a zip file and attach them to this issue, so we hopefully have a way to reproduce this?
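A minimal sketch of how the temp files could be archived before they are deleted. The directory arguments, the `*.tmp` pattern, the archive name and the function name `archive_temp_files` are all assumptions for illustration; they are not part of the existing workflow or of `ResetStateRecovery.cmd`.

```python
from datetime import datetime
from pathlib import Path
from zipfile import ZIP_DEFLATED, ZipFile


def archive_temp_files(temp_dir, out_dir, pattern="*.tmp"):
    """Zip every file in `temp_dir` matching `pattern`.

    Returns the path of the new archive, or None if nothing matched.
    `temp_dir`, `out_dir` and `pattern` are hypothetical; adjust them to
    wherever the workflow actually writes its state-recovery files.
    """
    temp_dir, out_dir = Path(temp_dir), Path(out_dir)
    files = sorted(p for p in temp_dir.glob(pattern) if p.is_file())
    if not files:
        return None

    out_dir.mkdir(parents=True, exist_ok=True)
    # Timestamp the archive so repeated crashes do not overwrite each other.
    stamp = datetime.now().strftime("%Y%m%dT%H%M%S")
    archive = out_dir / f"tmp_crash_{stamp}.zip"

    with ZipFile(archive, "w", compression=ZIP_DEFLATED) as zf:
        for path in files:
            # Store only the file name, not the full local path.
            zf.write(path, arcname=path.name)
    return archive
```

Running this before deleting the temp files leaves a single zip that can be attached to the issue; the original files are not modified or removed.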

@aspaNeuro also mentioned that a manual copy of data to CEPH was underway when the crash happened. In general, we should not copy data manually to CEPH while experiments are running, but if we really need to, I would recommend the following procedure:

  1. Stop the experiment
  2. Move the data folder from Tests to ProjectAeon
  3. Restart the experiment and the robocopy task

Robocopy will then automatically copy these files as part of the scheduled task.
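Step 2 of the procedure above could be scripted roughly as follows. This is only a sketch: the function name `move_data_folder` and the root paths passed to it are hypothetical, and stopping and restarting the experiment (steps 1 and 3) are still assumed to be done by hand.

```python
import shutil
from pathlib import Path


def move_data_folder(folder_name, tests_root, project_root):
    """Move `tests_root/folder_name` to `project_root/folder_name`.

    Covers step 2 only. The experiment is assumed to be stopped already,
    and both root paths are placeholders for the real Tests and
    ProjectAeon locations.
    """
    src = Path(tests_root) / folder_name
    dst = Path(project_root) / folder_name

    if not src.is_dir():
        raise FileNotFoundError(f"no such data folder: {src}")
    if dst.exists():
        # Refuse to merge into or overwrite an existing folder.
        raise FileExistsError(f"destination already exists: {dst}")

    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dst))
    return dst
```

Failing when the destination already exists is a deliberate choice here: it avoids silently mixing two sessions' data before the scheduled robocopy task picks the folder up.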

Finally, regarding hardware or CEPH issues: next time, try to note whether the computer reboot is really required, or whether deleting the temp files and following the data copy steps above is enough. The aim is to isolate whether this is an issue exclusively with the task logic, with hardware / network, or with an interaction between the two.

jkbhagatio commented 5 months ago

Here are the .tmp files from the latest crash: tmp_crash.zip

jkbhagatio commented 5 months ago

And from a second crash:

tmp_crash2.zip

jkbhagatio commented 5 months ago

Seems likely that it was triggered by maintenance / environment config loading

glopesdev commented 5 months ago

During tests today we seem to have found a sequence to somewhat reliably reproduce the issue:

  1. Delete all temp files (ResetStateRecovery.cmd)
  2. Start the workflow
  3. Start the cameras
  4. Add a subject
  5. Reload Environment

Unfortunately, the issue is not deterministic, so it must be a race condition. We did, however, determine that we can replicate the issue without changing the maintenance state, which rules out any causal link with start/stop maintenance. Reloading the environment and adding subjects both seem to be necessary at this point.

jkbhagatio commented 2 months ago

Reopening this, as @lorycalca noticed this issue again in recent phields test experiments on aeon4.

glopesdev commented 2 months ago

Duplicate of #514