Alabos hot restart

odartsi commented 4 months ago

Steps to follow:

[x] 1. Remove https://github.com/CederGroupHub/alabos/blob/main/alab_management/device_manager.py completely. Instead, each task will create its own device instances every time it occupies a device. (@idocx )
1. Possible issue: some devices currently run background threads, e.g., LabmanQuadrant. In general, though, this is a poor way to handle background work because it makes debugging hard.
2. The device manager is only useful for monitoring device state, such as the argon flow of the glovebox (robotbox). In the future we will therefore implement a separate thread for this. Each device will have a method called “check_status()” that checks whether all of its parameters are within the correct range.
3. Steps:
  1. Packages affected:
    1. DeviceManager: /scripts/launch_lab.py
      1. Find everything that connects to DeviceManager
    2. DeviceClient: /lab_view.py → NOT removed, just refactored
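The “check_status()” idea from point 2 could look roughly like the sketch below. Everything here is hypothetical: `MonitoredDevice`, `Limit`, `FakeGlovebox`, `poll_once`, and the argon-flow limits are illustrative names and values, not part of alabos. The sketch only shows one way a device could report out-of-range parameters so that a single monitoring thread, rather than per-device background threads, can collect them.

```python
from dataclasses import dataclass


@dataclass
class Limit:
    """Allowed closed range for one monitored parameter."""
    low: float
    high: float


class MonitoredDevice:
    """Hypothetical device base class; not the actual alabos BaseDevice."""

    def __init__(self, name: str, limits: dict[str, Limit]):
        self.name = name
        self.limits = limits

    def read_parameters(self) -> dict[str, float]:
        """Return current readings; a real device would query hardware here."""
        raise NotImplementedError

    def check_status(self) -> dict[str, float]:
        """Return only the readings that fall outside their allowed range."""
        readings = self.read_parameters()
        return {
            key: value
            for key, value in readings.items()
            if key in self.limits
            and not (self.limits[key].low <= value <= self.limits[key].high)
        }


class FakeGlovebox(MonitoredDevice):
    """Stand-in device with a fixed reading, for illustration only."""

    def __init__(self, argon_flow: float):
        # The 1.0 to 5.0 range is an arbitrary example value.
        super().__init__("glovebox", {"argon_flow": Limit(1.0, 5.0)})
        self.argon_flow = argon_flow

    def read_parameters(self) -> dict[str, float]:
        return {"argon_flow": self.argon_flow}


def poll_once(devices: list[MonitoredDevice]) -> dict[str, dict[str, float]]:
    """One monitoring pass: map device name to its out-of-range readings."""
    report = {}
    for device in devices:
        problems = device.check_status()
        if problems:
            report[device.name] = problems
    return report
```

A monitoring thread would then just call `poll_once` in a loop with a sleep in between, which keeps all background activity in one place that is easy to debug.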
[ ] 2. Implement a reload option for importing the alab_one package. Currently, the alab_one package is imported into the AlabOS process via https://github.com/CederGroupHub/alabos/blob/main/alab_management/utils/module_ops.py#L12. We will need to implement something similar to the importlib.reload function. The new function should have the following signature. (@bernardusrendy )
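```
import importlib
import sys
from pathlib import Path


def import_module_from_path(
    path: str | Path,
    parent_package: str | None = None,
    reload: bool = False,
):
    ...
    # get package_name from path
    package_name = ...
    if reload:
        # importlib.reload() takes a module object, not a module name
        importlib.reload(sys.modules[package_name])
```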
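One caveat worth keeping in mind for this step: importlib.reload re-executes only the module object it is given, not that package's submodules. A minimal sketch of an alternative is shown below, assuming it is acceptable to drop the cached package and its submodules from sys.modules and import again. The name `load_package` is hypothetical and is not the alabos function; objects created from the old classes before the reload will still reference the old code.

```python
import importlib
import sys
from pathlib import Path
from types import ModuleType
from typing import Union


def load_package(path: Union[str, Path], reload: bool = False) -> ModuleType:
    """Import the package or module at `path`, optionally discarding cached copies."""
    path = Path(path).resolve()
    # Directory name for a package, file stem for a single-file module.
    package_name = path.stem
    parent_dir = str(path.parent)
    if parent_dir not in sys.path:
        sys.path.insert(0, parent_dir)

    if reload:
        # Drop the package and all of its submodules so they are re-executed.
        for name in list(sys.modules):
            if name == package_name or name.startswith(package_name + "."):
                del sys.modules[name]
        # Make the import system notice files created or changed on disk.
        importlib.invalidate_caches()

    return importlib.import_module(package_name)
```

Calling `load_package(path)` returns the cached module if it was imported before, while `load_package(path, reload=True)` forces the package and its submodules to be executed again from disk.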
[x] 3. Implement process restart for AlabOS. This will be done via https://github.com/CederGroupHub/alabos/blob/main/alab_management/scripts/launch_lab.py#L70. Currently, there are four processes running. We only need to restart them at a regular interval, which can be done by adding a live_time argument to each manager class, e.g., (@odartsi )

```
import time


class TaskManager:
    def __init__(self, live_time: float | None = None):
        ...
        self.live_time = live_time

    def run(self):
        start = time.time()
        # live_time=None is treated as "run indefinitely"
        while self.live_time is None or (time.time() - start) < self.live_time:
            self._loop()
```
Then, in the launch_lab function, we will need to restart each process if it exits normally.
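The restart rule could be sketched as follows. This is only an illustration under assumptions: `run_in_process` and `supervise` are hypothetical helpers, not code from launch_lab.py, and the `max_restarts` bound exists only to keep the example finite (a real launcher would presumably loop until shut down). A clean exit (code 0) is taken to mean "live_time elapsed", so the worker is started again; any other exit code stops the loop so the failure stays visible.

```python
import multiprocessing
from typing import Callable


def run_in_process(target: Callable[[], None]) -> int:
    """Run `target` in a child process, wait for it, and return its exit code."""
    proc = multiprocessing.Process(target=target)
    proc.start()
    proc.join()
    return proc.exitcode


def supervise(run_once: Callable[[], int], max_restarts: int) -> list[int]:
    """Re-run a worker while it exits normally; return the exit codes seen."""
    exit_codes = []
    for _ in range(max_restarts + 1):
        code = run_once()
        exit_codes.append(code)
        if code != 0:
            # Abnormal exit: do not restart, leave the failure for inspection.
            break
    return exit_codes
```

In a launcher this could be used as `supervise(lambda: run_in_process(manager.run), max_restarts=...)`, where `manager.run` returns once live_time has elapsed.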

bernardusrendy commented 4 months ago

New problem: tasks that have already been created and have the status WAITING/READY have not yet been run by the dramatiq actor run_task.

Note that load_definition has not been called for those tasks.

Therefore, these WAITING tasks risk a mismatch between the parameters they were submitted with and the task definition that is loaded when they finally run.

For example, suppose we submit a sample with Heating(time=720) and it is WAITING. If we then update Heating so that it no longer accepts a time argument, the old sample will run into an error.

Proposed solution: This problem is fundamentally about versioning. We will solve it by keeping a local copy of the older version each time a task is updated.
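One way the versioning idea could work is sketched below, purely as an illustration. The registry, the `register_task` decorator, the `build_task` helper, the `version` field on the stored entry, and the `duration_minutes` parameter are all hypothetical; the thread does not say how the copies of older versions would actually be stored or looked up.

```python
from typing import Any

# (task name, version) -> task class; hypothetical registry, not alabos API
_TASK_VERSIONS: dict[tuple[str, int], type] = {}


def register_task(name: str, version: int):
    """Class decorator that stores a task class under (name, version)."""
    def decorator(cls: type) -> type:
        _TASK_VERSIONS[(name, version)] = cls
        return cls
    return decorator


def build_task(entry: dict[str, Any]):
    """Instantiate a stored task with the version recorded at submission."""
    cls = _TASK_VERSIONS[(entry["type"], entry["version"])]
    return cls(**entry["parameters"])


@register_task("Heating", version=1)
class HeatingV1:
    # Older definition, kept so that already-submitted tasks still load.
    def __init__(self, time: float):
        self.time = time


@register_task("Heating", version=2)
class HeatingV2:
    # Newer definition that no longer accepts `time`.
    def __init__(self, duration_minutes: float):
        self.duration_minutes = duration_minutes
```

With this layout, a WAITING entry such as `{"type": "Heating", "version": 1, "parameters": {"time": 720}}` is still built with `HeatingV1`, whereas passing the same parameters to `HeatingV2` would raise a `TypeError`.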

bernardusrendy commented 4 months ago

TODOs:

Following this update, the alab_one device definitions should be updated so that they no longer contain any threading.
More to come..

Alabos hot restart #75