blumzi / LAST_issues

A place to discuss and manage LAST issues

Need a watchdog for pipeline #13

Closed EranOfek closed 2 months ago

EranOfek commented 8 months ago

The problem: MATLAB gets stuck, so the process is still alive but not operational. Suggested solution: the pipeline writes a status file (where?) at intervals of less than 30 min. A crontab script checks when this file was last updated and, if that was more than 40 min ago, kills the MATLAB process. The pipeline service then starts a new process.
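A minimal sketch of the staleness check proposed above, written in Python for illustration (the thread does not say what the crontab script would be written in). The function names, the way the PID is supplied, and the choice to treat a missing status file as stale are all assumptions, not part of the proposal.

```python
import os
import signal
import time

STALE_AFTER_S = 40 * 60  # threshold from the proposal: act if older than 40 min


def is_stale(status_path, max_age_s=STALE_AFTER_S, now=None):
    """Return True if the status file is missing or was modified too long ago."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(status_path)
    except OSError:
        # Assumption: a missing status file counts as a hung pipeline.
        return True
    return (now - mtime) > max_age_s


def kill_if_stale(status_path, pid, max_age_s=STALE_AFTER_S):
    """Send SIGKILL to pid when the status file is stale; return True if sent."""
    if not is_stale(status_path, max_age_s):
        return False
    os.kill(pid, signal.SIGKILL)
    return True
```

Treating a missing file as stale means a pipeline that has just started and not yet written its first status file could be killed, so a real script would probably need a grace period after startup.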

blumzi commented 8 months ago

last-pipeline is a systemd service. The service can be configured to expect a notification at a predefined time interval, and systemd restarts the service if it is not notified in time. To achieve this, the pipeline needs to call the sd_notify C library function periodically. If you can call this function periodically from within the pipeline, I can add the implementation.
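For reference, a sketch of what an sd_notify call does on the wire, in Python for illustration: the state string is sent as one datagram to the Unix socket named in the NOTIFY_SOCKET environment variable. This is not the AstroPack or last-tool implementation; the Python function name `sd_notify` and its `env` parameter are illustrative only.

```python
import os
import socket


def sd_notify(state, env=None):
    """Send a state string such as "WATCHDOG=1" to systemd; return True if sent."""
    env = os.environ if env is None else env
    addr = env.get("NOTIFY_SOCKET")
    if not addr:
        # Not started by systemd, or notifications are not enabled: do nothing.
        return False
    if addr.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        addr = "\0" + addr[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode("utf-8"), addr)
    return True
```

A watchdog keep-alive would then be `sd_notify("WATCHDOG=1")`, and readiness would be signalled with `sd_notify("READY=1")`.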

EastEriq commented 8 months ago

Do you perhaps mean shelling out from MATLAB, i.e. system('systemd-notify --whatever_options'), instead of calling a .so library function? Interfacing with a library is possible in MATLAB, we do it all the time with SDKs, but it involves complications. https://askubuntu.com/questions/1120023/how-to-use-systemd-notify

blumzi commented 8 months ago

That's possible as well. It is somewhat less accurate and spawns an additional process, but we can use it.
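The shell-out variant discussed here can be sketched as follows, again in Python for illustration rather than MATLAB. It runs the `systemd-notify` command-line tool with a `VARIABLE=VALUE` argument and does nothing when NOTIFY_SOCKET is unset. The function name `notify_via_cli` and its `env` and `run` parameters are hypothetical. The systemd-notify manual also notes that systemd accepts messages from this command only if `NotifyAccess=` is set appropriately on the unit, which this sketch does not address.

```python
import os
import subprocess


def notify_via_cli(state, env=None, run=subprocess.run):
    """Run `systemd-notify STATE`; return the argv used, or None if skipped."""
    env = os.environ if env is None else env
    if not env.get("NOTIFY_SOCKET"):
        # Outside systemd there is nobody to notify, so skip the extra process.
        return None
    argv = ["systemd-notify", state]
    # check=False: a failed notification should not crash the pipeline itself.
    run(argv, env=dict(env), check=False)
    return argv
```

Each call spawns a short-lived child process, which is the overhead mentioned in the reply above.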


blumzi commented 8 months ago

Pushed the following solution:

[Service]
User=ocs
WorkingDirectory=/home/ocs/matlab
ExecStart=/usr/local/share/last-tool/bin/last-pipeline start 1
ExecStop=/usr/local/share/last-tool/bin/last-pipeline stop 1
Restart=always
Environment="SYSTEMD=1"
WatchdogSec=1800

[Install]
WantedBy=multi-user.target

* The (new) tool `last-pipeline` can now show the status of the last-pipeline services.  The following is an example of both not running:
```bash
ocs@last12w:/home/ocs# last-pipeline status
Unit last-pipeline1.service could not be found.
Unit last-pipeline2.service could not be found.
```

EranOfek commented 8 months ago

Thanks. Can you move this function to tools.os?

On Tue, Jan 16, 2024 at 1:35 PM Arie Blumenzweig @.***> wrote:

Pushed the following solution:

As discussed with Eran:

  • We will have two systemd services (i.e. last-pipeline1 and last-pipeline2), one per DataDir. Each will be individually monitored by systemd.
    • We may have more in the future (e.g. PAST)
  • Added to AstroPack the capability to send sd_notify messages
  • When each of the last-pipeline[12] services is started, the bash script internally calls tools.systemd.mex.notify_ready, which informs systemd:
    • That the service is ready and will start its main workload
    • Which process ID needs to be monitored
  • The pipeline (MATLAB) code is responsible for calling tools.systemd.mex.notify_watchdog at intervals of less than 1800 seconds. It can tell whether it was started by systemd by checking for the environment variable SYSTEMD, but the notify_xxx functions do nothing if it is not set (so they are safe to call in a regular MATLAB session)
  • The systemd service files (/etc/systemd/system/last-pipeline[12]) now look as follows:

[Unit]
Description=LAST pipeline service (1 of 2)

[Service]
User=ocs
WorkingDirectory=/home/ocs/matlab
ExecStart=/usr/local/share/last-tool/bin/last-pipeline start 1
ExecStop=/usr/local/share/last-tool/bin/last-pipeline stop 1
Restart=always
Environment="SYSTEMD=1"
WatchdogSec=1800

[Install]
WantedBy=multi-user.target

  • The (new) tool last-pipeline can now show the status of the last-pipeline services. The following is an example of both not running:

ocs@last12w:/home/ocs# last-pipeline status
Unit last-pipeline1.service could not be found.
Unit last-pipeline2.service could not be found.
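The keep-alive rule described in the quoted message (notify at intervals of less than 1800 seconds, and do nothing unless SYSTEMD is set) can be sketched like this, in Python for illustration. The `Heartbeat` class and its parameters are hypothetical and are not the tools.systemd.mex API. The default interval of half the watchdog timeout follows the recommendation in the systemd documentation for `sd_watchdog_enabled`.

```python
import os
import time

WATCHDOG_SEC = 1800  # matches WatchdogSec in the unit file


class Heartbeat:
    """Call `notify` at most once per interval, and only when SYSTEMD is set."""

    def __init__(self, notify, interval_s=WATCHDOG_SEC / 2, env=None,
                 clock=time.monotonic):
        self.notify = notify          # e.g. a function that sends "WATCHDOG=1"
        self.interval_s = interval_s
        env = os.environ if env is None else env
        self.enabled = bool(env.get("SYSTEMD"))
        self.clock = clock
        self._last = None             # time of the last notification, if any

    def beat(self):
        """Call from the main loop; return True if a notification was sent."""
        if not self.enabled:
            return False              # regular session: safe no-op
        now = self.clock()
        if self._last is not None and now - self._last < self.interval_s:
            return False              # too soon since the last keep-alive
        self.notify("WATCHDOG=1")
        self._last = now
        return True
```

Calling `beat()` once per processing iteration keeps the notification rate bounded, but it only helps if no single iteration runs longer than the watchdog timeout.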


EranOfek commented 2 months ago

last-pipeline1/2 is now a service