argonne-lcf / balsam

High throughput workflows and automation for HPC

Unable to sync site as daemon hangs #340

Open s-sajid-ali opened 1 year ago

s-sajid-ali commented 1 year ago

On alcf-theta-knl the error is:

(py3) sajid@thetalogin5:/projects/HEP_on_HPC/sajid/icarus_hepnos/balsam> balsam site sync
Updated site.
Restarting Site /lus/theta-fs0/projects/HEP_on_HPC/sajid/icarus_hepnos/balsam
Sent SIGTERM to Balsam site daemon [pid 7442]
Waiting for site /lus/theta-fs0/projects/HEP_on_HPC/sajid/icarus_hepnos/balsam to shutdown...
  [#################################---]   91%  00:00:01
Usage: balsam site sync [OPTIONS]
Try 'balsam site sync --help' for help.

Error: Site daemon did not shut down gracefully on its own; please kill it manually and delete /lus/theta-fs0/projects/HEP_on_HPC/sajid/icarus_hepnos/balsam/balsam-service.pid
(py3) sajid@thetalogin5:/projects/HEP_on_HPC/sajid/icarus_hepnos/balsam> 

A daemon that hangs causes job creation to fail silently, i.e. a job with no pre-processing step shows up as STAGED_IN instead of PREPROCESSED, causing the workflow to fail.

Is there a way to enable better diagnostics to warn users when the daemon hangs? I'm seeing this occur frequently on alcf-theta-knl.

cms21 commented 1 year ago

Thanks for flagging this @s-sajid-ali . What version of Balsam are you running? You can query the version at the command line with balsam --version.

s-sajid-ali commented 1 year ago

balsam, version 0.7.0.a22

s-sajid-ali commented 1 year ago

A daemon that hangs causes job creation to fail silently, i.e. a job with no pre-processing step shows up as STAGED_IN instead of PREPROCESSED, causing the workflow to fail.

This bug is occurring more frequently than expected and I'd like to fix it before submitting some large workloads. One approach I tried was to manually push the job to the PREPROCESSED state via:

balsam job modify --id ... -s PREPROCESSED

and while this does set the job to the correct state, a corresponding workdir is not created. Submitting a new queue with this job seems to attempt to start the job (per the mpi_mode logs), but neither the workdir nor a corresponding log is created. Could you let me know if I'm missing some step that creates the associated workdir when the job state is advanced? Thanks!

While I see that all jobs inherit an empty preprocess() function that simply advances a job from STAGED_IN to PREPROCESSED, I'm confused that the script used to create the jobs sometimes leaves this job in the STAGED_IN phase (which stalls the whole workflow, since this is the base job) and at other times produces it in the PREPROCESSED phase (the correct phase for this job, which allows the workflow to run as designed).

Could you give me some pointers on how the job creation process decides the phase of a job, so I can try debugging it to understand the root cause of this behavior? Thanks in advance!

s-sajid-ali commented 1 year ago

With a known pid from balsam-service.pid, when I try to replicate the code in balsam.cmdline.utils.kill_site, I see the following stack trace:

(py3) sajid@thetalogin6:/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam> python 
Python 3.10.10 | packaged by conda-forge | (main, Mar 24 2023, 20:08:06) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import psutil
>>> service_proc = psutil.Process(pid=8039)
>>> service_proc.terminate()
>>> service_proc.wait(timeout=1)
Traceback (most recent call last):
  File "/lus/swift/home/sajid/packages/mambaforge/envs/py3/lib/python3.10/site-packages/psutil/_psposix.py", line 121, in wait_pid
    retpid, status = os.waitpid(pid, flags)
ChildProcessError: [Errno 10] No child processes

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/lus/swift/home/sajid/packages/mambaforge/envs/py3/lib/python3.10/site-packages/psutil/__init__.py", line 1265, in wait
    self._exitcode = self._proc.wait(timeout)
  File "/lus/swift/home/sajid/packages/mambaforge/envs/py3/lib/python3.10/site-packages/psutil/_pslinux.py", line 1645, in wrapper
    return fun(self, *args, **kwargs)
  File "/lus/swift/home/sajid/packages/mambaforge/envs/py3/lib/python3.10/site-packages/psutil/_pslinux.py", line 1851, in wait
    return _psposix.wait_pid(self.pid, timeout, self._name)
  File "/lus/swift/home/sajid/packages/mambaforge/envs/py3/lib/python3.10/site-packages/psutil/_psposix.py", line 132, in wait_pid
    interval = sleep(interval)
  File "/lus/swift/home/sajid/packages/mambaforge/envs/py3/lib/python3.10/site-packages/psutil/_psposix.py", line 114, in sleep
    raise TimeoutExpired(timeout, pid=pid, name=proc_name)
psutil.TimeoutExpired: timeout after 1 seconds (pid=8039)
>>> 

More specifically, it looks like some child processes are in a zombie state:

>>> service_proc.children()
[
psutil.Process(pid=8043, name='python', status='sleeping', started='21:54:10'), 
psutil.Process(pid=8044, name='python', status='zombie', started='21:54:10'), 
psutil.Process(pid=8045, name='python', status='sleeping', started='21:54:10'), 
psutil.Process(pid=8046, name='python', status='zombie', started='21:54:10')
]
>>> 

ps -aux shows that these processes are defunct:

sajid     8039  0.0  0.0 240192 37812 pts/12   Sl   21:54   0:02 /lus/swift/home/sajid/packages/mambaforge/envs/py3/bin/python -m balsam.site.service.main
sajid     8043  0.0  0.0 240192 32740 pts/12   S    21:54   0:00 /lus/swift/home/sajid/packages/mambaforge/envs/py3/bin/python -m balsam.site.service.main
sajid     8044  0.0  0.0      0     0 pts/12   Z    21:54   0:00 [python] <defunct>
sajid     8045  0.0  0.0 240192 33108 pts/12   Sl   21:54   0:00 /lus/swift/home/sajid/packages/mambaforge/envs/py3/bin/python -m balsam.site.service.main
sajid     8046  0.0  0.0      0     0 pts/12   Z    21:54   0:00 [python] <defunct>

os.kill(..., signal.SIGKILL) does succeed in terminating the parent process, but leaves one child process behind (shown below):

sajid     8045  0.0  0.0 240192 33268 pts/12   Sl   21:54   0:00 /lus/swift/home/sajid/packages/mambaforge/envs/py3/bin/python -m balsam.site.service.main
cms21 commented 1 year ago

Okay, I agree that this issue is probably related to #364.

To find the process group id, you can do this at the command line: ps -eo pgrp,pid | grep <balsam pid>

The first number is the process group id. You can kill all processes in the group (so should include child processes) by typing this at the command line: kill -9 -<process group id>
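
For convenience, here is a rough Python equivalent of the two shell commands above, using psutil (the same library as in the earlier snippets). This is only a sketch: the pid argument is the value read from balsam-service.pid, and escalating to SIGKILL is an assumption about what may be needed if SIGTERM is ignored, not something the Balsam CLI does for you.

    # Rough sketch, not part of the Balsam CLI: terminate the service process
    # and all of its children, escalating to SIGKILL for anything that survives.
    import psutil

    def kill_service_tree(pid: int, timeout: float = 5.0) -> None:
        parent = psutil.Process(pid)               # pid from balsam-service.pid
        procs = parent.children(recursive=True) + [parent]
        for p in procs:
            p.terminate()                          # send SIGTERM first
        gone, alive = psutil.wait_procs(procs, timeout=timeout)
        for p in alive:
            p.kill()                               # escalate to SIGKILL
        # Note: zombie children only disappear once their parent exits or reaps them.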

cms21 commented 1 year ago

A few questions:

cms21 commented 1 year ago

To answer some of your questions about job creation:

Unless the job has a transfer as part of it (which I don't think you have?), jobs are all created in the STAGED_IN state. If there is no user-specified preprocess() in the application definition (your case), the jobs should advance almost immediately to the PREPROCESSED state. So when you observe that some of your jobs appear to be created in the PREPROCESSED state, I think you are actually just missing the STAGED_IN state because it is so brief.
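
For reference, here is a minimal sketch of what an app with no custom preprocess hook effectively boils down to. The import paths and class attributes below are assumed from the Balsam 0.7 API docs, and the site name and command are placeholders; the point is that the inherited hook does nothing except flip the state, which is why the STAGED_IN window is normally very short.

    # Illustrative only (import paths assumed from the Balsam 0.7 docs): an app
    # with no user-defined preprocess inherits a hook that does no staging work
    # and simply advances the job, so STAGED_IN -> PREPROCESSED happens almost
    # immediately once the ProcessingService picks the job up.
    from balsam.api import ApplicationDefinition
    from balsam.schemas import JobState

    class TrivialApp(ApplicationDefinition):
        site = "split_test_balsam"                 # site name assumed for the example
        command_template = "echo {{ message }}"    # placeholder command

        def preprocess(self) -> None:
            # Equivalent to the inherited default: no work, just advance the state.
            self.job.state = JobState.preprocessed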

So what prevents the jobs from advancing to the PREPROCESSED state? Either the zombie process is somehow not running the code that advances the job state, or there's an uncaught exception in that code. Look in your most recent service logs in the log directory (they will have file names like log/service_<date & time>.log). Are there any errors or Python tracebacks there?

s-sajid-ali commented 1 year ago

Thanks a lot for the all the pointers! I haven't seen the service hang today but I'll inspect the logs and provide them if it hangs again.

I see in the different command-line snippets you've pasted that you are switching between login nodes 4 and 6. Is it possible that you have daemon processes running on both nodes?

When the service hangs, I typically do pkill -u ... to clean up all processes I have open and log in again to ensure that a stray service thread is not left running. I can add an extra cautionary step of logging back in and checking whether any stray/zombie processes were left over from the previous session.

Thanks for the detailed explanation of the job creation process. I did notice that the steps with a non-trivial preprocess were logged in the service logs, but I don't remember seeing any exceptions thrown from the thread that was supposed to advance the job stuck in the STAGED_IN state. I'll be sure to check the logs more thoroughly when this occurs again.

cms21 commented 1 year ago

Okay, if you have a chance to look in the logs and you find anything, pass it along. If there's an uncaught exception happening in the code that advances jobs, we'd like to know so we can fix it.

s-sajid-ali commented 1 year ago

Apologies for the erroneous post about a failure; I found a mistake in the logic that generated jobs with the same name. I'll post logs if I see a job getting stuck after fixing it.

I'd forgotten to change the workdir when generating the jobs with different parameters, and that might have been why the second job with the same name was not advanced.

s-sajid-ali commented 1 year ago

I saw this error when trying to delete a site with stuck jobs (I wanted a fresh site to reproduce the bug with minimal logs to sift through):

(py3) sajid@thetalogin4:/projects/HEP_on_HPC/sajid/icarus_hepnos> balsam site rm /projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam
Do you really want to destroy split_test_balsam? This will wipe out 22 jobs inside! [y/N]: y
Deleted site /projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam
(py3) sajid@thetalogin4:/projects/HEP_on_HPC/sajid/icarus_hepnos> Exception in thread Thread-1:
Traceback (most recent call last):
  File "/lus/swift/home/sajid/packages/mambaforge/envs/py3/lib/python3.11/site-packages/balsam/client/requests_client.py", line 97, in request
    response = self._do_request(absolute_url, http_method, params, json, data)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lus/swift/home/sajid/packages/mambaforge/envs/py3/lib/python3.11/site-packages/balsam/client/requests_client.py", line 129, in _do_request
    self._raise_with_explanation(response)
  File "/lus/swift/home/sajid/packages/mambaforge/envs/py3/lib/python3.11/site-packages/balsam/client/requests_client.py", line 149, in _raise_with_explanation
    response.raise_for_status()
  File "/lus/swift/home/sajid/packages/mambaforge/envs/py3/lib/python3.11/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found
{'error': 'Not found'} for url: https://balsam-dev.alcf.anl.gov/sessions/17995

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/lus/swift/home/sajid/packages/mambaforge/envs/py3/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/lus/swift/home/sajid/packages/mambaforge/envs/py3/lib/python3.11/site-packages/balsam/site/job_source.py", line 49, in run
    self.session.tick()
  File "/lus/swift/home/sajid/packages/mambaforge/envs/py3/lib/python3.11/site-packages/balsam/_api/bases.py", line 394, in tick
    return self.__class__.objects._do_tick(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lus/swift/home/sajid/packages/mambaforge/envs/py3/lib/python3.11/site-packages/balsam/_api/bases.py", line 411, in _do_tick
    self._client.put(self._api_path + f"{instance.id}")
  File "/lus/swift/home/sajid/packages/mambaforge/envs/py3/lib/python3.11/site-packages/balsam/client/rest_base_client.py", line 77, in put
    return self.request(url, "PUT", json=jsonable_encoder(kwargs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lus/swift/home/sajid/packages/mambaforge/envs/py3/lib/python3.11/site-packages/balsam/client/requests_client.py", line 103, in request
    self.backoff(exc)
  File "/lus/swift/home/sajid/packages/mambaforge/envs/py3/lib/python3.11/site-packages/balsam/client/requests_client.py", line 76, in backoff
    raise TimeoutError(f"Exceeded max retries: {reason}")
TimeoutError: Exceeded max retries: 404 Client Error: Not Found
{'error': 'Not found'} for url: https://balsam-dev.alcf.anl.gov/sessions/17995
Process FixedDepthJobSource-2:
Traceback (most recent call last):
  File "/lus/swift/home/sajid/packages/mambaforge/envs/py3/lib/python3.11/site-packages/balsam/client/requests_client.py", line 97, in request
    response = self._do_request(absolute_url, http_method, params, json, data)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lus/swift/home/sajid/packages/mambaforge/envs/py3/lib/python3.11/site-packages/balsam/client/requests_client.py", line 129, in _do_request
    self._raise_with_explanation(response)
  File "/lus/swift/home/sajid/packages/mambaforge/envs/py3/lib/python3.11/site-packages/balsam/client/requests_client.py", line 149, in _raise_with_explanation
    response.raise_for_status()
  File "/lus/swift/home/sajid/packages/mambaforge/envs/py3/lib/python3.11/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found
{'error': 'Not found'} for url: https://balsam-dev.alcf.anl.gov/sessions/17995

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/lus/swift/home/sajid/packages/mambaforge/envs/py3/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/lus/swift/home/sajid/packages/mambaforge/envs/py3/lib/python3.11/site-packages/balsam/util/process.py", line 17, in run
    self._run()
  File "/lus/swift/home/sajid/packages/mambaforge/envs/py3/lib/python3.11/site-packages/balsam/site/job_source.py", line 135, in _run
    jobs = self.session.acquire_jobs(**params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lus/swift/home/sajid/packages/mambaforge/envs/py3/lib/python3.11/site-packages/balsam/_api/bases.py", line 381, in acquire_jobs
    return self.__class__.objects._do_acquire(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lus/swift/home/sajid/packages/mambaforge/envs/py3/lib/python3.11/site-packages/balsam/_api/bases.py", line 406, in _do_acquire
    acquired_raw = self._client.post(self._api_path + f"{instance.id}", **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lus/swift/home/sajid/packages/mambaforge/envs/py3/lib/python3.11/site-packages/balsam/client/rest_base_client.py", line 71, in post
    return self.request(url, "POST", json=jsonable_encoder(kwargs), authenticating=authenticating)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lus/swift/home/sajid/packages/mambaforge/envs/py3/lib/python3.11/site-packages/balsam/client/requests_client.py", line 103, in request
    self.backoff(exc)
  File "/lus/swift/home/sajid/packages/mambaforge/envs/py3/lib/python3.11/site-packages/balsam/client/requests_client.py", line 76, in backoff
    raise TimeoutError(f"Exceeded max retries: {reason}")
TimeoutError: Exceeded max retries: 404 Client Error: Not Found
{'error': 'Not found'} for url: https://balsam-dev.alcf.anl.gov/sessions/17995

I cancelled the operation and stopped the (zombie) processes individually. Could the above stack trace be the reason some of the Python processes spawned by the balsam client become zombies?

s-sajid-ali commented 1 year ago

With a fresh site, creating two identical sets of jobs whose definitions differ only in one parameter, I see that one of the jobs to start a server is created in the PREPROCESSED state, while the other is stuck in the STAGED_IN state (still there ~5 minutes after creation, when it should have moved to PREPROCESSED within a minute):

(py3) sajid@thetalogin4:/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam> balsam job ls
ID         Site                App               Workdir                           State              Tags  
32245737   split_test_balsam   HEPnOS_server     server_hepnos-sajid-1             PREPROCESSED       {}    
32245738   split_test_balsam   HEPnOS_list_dbs   connection_hepnos-sajid-1         STAGED_IN          {}    
32245739   split_test_balsam   HEPnOS_queue      queue_hepnos-sajid-1              AWAITING_PARENTS   {}    
32245740   split_test_balsam   HEPnOS_load       detsim_00.root_0_hepnos-sajid-1   AWAITING_PARENTS   {}    
32245741   split_test_balsam   HEPnOS_load       detsim_00.root_1_hepnos-sajid-1   AWAITING_PARENTS   {}    
32245742   split_test_balsam   HEPnOS_load       detsim_01.root_0_hepnos-sajid-1   AWAITING_PARENTS   {}    
32245743   split_test_balsam   HEPnOS_load       detsim_01.root_1_hepnos-sajid-1   AWAITING_PARENTS   {}    
32245744   split_test_balsam   HEPnOS_load       detsim_02.root_0_hepnos-sajid-1   AWAITING_PARENTS   {}    
32245745   split_test_balsam   HEPnOS_load       detsim_02.root_1_hepnos-sajid-1   AWAITING_PARENTS   {}    
32245746   split_test_balsam   HEPnOS_process    process_hepnos-sajid-1            AWAITING_PARENTS   {}    
32245747   split_test_balsam   HEPnOS_shutdown   shutdown_hepnos-sajid-1           AWAITING_PARENTS   {}    
32245748   split_test_balsam   HEPnOS_server     server_hepnos-sajid-2             STAGED_IN          {}    
32245749   split_test_balsam   HEPnOS_list_dbs   connection_hepnos-sajid-2         STAGED_IN          {}    
32245750   split_test_balsam   HEPnOS_queue      queue_hepnos-sajid-1              AWAITING_PARENTS   {}    
32245751   split_test_balsam   HEPnOS_load       detsim_00.root_0_hepnos-sajid-1   AWAITING_PARENTS   {}    
32245752   split_test_balsam   HEPnOS_load       detsim_00.root_1_hepnos-sajid-1   AWAITING_PARENTS   {}    
32245753   split_test_balsam   HEPnOS_load       detsim_01.root_0_hepnos-sajid-1   AWAITING_PARENTS   {}    
32245754   split_test_balsam   HEPnOS_load       detsim_01.root_1_hepnos-sajid-1   AWAITING_PARENTS   {}    
32245755   split_test_balsam   HEPnOS_load       detsim_02.root_0_hepnos-sajid-1   AWAITING_PARENTS   {}    
32245756   split_test_balsam   HEPnOS_load       detsim_02.root_1_hepnos-sajid-1   AWAITING_PARENTS   {}    
32245757   split_test_balsam   HEPnOS_process    process_hepnos-sajid-1            AWAITING_PARENTS   {}    
32245758   split_test_balsam   HEPnOS_shutdown   shutdown_hepnos-sajid-1           AWAITING_PARENTS   {}    
(py3) sajid@thetalogin4:/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam> 

The only differences between the two server jobs are the parameter and the workdir:

(py3) sajid@thetalogin4:/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam> diff <(balsam job ls -v --id 32245737) <(balsam job ls -v --id 32245748)
1c1
< workdir: server_hepnos-sajid-1
---
> workdir: server_hepnos-sajid-2
3c3
< serialized_parameters: gASVewAAAAAAAAB9lCiMB3Bkb21haW6UjA5oZXBub3Mtc2FqaWQtMZSMDWhlcG5vc19jb25maWeUjEgvcHJvamVjdHMvSEVQX29uX0hQQy9zYWppZC9pY2FydXNfaGVwbm9zL2ljYXJ1c3dmL2J1aWxkL3Rlc3QvaGVwbm9zLmpzb26UdS4=
---
> serialized_parameters: gASVewAAAAAAAAB9lCiMB3Bkb21haW6UjA5oZXBub3Mtc2FqaWQtMpSMDWhlcG5vc19jb25maWeUjEgvcHJvamVjdHMvSEVQX29uX0hQQy9zYWppZC9pY2FydXNfaGVwbm9zL2ljYXJ1c3dmL2J1aWxkL3Rlc3QvaGVwbm9zLmpzb26UdS4=
14c14
< id: 32245737
---
> id: 32245748
18,19c18,19
< last_update: '2023-07-03T17:33:11.414464+00:00'
< state: PREPROCESSED
---
> last_update: '2023-07-03T17:33:23.164831+00:00'
> state: STAGED_IN
24c24
< deserialized parameters: {'pdomain': 'hepnos-sajid-1', 'hepnos_config': '/projects/HEP_on_HPC/sajid/icarus_hepnos/icaruswf/build/test/hepnos.json'}
---
> deserialized parameters: {'pdomain': 'hepnos-sajid-2', 'hepnos_config': '/projects/HEP_on_HPC/sajid/icarus_hepnos/icaruswf/build/test/hepnos.json'}
(py3) sajid@thetalogin4:/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam> 

Here are the two service logs from the site: [1] service_2023-07-03_173234.log [2] service_2023-07-03_173247.log

I'd run balsam site sync to push some changes to the server, which is why there are two log files. Most of the relevant information would be in the second log file.

PS: I tried moving job 32245748 to the PREPROCESSED state via balsam job modify --id 32245748 -s PREPROCESSED, and while balsam job ls shows that it is now in the PREPROCESSED state, the corresponding workdir has not been created, nor do the logs have any record of this change:

(py3) sajid@thetalogin4:/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam> balsam job ls | grep 32245748
32245748   split_test_balsam   HEPnOS_server     server_hepnos-sajid-2             PREPROCESSED       {}    
(py3) sajid@thetalogin4:/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam> cat log/service_2023-07-03_173247.log | grep 32245748
(py3) sajid@thetalogin4:/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam> ls data/
connection_hepnos-sajid-1/  server_hepnos-sajid-1/
(py3) sajid@thetalogin4:/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam> 

Confirming the hypothesis that the root cause of the issue is a subset of the balsam client's processes becoming unresponsive, balsam site sync fails with the same message quoted in the earlier comments on this issue. As described earlier, listing the processes shows a couple of zombie/defunct Python processes that are terminated along with the parent process. Unfortunately, they do not write anything to the logs before they hang, leaving no clue as to what caused their suspension.

More specifically, the processes associated with the ProcessingService are the ones that become unresponsive, since the logs show graceful termination of the TransferService and the SchedulerService.

cms21 commented 1 year ago

Hi @s-sajid-ali, sorry for the delay. I'm going to try to track down those backtraces, but a few more questions:

s-sajid-ali commented 1 year ago

In the workdir for each job there should be a file called balsam.log. In the workdir of job 32245748 (the job that's stuck in the STAGED_IN state), what's in that balsam.log file?

The directory for that job was not created:

(py3) sajid@thetalogin6:/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam> pwd
/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam
(py3) sajid@thetalogin6:/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam> ls data/
connection_hepnos-sajid-1/  connection_hepnos-sajid-2/  server_hepnos-sajid-1/
(py3) sajid@thetalogin6:/projects/HEP_on_HPC/sajid/icarus_hepnos/split_test_balsam> 

Are you able to share the script you are using to create the jobs? I'd like to try and reproduce this on theta.

Attaching them here: site_def.py.txt hepnos_apps_defs.py.txt hepnos_job_server.py.txt hepnos_jobs_remaining.py.txt

The workflow was:

python hepnos_apps_defs.py
python hepnos_job_server.py
python hepnos_jobs_remaining.py

(I split the jobs into two files to see if submitting the server job as a standalone job helps prevent the issue.)

Thanks again for helping to track this bug down and no worries about the delay.

masalim2 commented 1 year ago

Hopefully I can provide some helpful context, with an explanation of why you cannot delete jobs. I think part of what you are seeing is that jobs are filtered out of the Delete operation if they are still tied to a Session on the backend. This ensures that jobs currently being processed by a site don't get deleted mid-execution.

The ProcessingService and Launchers create a temporary Session resource and then use it to acquire jobs. This is handled by the JobSource client classes.

On the backend, the session API uses an atomic locking operation (SELECT FOR UPDATE ... SKIP LOCKED) to assign jobs to sessions without race conditions: this is how multiple launchers can divide a workload automatically, without processing any duplicate jobs.

Rather than keeping an explicit lock in the database (which would require long-lived transactions and be detrimental to performance), however, the Session behaves like a temporary lock or lease: it has an expiration date of a few minutes, and the Launcher/Processing services have a background thread that sends a heartbeat every minute to refresh the lease (https://github.com/argonne-lcf/balsam/blob/main/balsam/site/job_source.py#L25).

While the job is associated with a live Session, it cannot be deleted or acquired by another session. At the same time, we want to make sure that all sessions eventually get cleaned up. If the JobSource crashes and stops sending heartbeats, the session eventually becomes stale and will automatically get deleted the next time that a different session comes online.
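
To make the lease mechanics concrete, here is a stripped-down sketch of the heartbeat pattern described above. It is illustrative only: the real implementation lives in balsam/site/job_source.py, and the class and method names below (other than session.tick(), which appears in the traceback earlier in this thread) are simplified stand-ins.

    # Illustrative sketch of the session lease/heartbeat pattern (not the actual
    # Balsam source): a background thread refreshes the lease so the backend
    # keeps the session, and the jobs it has acquired, locked.
    import threading

    class SessionKeepAlive(threading.Thread):
        def __init__(self, session, interval_sec: float = 60.0):
            super().__init__(daemon=True)
            self.session = session
            self.interval_sec = interval_sec
            self._stop = threading.Event()

        def run(self) -> None:
            # Each tick extends the session's expiration on the backend.
            # If this thread dies and the heartbeats stop, the session goes
            # stale and is cleaned up the next time another session comes
            # online, releasing its jobs for deletion or re-acquisition.
            while not self._stop.wait(self.interval_sec):
                self.session.tick()

        def stop(self) -> None:
            self._stop.set()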

To summarize:

cms21 commented 1 year ago

@s-sajid-ali Thanks for sending your scripts. I think I've been able to reproduce the issue. I think the problem is the preprocessing hook for the HEPnOS_list_dbs app.

This is what you sent:

    def preprocess(self) -> None:
        ssg_src_file = Path(
            "/projects/HEP_on_HPC/sajid/icarus_hepnos/"
            + site_def.SITE_NAME
            + "/data/server/hepnos.ssg"
        )
        while True:
            time.sleep(5)
            print("checking to see if hepnos.ssg exists at source location")
            if ssg_src_file.exists():
                print("hepnos.ssg exists at source location, preprocess done!")
                workdir: PurePath = PurePath(Path.cwd())
                ssg_dst_file = Path(workdir.joinpath("hepnos.ssg"))
                shutil.copyfile(ssg_src_file, ssg_dst_file)

                # Ready to run
                self.job.state = JobState.preprocessed
                break
        return

I think what's happening is that the site process is stuck in that loop, looking for that file but never finding it. It cannot move on to any of the other jobs while it's stuck there, and it doesn't raise an exception because it is executing valid code. This may also be related to the problems with deleting the jobs.
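
One possible way to keep a wait like this from stalling the whole service is to bound it. Below is a hedged sketch only, reusing the names (Path, shutil, time, site_def, JobState) assumed from your attached app definitions; whether a job left in STAGED_IN gets re-processed on a later service pass is something we'd want to verify rather than rely on, so this version fails the job explicitly on timeout.

    # Hedged sketch, not a drop-in fix: same logic as the hook above, but with
    # a bounded wait so a missing hepnos.ssg cannot block the ProcessingService
    # indefinitely. Marking the job FAILED on timeout is just one option.
    def preprocess(self) -> None:
        ssg_src_file = Path(
            "/projects/HEP_on_HPC/sajid/icarus_hepnos/"
            + site_def.SITE_NAME
            + "/data/server/hepnos.ssg"
        )
        deadline = time.monotonic() + 600          # give up after 10 minutes
        while time.monotonic() < deadline:
            if ssg_src_file.exists():
                shutil.copyfile(ssg_src_file, Path.cwd() / "hepnos.ssg")
                self.job.state = JobState.preprocessed   # ready to run
                return
            time.sleep(5)
        # Timed out: fail the job so the rest of the workflow is not silently
        # blocked (an alternative is to leave the state unchanged and retry).
        self.job.state = JobState.failed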

It would be helpful to understand a bit more about your workflow. How is the ssg_src_file created? If you'd like to have a meeting to talk through what you're trying to do, feel free to email me: csimpson@anl.gov.