DDMAL / hpc-trainer-component

MIT License

Calvo HPC job not getting into Cedar's queue #11

Open martha-thomae opened 3 years ago

martha-thomae commented 3 years ago

The Calvo HPC job (real name: Training Model for Patchwise Analysis of Music Document - HPC) got stuck in processing. This job usually takes around 4–5 hours to complete, but this time it has been processing for more than 12 hours and does not finish.

When I checked the job queue on Cedar, this job was nowhere to be found. For some reason, running the job in rodan2 does not put it into the Cedar queue, so Cedar never executes it. @JRegimbal suspects that this could be an issue with the check.py script. We have to make sure that check.py is being run frequently on Cedar, since this script checks whether there is anything in Rodan that needs to be handled. Only the person in charge of that account can check this (we suspect that might be @deepio?).

martha-thomae commented 3 years ago

This wasn't working because the Compute Canada accounts that hosted the hpc-trainer-component (Juliette's and Alex's) are no longer active. We have set it up again, now in Ich's account, and documented the instructions for making it work.

There is a new issue now:

check.py is called every hour (on the hour) because run_check.py was added to the crontab. However, even though check.py detects a new job coming from Rodan, it does not submit the new Slurm job successfully. When I execute run_check manually (./run_check), check.py does submit the new Slurm job, and eventually we get the results back into Rodan and the Rodan job is finalized.

This can be seen in the following logs:

check.log when there is no job being submitted by rodan:

See 3rd line: No job present

2021-08-25 09:01:01,963 Connection workflow succeeded: <SelectConnection OPEN transport=<pika.adapters.utils.io_services_utils._AsyncSSLTransport object at 0x7ff5e326a350> params=<ConnectionParameters host=rodan2.simssa.ca port=5671 virtual_host=/ ssl=True>>
2021-08-25 09:01:01,963 Created channel=1
2021-08-25 09:01:01,976 No job present.
2021-08-25 09:01:01,977 Closing connection (200): Normal shutdown
...
2021-08-25 09:01:01,986 User-initiated close: result=BlockingConnection__OnClosedArgs(connection=<SelectConnection CLOSED transport=None params=<ConnectionParameters host=rodan2.simssa.ca port=5671 virtual_host=/ ssl=True>>, error=ConnectionClosedByClient: (200) 'Normal shutdown')

check.log when there is a new job submitted in rodan (and, therefore, detected by the check.py):

See the 3rd line (Job received from queue) and the last two lines (the 'EXCEPTION' and the 'sbatch' error).

2021-08-25 10:00:14,869 Connection workflow succeeded: <SelectConnection OPEN transport=<pika.adapters.utils.io_services_utils._AsyncSSLTransport object at 0x7fc61e1ab190> params=<ConnectionParameters host=rodan2.simssa.ca port=5671 virtual_host=/ ssl=True>>
2021-08-25 10:00:14,870 Created channel=1
2021-08-25 10:00:14,891 Job received from queue
2021-08-25 10:00:14,892 Attempting to authenticate at https://rodan2.simssa.ca/api/auth/token/...
2021-08-25 10:00:16,610 Received code {"username":"rodan","first_name":"","last_name":"","is_superuser":true,"url":"https://rodan2.simssa.ca/api/user/2/","is_active":true,"workflow_runs":[],"token":"9b7268ea7081a6fff38399274928857df38d375a","is_staff":true,"workflows":[],"email":"","projects":[]} on authorization
2021-08-25 10:00:16,611 Token: 9b7268ea7081a6fff38399274928857df38d375a
2021-08-25 10:00:16,620 Reply queue: hpc-results
2021-08-25 10:00:16,644 Closing connection (200): Normal shutdown
...
2021-08-25 10:00:16,656 User-initiated close: result=BlockingConnection__OnClosedArgs(connection=<SelectConnection CLOSED transport=None params=<ConnectionParameters host=rodan2.simssa.ca port=5671 virtual_host=/ ssl=True>>, error=ConnectionClosedByClient: (200) 'Normal shutdown')
2021-08-25 10:00:16,658 EXCEPTION
2021-08-25 10:00:16,659 [Errno 2] No such file or directory: 'sbatch': 'sbatch'

Note that the detected job never results in a new Slurm job being submitted to the Cedar queue (compare that with the following logs).
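The "[Errno 2] No such file or directory: 'sbatch'" line is the error Python reports when it cannot find an executable by name. One plausible explanation, not confirmed in this thread, is that cron starts jobs with a much smaller PATH than an interactive login shell, so sbatch resolves when run by hand but not from the crontab. A minimal diagnostic sketch, assuming it would be called near the start of check.py (the function name explain_lookup is hypothetical and not part of the existing script):

```python
import os
import shutil


def explain_lookup(command, path=None):
    """Return (resolved_path, message) for `command` on the given PATH.

    `path` defaults to the PATH of the current process, which is what a
    cron-launched script would actually see.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    resolved = shutil.which(command, path=path)
    if resolved is None:
        message = f"{command} not found; PATH={path}"
    else:
        message = f"{command} resolved to {resolved}"
    return resolved, message


if __name__ == "__main__":
    # Log this from both the cron run and a manual run, then compare.
    print(explain_lookup("sbatch")[1])
```

If the two runs report different PATH values, setting PATH in the crontab or calling sbatch by its absolute path would be the usual next things to try.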

check.log when I manually execute ./run_check:

See the 3rd line: Job received from queue and later on the Submitted batch job JOBID

2021-08-25 10:22:50,036 Connection workflow succeeded: <SelectConnection OPEN transport=<pika.adapters.utils.io_services_utils._AsyncSSLTransport object at 0x7f73ba2ce150> params=<ConnectionParameters host=rodan2.simssa.ca port=5671 virtual_host=/ ssl=True>>
2021-08-25 10:22:50,036 Created channel=1
2021-08-25 10:22:50,052 Job received from queue
2021-08-25 10:22:50,052 Attempting to authenticate at https://rodan2.simssa.ca/api/auth/token/...
2021-08-25 10:22:53,778 Received code {"username":"rodan","first_name":"","last_name":"","is_superuser":true,"url":"https://rodan2.simssa.ca/api/user/2/","is_active":true,"workflow_runs":[],"token":"9b7268ea7081a6fff38399274928857df38d375a","is_staff":true,"workflows":[],"email":"","projects":[]} on authorization
2021-08-25 10:22:53,779 Token: 9b7268ea7081a6fff38399274928857df38d375a
2021-08-25 10:22:53,792 Reply queue: hpc-results
2021-08-25 10:23:08,758 Submitted batch job 11856681

2021-08-25 10:23:08,759 Preparing to submit dependency for job 11856681
2021-08-25 10:23:09,359 Dependency Submitted
2021-08-25 10:23:09,369 No job present.
2021-08-25 10:23:09,369 Closing connection (200): Normal shutdown
...
2021-08-25 10:23:09,382 User-initiated close: result=BlockingConnection__OnClosedArgs(connection=<SelectConnection CLOSED transport=None params=<ConnectionParameters host=rodan2.simssa.ca port=5671 virtual_host=/ ssl=True>>, error=ConnectionClosedByClient: (200) 'Normal shutdown')
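In the successful run, the "Submitted batch job 11856681" line is sbatch's confirmation output, and the job ID in it is reused for the dependency submission logged right after. How check.py does this is not shown in this thread; the following is only a sketch of the general pattern. The dependency type (afterany), the script name, and both function names are assumptions for illustration.

```python
import re

# sbatch prints a confirmation of the form "Submitted batch job <id>".
_SUBMITTED = re.compile(r"Submitted batch job (\d+)")


def parse_job_id(sbatch_output):
    """Extract the numeric job ID from sbatch's confirmation line, or None."""
    match = _SUBMITTED.search(sbatch_output)
    return match.group(1) if match else None


def dependency_command(job_id, script, sbatch="sbatch"):
    """Build (but do not run) an sbatch command that waits for job_id.

    `afterany` is an assumed dependency type and `script` is a
    hypothetical follow-up script; neither is taken from check.py.
    """
    return [sbatch, f"--dependency=afterany:{job_id}", script]
```

For example, parse_job_id("Submitted batch job 11856681\n") yields "11856681", which can then be passed to dependency_command. Passing an absolute path as the sbatch argument would also sidestep any PATH lookup.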