EESSI / eessi-bot-software-layer

Bot to help with requests to add software installations to the EESSI software layer
GNU General Public License v2.0

job manager crashed while processing running jobs #193

Open boegel opened 1 year ago

boegel commented 1 year ago
/usr/lib/python3.6/site-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.16) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
job manager just started, logging to '/mnt/shared/home/bot/eessi-bot-software-layer/eessi_bot_job_manager.log', processing job ids ''
Traceback (most recent call last):
  File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/mnt/shared/home/bot/eessi-bot-software-layer/eessi_bot_job_manager.py", line 640, in <module>
    main()
  File "/mnt/shared/home/bot/eessi-bot-software-layer/eessi_bot_job_manager.py", line 609, in main
    job_manager.process_running_jobs(current_jobs[rj])
  File "/mnt/shared/home/bot/eessi-bot-software-layer/eessi_bot_job_manager.py", line 373, in process_running_jobs
    repo = gh.get_repo(repo_name)
  File "/mnt/shared/home/bot/.local/lib/python3.6/site-packages/github/MainClass.py", line 330, in get_repo
    headers, data = self.__requester.requestJsonAndCheck("GET", url)
  File "/mnt/shared/home/bot/.local/lib/python3.6/site-packages/github/Requester.py", line 355, in requestJsonAndCheck
    verb, url, parameters, headers, input, self.__customConnection(url)
  File "/mnt/shared/home/bot/.local/lib/python3.6/site-packages/github/Requester.py", line 378, in __check
    raise self.__createException(status, responseHeaders, output)
github.GithubException.GithubException: 502 {"message": "Server Error"}

Last bit of log:

[20230703-T12:22:41] job manager main loop: iteration 8971
[20230703-T12:22:41] job manager main loop: known_jobs='5705,5706'
[20230703-T12:22:41] run_subprocess(): 'get_current_jobs(): squeue command' by running '/usr/bin/squeue --long --user=bot' in directory '/mnt/shared/home/bot/eessi-bot-software-layer'
[20230703-T12:22:41] run_cmd(): Result for running '/usr/bin/squeue --long --user=bot' in 'None
           stdout 'Mon Jul 03 12:22:41 2023
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
              5705   compute bot-buil      bot  RUNNING       8:27 1-00:00:00      1 fair-mastodon-c7g-4xlarge-0001
              5706   compute bot-buil      bot  RUNNING       7:57 1-00:00:00      1 fair-mastodon-c6g-4xlarge-0002
'
           stderr ''
           exit code 0
[20230703-T12:22:41] job manager main loop: current_jobs='5705,5706'
[20230703-T12:22:41] job manager main loop: new_jobs=''
[20230703-T12:22:41] job manager main loop: running_jobs='5705,5706'
[20230703-T12:22:41] Found metadata file at /mnt/shared/home/bot/eessi-bot-software-layer/jobs/submitted/5705/_bot_job5705.metadata

boegel commented 1 year ago

It looks like there was a problem with the connection to GitHub (see also #20): the GitHub API answered the `gh.get_repo(repo_name)` call with a 502 "Server Error".

Simply restarting the bot worked fine, and finished jobs were processed.
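
Since a restart alone was enough, the 502 looks transient, so retrying the call may avoid the crash. Below is a minimal sketch of that idea, not the bot's actual code: `call_with_retry` and its parameters are hypothetical names, and the only assumption about PyGithub is that the raised `GithubException` exposes the HTTP code as a `status` attribute.

```python
import time


def call_with_retry(func, retries=3, delay=1.0, sleep=time.sleep):
    """Call func() and retry it when the error looks like a transient 5xx.

    Hypothetical helper: an error counts as transient when the exception
    has an integer `status` attribute in the range 500..599.
    """
    attempt = 0
    while True:
        try:
            return func()
        except Exception as err:
            status = getattr(err, "status", None)
            transient = isinstance(status, int) and 500 <= status < 600
            attempt += 1
            # Give up on non-5xx errors, or once all retries are used up.
            if not transient or attempt > retries:
                raise
            # Exponential backoff: delay, 2*delay, 4*delay, ...
            sleep(delay * 2 ** (attempt - 1))


# Possible use at the failing call site (assumes `gh` and `repo_name` exist):
# repo = call_with_retry(lambda: gh.get_repo(repo_name))
```

With the defaults, the call is attempted at most four times, waiting 1, 2 and 4 seconds in between, before the original exception is re-raised.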

boegel commented 1 year ago

The same crash happened again. The bot was restarted a couple of minutes after the crash.

boegel commented 11 months ago

Another crash, but not exactly the same (this time a 500 instead of a 502):

Traceback (most recent call last):
  File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/mnt/shared/home/bot/eessi-bot-software-layer/eessi_bot_job_manager.py", line 640, in <module>
    main()
  File "/mnt/shared/home/bot/eessi-bot-software-layer/eessi_bot_job_manager.py", line 609, in main
    job_manager.process_running_jobs(current_jobs[rj])
  File "/mnt/shared/home/bot/eessi-bot-software-layer/eessi_bot_job_manager.py", line 373, in process_running_jobs
    repo = gh.get_repo(repo_name)
  File "/mnt/shared/home/bot/.local/lib/python3.6/site-packages/github/MainClass.py", line 330, in get_repo
    headers, data = self.__requester.requestJsonAndCheck("GET", url)
  File "/mnt/shared/home/bot/.local/lib/python3.6/site-packages/github/Requester.py", line 355, in requestJsonAndCheck
    verb, url, parameters, headers, input, self.__customConnection(url)
  File "/mnt/shared/home/bot/.local/lib/python3.6/site-packages/github/Requester.py", line 378, in __check
    raise self.__createException(status, responseHeaders, output)
github.GithubException.GithubException: 500 null
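
In both tracebacks the exception propagates from `process_running_jobs` through `main()` to the top level, which is what terminates the job manager. One possible mitigation, sketched below with hypothetical names (`process_jobs_safely`, `process_one`, `log`), is to catch errors per job inside the main loop and leave the job for the next iteration. This assumes that running jobs are revisited on every iteration, as the log above suggests; it is not how the bot currently works.

```python
def process_jobs_safely(job_ids, process_one, log=print):
    """Run process_one(job_id) for every job id and return the ids that failed.

    Hypothetical sketch: a failure for one job is logged and skipped, so the
    remaining jobs are still processed and the caller's loop keeps running.
    """
    failed = []
    for job_id in job_ids:
        try:
            process_one(job_id)
        except Exception as err:  # deliberately broad: keep the manager alive
            log("job %s: processing failed (%s); retrying in next iteration"
                % (job_id, err))
            failed.append(job_id)
    return failed
```

Catching every exception keeps the manager running, but it can also hide real bugs, so logging the error (or narrowing the `except` clause to `GithubException`) would matter in practice.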