ansible / ansibullbot

Bot for management of Ansible issues and PRs on GitHub.
GNU General Public License v3.0
203 stars 126 forks source link

Bot occasionally hangs #1585

Closed samdoran closed 3 years ago

samdoran commented 3 years ago

The bot process occasinoally hangs but does not die. I have had to restart it twice now. The process is still running, but does not appear to be doing any work. I don't see anything obvious, but here are the last lines in the log file at the time the process was idle:

2021-07-26 19:15:07,981 INFO 2991236 ansible.py:_should_skip_issue:178 skipping: no changes since last run
2021-07-26 19:15:07,981 INFO 2991236 ansible.py:run:335 finished triage for <ansibullbot.wrappers.issuewrapper.IssueWrapper object at 0x7f9259571e80> in 0.004019s
2021-07-26 19:15:07,982 INFO 2991236 ansible.py:run:262 starting triage for https://github.com/ansible/ansible/issues/75321
2021-07-26 19:15:07,983 INFO 2991236 ansible.py:_should_skip_issue:178 skipping: no changes since last run
2021-07-26 19:15:07,983 INFO 2991236 ansible.py:run:335 finished triage for <ansibullbot.wrappers.issuewrapper.IssueWrapper object at 0x7f925cde66a0> in 0.001264s
2021-07-26 19:15:07,983 INFO 2991236 ansible.py:run:262 starting triage for https://github.com/ansible/ansible/issues/75315
2021-07-26 19:15:07,984 INFO 2991236 ansible.py:_should_skip_issue:178 skipping: no changes since last run
2021-07-26 19:15:07,984 INFO 2991236 ansible.py:run:335 finished triage for <ansibullbot.wrappers.issuewrapper.IssueWrapper object at 0x7f925e289520> in 0.000902s
2021-07-26 19:15:07,985 INFO 2991236 ansible.py:run:262 starting triage for https://github.com/ansible/ansible/pull/75314
2021-07-26 19:15:07,985 INFO 2991236 ansible.py:run:281 creating CI wrapper
2021-07-26 19:15:07,992 DEBUG 2991236 github.py:inner:117 qcounter: 79
2021-07-26 19:15:07,992 DEBUG 2991236 github.py:inner:121 ratelimited call #1 [<class 'ansibullbot.wrappers.issuewrapper.IssueWrapper'>] [pullrequest_check_runs] [4850]
2021-07-26 19:15:07,992 INFO 2991236 issuewrapper.py:pullrequest_check_runs:399 fetching pull request check runs
2021-07-26 19:15:07,998 DEBUG 2991236 github.py:inner:117 qcounter: 80
2021-07-26 19:15:07,998 DEBUG 2991236 github.py:inner:121 ratelimited call #1 [<class 'ansibullbot.wrappers.issuewrapper.IssueWrapper'>] [get_commits] [4849]
2021-07-26 19:15:07,998 DEBUG 2991236 issuewrapper.py:pullrequest:383 @pullrequest.get_pullrequest #75314
2021-07-26 19:15:08,004 DEBUG 2991236 github.py:inner:117 qcounter: 81
2021-07-26 19:15:08,005 DEBUG 2991236 github.py:inner:121 ratelimited call #1 [<class 'ansibullbot.wrappers.ghapiwrapper.RepoWrapper'>] [get_pullrequest] [4848]
2021-07-26 19:15:08,227 DEBUG 2991236 connectionpool.py:_make_request:433 https://api.github.com:443 "GET /repos/ansible/ansible/pulls/75314 HTTP/1.1" 200 None
2021-07-26 19:15:08,367 DEBUG 2991236 connectionpool.py:_make_request:433 https://api.github.com:443 "GET /repos/ansible/ansible/pulls/75314/commits HTTP/1.1" 200 None
2021-07-26 19:15:08,723 DEBUG 2991236 connectionpool.py:_make_request:433 https://api.github.com:443 "GET /repos/ansible/ansible/commits/db8d039a28745a1626196cb9d3a2af212397b3de/check-runs HTTP/1.1" 200 None
2021-07-26 19:15:08,942 DEBUG 2991236 connectionpool.py:_make_request:433 https://api.github.com:443 "GET /repositories/3638964/commits/db8d039a28745a1626196cb9d3a2af212397b3de/check-runs?page=2 HTTP/1.1" 200 None
2021-07-26 19:15:09,188 DEBUG 2991236 connectionpool.py:_make_request:433 https://api.github.com:443 "GET /repositories/3638964/commits/db8d039a28745a1626196cb9d3a2af212397b3de/check-runs?page=3 HTTP/1.1" 200 None
2021-07-26 19:15:09,415 DEBUG 2991236 connectionpool.py:_make_request:433 https://api.github.com:443 "GET /repositories/3638964/commits/db8d039a28745a1626196cb9d3a2af212397b3de/check-runs?page=4 HTTP/1.1" 200 None
2021-07-26 19:15:09,537 DEBUG 2991236 connectionpool.py:_make_request:433 https://api.github.com:443 "GET /repositories/3638964/commits/db8d039a28745a1626196cb9d3a2af212397b3de/check-runs?page=5 HTTP/1.1" 200 None
2021-07-26 19:15:09,539 DEBUG 2991236 azp.py:build_id:82 Azure Pipelines build IDs found: {20839}
2021-07-26 19:15:09,540 DEBUG 2991236 net_tools.py:fetch:35 https://dev.azure.com/ansible/fc938e26-c865-4d62-b731-a77bbe8bf5c4/_apis/build/builds/20839/timeline/?api-version=6.0
2021-07-26 19:15:09,540 INFO 2991236 net_tools.py:_inner_fetch:22 get https://dev.azure.com/ansible/fc938e26-c865-4d62-b731-a77bbe8bf5c4/_apis/build/builds/20839/timeline/?api-version=6.0
2021-07-26 19:15:09,542 DEBUG 2991236 connectionpool.py:_new_conn:939 Starting new HTTPS connection (1): dev.azure.com:443
2021-07-26 19:15:10,980 DEBUG 2991236 connectionpool.py:_make_request:433 https://dev.azure.com:443 "GET /ansible/fc938e26-c865-4d62-b731-a77bbe8bf5c4/_apis/build/builds/20839/timeline/?api-version=6.0 HTTP/1.1" 200 None

From looking through the logs at other requests to dev.azure.com, it does not appear to stall even if a 403 is returned.

It may or may not be relevant, but we did have to move our scaling set to a different region a few weeks ago because instances got stuck and jobs were not running. We moved it back yesterday.

samdoran commented 3 years ago

It looks like it is hanging on the requests.get() call here.

https://github.com/ansible/ansibullbot/blob/3a51e2a365fa4c6647cf4a02fc80ca54f187f18d/ansibullbot/utils/net_tools.py#L22-L24

Hopefully setting a timeout is all that's needed.

samdoran commented 3 years ago

From looking at the logs, I believe this is the fetch() call that is getting stuck:

https://github.com/ansible/ansibullbot/blob/3a51e2a365fa4c6647cf4a02fc80ca54f187f18d/ansibullbot/ci/azp.py#L104-L105

samdoran commented 3 years ago

As another data point, I have not seen these requests hang since opening this issue yesterday.