cirruslabs / cirrus-ci-agent

Agent to execute Cirrus CI tasks
Mozilla Public License 2.0

Clone Hangs #358

Closed fkorotkov closed 5 months ago

fkorotkov commented 5 months ago

We've received a report from a user that the clone hangs and the task simply times out. Here is the relevant end of the logs:

Compressing objects:  93% (150/161)
Compressing objects:  94% (152/161)
Compressing objects:  95% (153/161)
Compressing objects:  96% (155/161)
Compressing objects:  97% (157/161)
Compressing objects:  98% (158/161)
Compressing objects:  99% (160/161)
Compressing objects: 100% (161/161)
Compressing objects: 100% (161/161), done.

Failed to clone: context canceled!

We should investigate the issue: see whether the recent go-git update caused it and whether we can mitigate it somehow.

edigaryev commented 5 months ago

The agent seems to be getting SIGTERM from the host where it runs:

2024/04/09 22:05:16 Captured terminated...

This in turn causes it to fail to report the status:

2024/04/09 22:05:16 Failed to report command updates: rpc error: code = Canceled desc = context canceled
[...]
2024/04/09 22:05:16 Failed to report that the agent has finished: context canceled

We could probably check whether we're still within the timeout bounds declared by response.TimeoutInSeconds and, if so, try to report these updates using a separate context that is not canceled by the signal.

The reason for the SIGTERM is unclear since the agent is not running in our infrastructure, but hopefully the fix above would allow us to reproduce this faster, as the tasks won't hang for the whole duration of response.TimeoutInSeconds.