apache / tvm

Open deep learning compiler stack for cpu, gpu and specialized accelerators
https://tvm.apache.org/
Apache License 2.0

[CI Problem] CI-arm build queue blocked #16549

Closed Liam-Sturge closed 9 months ago

Liam-Sturge commented 9 months ago

Jobs that require an Arm agent are struggling to find a Graviton-3 executor in Jenkins. Hundreds of builds are queued for ci-arm, and some of them are stuck. This appears to have started on 8th February.

Branch/PR Failing

https://github.com/apache/tvm/pull/16183 may have caused the backlog of jobs. A large number of commits were recently pushed to this PR at once, and a build was scheduled for each commit. I have attached a graph of the agent queue and allocation, which shows a big spike in queued jobs.

[Image: graviton3_executors]

Jenkins Link

Build logs on these jobs indicate that some of the builds are stuck. They have ended with an unfinished status, but remain in the queue. See this console log as one example of many:

https://ci.tlcpack.ai/job/tvm-arm/job/PR-16183/495/console

Triage

CC @lhutton1 @konturn @tqchen @yongwww

Lunderberg commented 9 months ago

Huh, that definitely wasn't intended with that PR. The large number of commits was due to rebasing onto main, where I had expected that only the last commit on the rebased branch would be built, as the intermediate commits don't require testing.

Lunderberg commented 9 months ago

This PR is also one that impacts packed_func.h, which is included by practically everything in TVM. So, not only is it re-building TVM once for each rebased commit, but they're also full rebuilds that can't benefit from ccache.

Lunderberg commented 9 months ago

Short-term, I've stopped all tasks related to #16183, and the ARM queue is recovering. The remaining tasks all belong to other PRs, and the queue is working through that backlog.

Long-term, it looks like there's a Jenkins option, disableConcurrentBuilds(abortPrevious: true) (stack overflow link, GH link), that we should enable. If there are two concurrent builds for the same PR, it would cancel the older one.
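
For reference, a minimal sketch of how that option might be set in a declarative Jenkinsfile. This is not taken from TVM's actual CI configuration: the stage name and the build step are hypothetical placeholders, and `agent any` stands in for whatever agent label the real pipeline uses.

```groovy
// Sketch only: the stage name and the shell step are placeholders,
// not TVM's real CI configuration.
pipeline {
    agent any

    options {
        // When a new build starts for the same job (for a multibranch
        // pipeline, the same branch or PR), abort the build that is
        // already running instead of letting both run.
        disableConcurrentBuilds(abortPrevious: true)
    }

    stages {
        stage('Build') {  // hypothetical stage
            steps {
                sh 'echo "build and test steps go here"'
            }
        }
    }
}
```

In a scripted pipeline, the equivalent would presumably be `properties([disableConcurrentBuilds(abortPrevious: true)])` near the top of the Jenkinsfile. Either way, the `abortPrevious` parameter needs a reasonably recent version of the Pipeline: Job plugin, so it would be worth checking what the CI server has installed before relying on it.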

Liam-Sturge commented 9 months ago

Hi @Lunderberg, I too would have expected that only the last commit on the rebased branch would have been built. Really seems odd that this isn't the default behaviour here.

Thanks for looking into it and getting the queue moving again. I agree that setting the Jenkins option disableConcurrentBuilds(abortPrevious: true) sounds like a sensible idea. For now, I am happy to close this issue as resolved.

lhutton1 commented 7 months ago

This issue also seems to have occurred on #16425.