antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License

Uploading artifacts from GH actions consistently fails with 503 error #4185

Open ericvergnaud opened 1 year ago

ericvergnaud commented 1 year ago

GH builds regularly fail for the two cpp targets using gcc. The error occurs not during the build/test itself, but when uploading the artifacts to GH. The error is 503 (service unavailable). Could it be that the artifact is too large (the log says 150994943 bytes)? See https://github.com/antlr/antlr4/actions/runs/4423203675/jobs/7755704404. @hs-apotell would you be able to look into this?
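(For reference, a minimal sketch of what such an upload step typically looks like in a GitHub Actions job, assuming actions/upload-artifact@v3; the step names, archive name and path are illustrative, not taken from the repository's actual workflow.)

```yaml
# Illustrative only -- not copied from the repository's workflow file.
- name: Prepare artifacts
  run: tar czf antlr-cpp-build.tgz build/    # hypothetical archive name/path

- name: Upload artifacts                     # the step that fails with 503
  uses: actions/upload-artifact@v3
  with:
    name: antlr-cpp-build                    # hypothetical artifact name
    path: antlr-cpp-build.tgz
```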

ericvergnaud commented 1 year ago

Interestingly, re-running the failed jobs succeeds, and the last artifact size in a successful build is 'only' 104579386 bytes. This shows inconsistency across builds and smells like a polluted reuse of a previous build...

ericvergnaud commented 1 year ago

Also it seems no tests are run for cpp builds... very weird

hs-apotell commented 1 year ago

Notably, these builds weren't always failing; it seems to have started happening more consistently in recent times. Has anything substantial changed in the past few weeks that could correlate with the failures?

Digging into a few failed builds, the error is not always the same either - sometimes 400, sometimes 503. But the errors are always network related, so rebuilds succeeding isn't surprising or unexpected.

Could it be that the artifact is too large (log says 150994943 bytes) ?

Size wouldn't matter here. We have other builds producing and uploading artifacts that are over 3 GB; antlr doesn't generate anywhere close to that. Also, the size of the uploaded artifact will differ from the files on disk because the upload action zips them.

This shows inconsistency across builds and smells like a polluted reuse of a previous build...

Every build runs on a pristine VM. There is no pollution. If the sizes differ across builds, then the generated file sizes on disk differ. How, why, which - those are questions we can follow up on. But VM pollution is not an issue.

Also it seems no tests are run for cpp builds... very weird

Running no tests for the cpp builds is intentional. The cpp natives are built twice, once using cmake directly (i.e. not via the java wrappers) so that compiler warnings/errors can be captured. Tests are not a concern for these builds; they are run as part of the other builds.
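(A rough sketch of the split described above, for illustration only; the job/step names, runner, and paths are assumptions, not copied from the repository's workflow.)

```yaml
# Illustrative: a "library build only" job that runs cmake directly so that
# compiler warnings/errors show up in the log; tests run in other jobs.
cpp-lib-build:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v3
    - name: Build the C++ runtime directly with CMake
      run: |
        cmake -S runtime/Cpp -B build       # path is an assumption
        cmake --build build -j 2
```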

I will investigate further to narrow down the root cause of the failure.

hs-apotell commented 1 year ago

I hope this explains it - https://github.com/actions/upload-artifact/issues/270

The failures started happening when I upgraded the specific GitHub action from v2 to v3 on 11/27/2022.

I will create a new PR with the recommended fix for the issue.

ericvergnaud commented 1 year ago

Thanks for this. Not sure I understand your comments re testing. Can you point me to a cpp job that does run the tests?

hs-apotell commented 1 year ago

All of the jobs below include tests:

https://github.com/antlr/antlr4/actions/runs/4381619032/jobs/7669863716
https://github.com/antlr/antlr4/actions/runs/4381619032/jobs/7669864720
https://github.com/antlr/antlr4/actions/runs/4381619032/jobs/7669865473

ericvergnaud commented 1 year ago

Ah, I get it now: the cpp job is for building the lib, and then the regular jobs use it for testing. And the segregation is for building with different 'flavors'... thanks.

hs-apotell commented 1 year ago

Maybe the jobs could use some renaming to drive the intent home. Any suggestions?

ericvergnaud commented 1 year ago

build-cpp-library ?
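(For illustration, one way such a rename could look in a workflow file; the trigger, runner, and everything besides the job id are assumptions.)

```yaml
# Purely illustrative rename sketch.
name: cpp library
on: [push, pull_request]
jobs:
  build-cpp-library:        # job id renamed to make the "build the lib only" intent explicit
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
```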

jimidle commented 1 year ago

If this is truly a network issue, should we not report this to GitHub?

kaby76 commented 1 year ago

I've had a ton of network errors with Github Actions in grammars-v4. It was particularly bad for the Mac servers, which I believe are sub-par hardware (but there's no /proc/cpuinfo, and arch and uname -a don't give squat). To get around all the network mess, I had to write code to do builds with retries. I also try to avoid certain times of the day with some big PRs.

(Eventually, the only thing that really, really fixed the problem was to make the builds only work on the changed grammars, so the network wasn't being pounded to death by all the simultaneous builds. I can only guess that Github probably virtualizes multiple machines on one piece of hardware, which still has only one shared network link. Your workflow spawns 33 builds!)

I looked at the code for upload-artifact. The error is raised here. Perhaps you could fork a copy, create your own "antlr-upload-archive", and employ a retry of the crappy retry. Maybe if you retry a good number, things might eventually work.

Unfortunately, the toolkit hardwires the retry count to 5, and does not offer an API to modify the value.

There was an issue somewhere in GitHub Actions that mentioned the last "chunk" was having problems. Maybe this is it? But you don't do an "ls -l *.tgz" in the "Prepare artifacts" step, so there's no way to know how big the file really is, or whether the last chunk is being sent.
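(Since the action's internal retry count can't be tuned, one workaround along the lines suggested above is to retry at the workflow level instead of forking the action: mark a first upload attempt continue-on-error and run a second attempt only if the first failed. A sketch, with assumed step ids, artifact name, and archive path, and assuming actions/upload-artifact@v3.)

```yaml
# Workflow-level retry sketch; all names/paths here are illustrative.
- name: Prepare artifacts
  run: |
    tar czf antlr-cpp-build.tgz build/
    ls -l *.tgz                              # log the real archive size, as suggested above

- name: Upload artifacts (first attempt)
  id: upload_try1
  continue-on-error: true                    # don't fail the job yet
  uses: actions/upload-artifact@v3
  with:
    name: antlr-cpp-build
    path: antlr-cpp-build.tgz

- name: Upload artifacts (retry)
  if: steps.upload_try1.outcome == 'failure' # only retry if the first attempt failed
  uses: actions/upload-artifact@v3
  with:
    name: antlr-cpp-build
    path: antlr-cpp-build.tgz
```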

hs-apotell commented 1 year ago

Yes, it is a network issue, but not a GitHub infrastructure issue. It seems to be related to the implementation of the upload-artifact action itself: it worked with the previous version but fails with the latest one. You can follow the bug report I pointed to in the upload-artifact repository. Unfortunately, this is not the only reported issue about this problem; it has been reported numerous times with no resolution.

I am unsure whether I want to fork/clone the repository and take ownership of it. I neither have the time to maintain it nor see an immediate need for it. If this continues to be a problem, there are other actions similar to this one that we can use.

I introduced a PR with the version rollback; however, that also failed with a similar problem. I will try other options to see if I can swap the action for something more reliable.
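(The rollback mentioned here would amount to pinning the action back to the older major version; a sketch, with an assumed artifact name and path.)

```yaml
# Rollback sketch: pin the upload action back to the v2 line, the version in
# use before the failures were first observed.
- name: Upload artifacts
  uses: actions/upload-artifact@v2
  with:
    name: antlr-cpp-build        # hypothetical artifact name
    path: antlr-cpp-build.tgz
```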

ericvergnaud commented 1 year ago

Since the artifacts are not necessary, how about disabling that step altogether? If people complain, we can look at a solution again.

hs-apotell commented 1 year ago

The option to continue-on-error has the same effect - ignoring the result if the upload fails.
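(For illustration, both options discussed above expressed in workflow syntax; step names, artifact name, and path are assumptions.)

```yaml
# Option 1: disable the upload step entirely.
- name: Upload artifacts
  if: false
  uses: actions/upload-artifact@v3
  with:
    name: antlr-cpp-build
    path: antlr-cpp-build.tgz

# Option 2: keep the step but ignore upload failures.
- name: Upload artifacts
  continue-on-error: true      # a failed upload no longer fails the job
  uses: actions/upload-artifact@v3
  with:
    name: antlr-cpp-build
    path: antlr-cpp-build.tgz
```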