hroncok opened this issue 1 month ago.
Triage: two issues to solve:
1. Why does the server return a 500?
2. Return something reasonable instead of a bare 500.
In my experience, a 500 happens when there is an unhandled Python exception. If the web server runs in debug mode, the exception is shown; in production mode, it is hidden. If you have a development Copr server with debug mode enabled, we could try to reproduce the issue there.
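The sketch below makes this mechanism concrete. It shows how a web application typically turns an unhandled exception into a 500 and how debug mode changes what the client sees. It uses only the standard WSGI calling convention and is not Copr's code. `make_app`, `call`, `failing_handler`, and the error message are hypothetical, and real frameworks do this internally with more detail.

```python
import traceback
from typing import Callable, Iterable, Tuple


def make_app(handler: Callable[[dict], str], debug: bool):
    """Wrap a hypothetical request handler in a minimal WSGI application.

    Any exception escaping `handler` is turned into an HTTP 500. The
    traceback is sent to the client only when `debug` is True.
    """
    def app(environ: dict, start_response) -> Iterable[bytes]:
        try:
            body = handler(environ)
            status = "200 OK"
        except Exception:
            status = "500 Internal Server Error"
            # Debug mode shows the exception; production mode hides it.
            body = traceback.format_exc() if debug else "Internal Server Error\n"
        start_response(status, [("Content-Type", "text/plain; charset=utf-8")])
        return [body.encode("utf-8")]

    return app


def call(app) -> Tuple[str, str]:
    """Invoke a WSGI app with an empty environ and return (status, body)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app({}, start_response))
    return captured["status"], body.decode("utf-8")


def failing_handler(environ: dict) -> str:
    # Hypothetical stand-in for a bug such as an unexpected database error.
    raise RuntimeError("unexpected state while creating the build")


print(call(make_app(failing_handler, debug=False)))
# → ('500 Internal Server Error', 'Internal Server Error\n')
```

With `debug=True`, the body contains the full traceback, including `RuntimeError`. This is the detail a debug-mode development server would reveal.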
I looked through the code for where this could happen and found commit c1fa04b6b0886319b73e9c63638be55b4d53580c. If it has not been deployed yet, perhaps it fixes the issue.
Hello @hroncok, thank you for the report. The step-by-step reproducer is very much appreciated.
We decided not to prioritize this issue for the next 3 months because, although annoying, it seems to have an easy workaround. I suppose only the reproducer uses `parallel`, to hit the issue more easily, while your actual script submits builds one by one? In that case, something like `sleep 1` between calls should work around this. If I am wrong and there is no easy workaround, please let us know and we will prioritize this higher.
No, I use parallel to submit thousands of builds.
The workaround I use is to resubmit the failed ones later (it is a bit tricky to figure out which ones failed, but I can manage).
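Here is a minimal Python sketch of this resubmit-the-failures workaround. It assumes a hypothetical `submit(item)` callable that raises when a submission fails. One example would be a function that runs the same copr-cli command as the reproducer through `subprocess.run(..., check=True)`, assuming copr-cli exits with a non-zero status when it gets the 500. Collecting the failures as they happen also avoids having to work out afterwards which submissions failed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def submit_concurrently(items: Sequence[str], submit: Callable[[str], None],
                        workers: int = 8) -> List[str]:
    """Run submit(item) for all items in parallel; return the items that raised."""
    def attempt(item: str) -> bool:
        try:
            submit(item)
            return True
        except Exception:
            return False

    with ThreadPoolExecutor(max_workers=workers) as pool:
        succeeded = list(pool.map(attempt, items))  # preserves input order
    return [item for item, ok in zip(items, succeeded) if not ok]


def submit_with_retries(items: Sequence[str], submit: Callable[[str], None],
                        rounds: int = 3, workers: int = 8) -> List[str]:
    """Resubmit only the failed items, up to `rounds` times; return what still fails."""
    pending = list(items)
    for _ in range(rounds):
        if not pending:
            break
        pending = submit_concurrently(pending, submit, workers)
    return pending
```

The retry loop assumes that a failed submission created nothing on the server. If a request can fail after the build was already created, retrying could produce duplicates.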
Another workaround is to submit the first one manually and use parallel to submit the rest after.
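The second workaround can be sketched the same way, again with a hypothetical `submit(item)` callable that raises on failure. The first item is submitted on its own. If the hypothesis below about the first build in a directory holds, this means only one request takes that path. Only after it succeeds are the remaining items submitted concurrently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def submit_first_then_parallel(items: Sequence[str],
                               submit: Callable[[str], None],
                               workers: int = 8) -> List[str]:
    """Submit items[0] alone, then the rest concurrently; return the failed items.

    If the first submission fails, nothing else is attempted, because the
    directory it was supposed to set up may not exist yet.
    """
    if not items:
        return []
    try:
        submit(items[0])  # serial: no other request can race with this one
    except Exception:
        return list(items)

    def attempt(item: str):
        try:
            submit(item)
            return None
        except Exception:
            return item

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(attempt, items[1:])
        return [item for item in results if item is not None]
```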
This happens to me fairly regularly when I run Copr impact checks to see whether an upgrade of some Fedora package breaks anything. I decided to create a smaller reproducer and report it.
Using the copr CLI:
Some of the builds will fail with:
Adding `--debug` does not reveal much:
Reproducer (uses moreutils-parallel):
Often, some of the first builds error out:
If it does not happen to you, repeat with a new directory name (`$COPR:custom:2`, `$COPR:custom:3`, ...) until it does.
Use this to cancel the running/pending builds after you run the above, if you want to free resources for others:
I hypothesize that the first build in a custom directory does something special (with respect to creating the directory). When multiple builds think they are first, they all attempt that step at the same time, and some of them hit an unhandled exception because of a race condition.
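If this hypothesis is right, the bug is a classic check-then-act race. Each request checks whether the directory exists, sees that it does not, and tries to create it, and all but one collide with a uniqueness constraint. The sketch below replays that interleaving step by step on an in-memory SQLite database, rather than with real concurrent requests. It then shows a race-tolerant get-or-create that attempts the INSERT first and falls back to a lookup on conflict. SQLite stands in for Copr's real database, and the `build_dir` table, its columns, and the directory name are hypothetical. This does not show how Copr actually stores directories.

```python
import sqlite3
from typing import Optional


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical table: one row per build directory; names must be unique.
    conn.execute(
        "CREATE TABLE build_dir (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)"
    )


def lookup(conn: sqlite3.Connection, name: str) -> Optional[int]:
    row = conn.execute("SELECT id FROM build_dir WHERE name = ?", (name,)).fetchone()
    return None if row is None else row[0]


def insert(conn: sqlite3.Connection, name: str) -> int:
    with conn:  # commit on success, roll back if the INSERT raises
        cur = conn.execute("INSERT INTO build_dir (name) VALUES (?)", (name,))
    return cur.lastrowid


def get_or_create(conn: sqlite3.Connection, name: str) -> int:
    """Race-tolerant: try the INSERT first and fall back to a lookup on conflict."""
    try:
        return insert(conn, name)
    except sqlite3.IntegrityError:
        existing = lookup(conn, name)
        if existing is None:  # the conflict was about something else; re-raise
            raise
        return existing


conn = sqlite3.connect(":memory:")
create_schema(conn)
name = "example:custom:2"  # hypothetical directory name

# Replay the race: both requests look first and both see "missing"...
a_sees_missing = lookup(conn, name) is None
b_sees_missing = lookup(conn, name) is None
# ...so both conclude they are first and both try to create the directory.
a_id = insert(conn, name)
try:
    insert(conn, name)
    b_crashed = False
except sqlite3.IntegrityError:
    b_crashed = True  # left unhandled in a web app, this surfaces as a 500

# With get_or_create, request B simply reuses the row request A created.
b_id = get_or_create(conn, name)
print(a_sees_missing, b_sees_missing, b_crashed, a_id == b_id)  # → True True True True
```

On PostgreSQL, `INSERT ... ON CONFLICT DO NOTHING` achieves the same in a single statement. A failed statement there also aborts the surrounding transaction, so the fallback lookup must run after a rollback or inside a savepoint. In this sketch, `with conn:` performs that rollback.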