fedora-copr / copr

RPM build system - upstream for https://copr.fedorainfracloud.org/

Race condition 500 Internal Server Error when submitting multiple builds to a directory that has never been used #3358

Open hroncok opened 1 month ago

hroncok commented 1 month ago

This happens to me fairly regularly when I run Copr impact checks to see whether upgrading a Fedora package breaks anything. I decided to create a smaller reproducer and report it.

Using the copr CLI:

  1. create a new copr project
  2. add packages from Fedora distgit (other sources may also be impacted)
  3. submit several builds to a custom directory that has never been used yet, at the same time

Some of the builds will fail with:

Something went wrong:
Error: Response is not in JSON format, there is probably a bug in the API code.
Try 'copr-cli --debug' for more info.

Adding --debug does not reveal much:

Server response:
----------------

500 Internal Server Error

Internal Server Error
The server encountered an internal error or
misconfiguration and was unable to complete
your request.
Please contact the server administrator at 
 root@localhost to inform them of the time this error occurred,
 and the actions you performed just before this error.
More information about this error may be available
in the server error log.

Reproducer (uses moreutils-parallel):

COPR=reproducer-race
copr create $COPR --chroot fedora-rawhide-x86_64 --delete-after-days 30
copr add-package-distgit $COPR --webhook-rebuild off --commit rawhide --name dummy-test-package-gloster
parallel -j8 copr build-package $COPR:custom:1 --nowait --background --name -- dummy-test-package-gloster dummy-test-package-gloster dummy-test-package-gloster dummy-test-package-gloster dummy-test-package-gloster dummy-test-package-gloster dummy-test-package-gloster dummy-test-package-gloster

Often some of the first builds fail with an error:

Build was added to reproducer-race:
  https://copr.fedorainfracloud.org/coprs/build/...
Created builds: ...

Something went wrong:
Error: Response is not in JSON format, there is probably a bug in the API code.
Try 'copr-cli --debug' for more info.
Build was added to reproducer-race:
  https://copr.fedorainfracloud.org/coprs/build/...
Created builds: ...

If it does not happen to you, repeat with a new directory name ($COPR:custom:2, $COPR:custom:3...) until it does.

Use this to cancel the running/pending builds after you run the above in case you want to preserve resources for others:

parallel copr cancel -- $(copr list-builds --output-format text-row $COPR | cut -f1)

I hypothesize that a first build in the custom directory does something special (wrt creating the directory) and when multiple builds think they are first, they all attempt to do the special thing at the same time and some of them get an unhandled exception because of a race condition.
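
To illustrate the kind of race I mean, here is a small filesystem analogy. It is only an analogy: I have not confirmed what the Copr code actually does when the directory is first used, and the function names and the RACE_WINDOW variable are made up for this sketch. A "check, then create" sequence is two steps, so two callers can both pass the check before either one creates the thing.

```shell
# Analogy only: the real Copr code path is not shown in this report.
# All names here (create_racy, create_safe, RACE_WINDOW) are hypothetical.

# Racy: the existence check and the creation are separate steps.
create_racy() {
    if [ ! -d "$1" ]; then
        # Optional delay that widens the gap between check and create.
        sleep "${RACE_WINDOW:-0}"
        # Fails if another caller created the directory in the meantime.
        mkdir "$1"
    fi
}

# Safe: "already exists" is treated as success, so concurrent callers
# cannot trip over each other.
create_safe() {
    mkdir -p "$1"
}
```

If two create_racy calls for the same new path overlap, one of them can fail even though the directory ends up existing, which resembles some submissions failing while others succeed.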

FrostyX commented 1 month ago

Triage: Two issues to solve ... 1. Why 500? 2. Return something reasonable if 500

hroncok commented 1 month ago

In my experience, 500 happens when there is an unhandled Python exception. If the webserver runs in debug mode, the exception is shown, but if it is in production mode, it is hidden. If you have a development copr server with debug mode enabled, we could try reproducing there.

hroncok commented 1 month ago

I am looking through the code for where this could happen, and I found commit c1fa04b6b0886319b73e9c63638be55b4d53580c. If it hasn't been deployed yet, perhaps it fixes the issue.

FrostyX commented 1 month ago

Hello @hroncok, thank you for the report. The step-by-step reproducer is very much appreciated.

We decided not to prioritize this issue for the next 3 months because, although it is annoying, there should be an easy workaround. I suppose only the reproducer uses parallel, to hit the issue more easily, while your actual script submits builds one by one? If so, something like sleep 1 between calls should work around this. If I am wrong and there isn't an easy workaround, please let us know and we will prioritize this more.

hroncok commented 1 month ago

No, I use parallel to submit thousands of builds.

The workaround I use is to resubmit the failed ones later (a bit tricky to figure out which failed, but I can manage).
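
A rough sketch of that retry idea, for anyone who wants to automate it. It assumes the submitting command exits non-zero when the submission fails, which I have not verified for every copr-cli failure mode. The function name and the RETRY_DELAY variable are made up for this example.

```shell
# submit_with_retry ATTEMPTS COMMAND [ARG...]
# Hypothetical helper: runs COMMAND up to ATTEMPTS times and stops at the
# first success. Assumes COMMAND exits non-zero when the submission fails.
# RETRY_DELAY (seconds, default 1) is the pause between attempts.
submit_with_retry() {
    attempts=$1
    shift
    n=1
    until "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            return 1    # every attempt failed
        fi
        n=$((n + 1))
        sleep "${RETRY_DELAY:-1}"
    done
    return 0
}

# Example call, reusing the options from the reproducer:
#   submit_with_retry 3 copr build-package "$COPR:custom:1" \
#       --nowait --background --name dummy-test-package-gloster
```

Retrying per submission avoids having to work out afterwards which builds failed, at the cost of a short delay for the ones that hit the error.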

Another workaround is to submit the first one manually and use parallel to submit the rest after.
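
That second workaround could look roughly like this. It is a sketch, not something I run in this exact form: the function name is made up, and it uses plain background jobs instead of parallel, so the shell does not limit the number of concurrent jobs.

```shell
# submit_first_then_rest COMMAND ITEM...
# Hypothetical helper: runs COMMAND with the first ITEM alone and waits for
# it to finish, so only one request touches the new directory first. The
# remaining ITEMs are then submitted concurrently as background jobs.
# Failures among the remaining ITEMs are not reported by this sketch.
submit_first_then_rest() {
    cmd=$1
    shift
    [ "$#" -gt 0 ] || return 0
    first=$1
    shift
    "$cmd" "$first" || return 1
    for arg in "$@"; do
        "$cmd" "$arg" &
    done
    wait
}

# Example, reusing the options from the reproducer:
#   submit() { copr build-package "$COPR:custom:1" --nowait --background --name "$1"; }
#   submit_first_then_rest submit pkg-a pkg-b pkg-c
```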