mattwynne opened this issue 4 years ago
This issue has been automatically marked as stale because it has not had any recent activity. It will be closed if no further activity occurs. Thank you for your contributions 🙏.
We saw this again today when testing #545
OK I've caught this today with a new acceptance test. You can reproduce it with:
show_logs=1 yarn acceptance test:wip
Here's the error we see:
fatal: Unable to create '/private/var/folders/l9/95tdbmtd3_s0jh3m9hjt6sk80000gn/T/tmp-84916-93ii628SIl64/7265706f2d38323034333935352d383137302d343661302d613234302d316536353434313064626237/shallow.lock': File exists.
Another git process seems to be running in this repository, e.g.
an editor opened by 'git commit'. Please make sure all processes
are terminated then try again. If it still fails, a git process
may have crashed in this repository earlier:
remove the file manually to continue.
at GitDirectory.exec (/Users/matt/projects/git-en-boite/packages/local-clones/src/git_directory.ts:24:30)
It's interesting that it's using shallow.lock in this instance. Most of what I can find about git's lock files talks about index.lock, and even that isn't well documented.
Moved this into a branch: 211-support-concurrent-git-commands so we don't have too many WIP scenarios in the main branch.
I've resolved this for now by returning a 503 error with a 60-second Retry-After header. This is better than a dumb 500 error, but much quicker than trying to handle the concurrency issue properly. I think we should monitor it and see how much of a problem it becomes.
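For reference, a minimal sketch of that stop-gap (all names here are illustrative, not the actual git-en-boite code): map a "repo busy" failure to a 503 with a Retry-After header, and let everything else fall through to a 500.

```typescript
// Hypothetical sketch: translate a "repo is busy" failure into an HTTP 503
// with a Retry-After header, rather than a generic 500. Names are
// illustrative, not taken from the real git-en-boite HTTP layer.

interface HttpResponse {
  status: number
  headers: Record<string, string>
  body: string
}

class RepoBusyError extends Error {}

function toResponse(error: Error): HttpResponse {
  if (error instanceof RepoBusyError) {
    // Tell the client to retry in 60 seconds instead of failing hard
    return {
      status: 503,
      headers: { 'Retry-After': '60' },
      body: 'Repo is busy; please retry.',
    }
  }
  return { status: 500, headers: {}, body: 'Internal server error' }
}
```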
We could also do more proactive exploratory testing on it. I can imagine git producing errors beyond the ones we currently handle if we try doing several things at the same time, such as making a commit during a fetch, or fetching during a push.
I've had a think about the options for resolving this "properly". Largely, it comes down to choosing which layer should handle the problem. Is it something for the local-clones git adapter to handle, or should the domain be aware that a repo could be busy? Or should the application itself be responsible for queuing commands for each repo, so that only one can happen at a time?
One thing I'm pretty sure of is that we should avoid trying to solve this by bending the bullmq infrastructure to use one queue per repo. We tried this before (see ADR 0011) and it didn't work out well.
My hunch is that the problem could be solved by domain events, once we've implemented #242 so that any node can see all the events.
I wonder whether this is another instance of this same problem:
InvalidRepoUrl: warning: unable to access './config': Stale file handle
fatal: unable to access 'config': Stale file handle
at handleRemoteError (/app/packages/local-clones/dist/handlers/handleValidateRemote.js:9:11)
at runMicrotasks (<anonymous>)
at processTicksAndRejections (internal/process/task_queues.js:93:5)
at async handleValidateRemote (/app/packages/local-clones/dist/handlers/handleValidateRemote.js:12:5)
at async handleConnect (/app/packages/local-clones/dist/handlers/handleConnect.js:6:5)
at async Worker.processJob [as processFn] (/app/packages/local-clones/dist/background_worker_local_clones.js:123:28)
at async Worker.processJob (/app/node_modules/bullmq/dist/classes/worker.js:234:28)
at async Worker.run (/app/node_modules/bullmq/dist/classes/worker.js:99:34)
We're doing #572 first, which should cut down a lot of the problems by only doing fetches when absolutely necessary.
I have been thinking about this today, inspired by the GitHub Desktop blog post shared by @romaingweb on Slack.
If we want to get an exclusive lock on a Repo before working with it, I think the right place to do that would be in the InventoryOfRepos, which is where we get hold of a Repo domain model instance before calling methods to do work through our local clones. We can change this interface to offer either an ExclusiveRepo or a ConcurrentRepo, depending on which method you call, and put the different methods on Repo into either interface, depending on whether we believe the operation in question needs to be done exclusively or not. For calls to the ExclusiveRepo we can drop a lock file on disk until the transaction has completed, and any other calls asking for an ExclusiveRepo can block until that lock file has been removed.
We could use something like https://www.npmjs.com/package/redis-semaphore to hold a mutex on the repo while working on it.
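A sketch of how that might look, assuming redis-semaphore's documented Mutex API (acquire/release with lockTimeout and acquireTimeout options); the key name, timeouts, and helper name are illustrative, not taken from git-en-boite:

```typescript
import Redis from 'ioredis'
import { Mutex } from 'redis-semaphore'

// Sketch only: hold a distributed mutex per repo while a git operation runs.
// The key naming scheme and timeout values below are illustrative.
const redis = new Redis(process.env.REDIS_URL)

async function withRepoMutex<T>(
  repoId: string,
  work: () => Promise<T>
): Promise<T> {
  const mutex = new Mutex(redis, `repo-lock:${repoId}`, {
    lockTimeout: 60_000, // auto-expire if a worker crashes mid-operation
    acquireTimeout: 120_000, // give up waiting after two minutes
  })
  await mutex.acquire()
  try {
    return await work()
  } finally {
    await mutex.release()
  }
}
```

Because the mutex lives in Redis rather than on the local filesystem, this would keep working once we go back to running more than one worker container.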
The concurrency problems don't happen any more, since only one container is used for workers: with the deployment on Rancher, the number of containers has been reduced to 1, which prevents parallel calls on the same repo. The problem will need to be resolved when more containers are needed.
An ADR has been written in https://github.com/SmartBear/git-en-boite/commit/382a2d005f5ec2a87a4a9ea35e66422db0848f4a.
Related to #210, if you run a manual fetch immediately after a connect, you'll see a failure because the background fetch triggered by the domain rule seems to be clashing with the manual fetch.
The actual error I see is this:
So it looks like at the moment it's just caused by the config commands we run in handleInit, which we're blindly running each time you open a repo folder. We can probably avoid this, but there might be other problems with the other git commands once we get past this. Needs more investigation.
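One cheap way to avoid that (a hypothetical sketch; the real handleInit and its settings may differ): diff the desired config against what's already in the repo, and only issue git config writes for the entries that actually changed, so opening an already-configured repo runs no commands at all.

```typescript
// Hypothetical sketch: skip redundant `git config` writes when opening a
// repo. The setting names below are illustrative, not git-en-boite's real
// config, and reading/writing would go through something like
// GitDirectory.exec in practice.

type Config = Record<string, string>

const desiredConfig: Config = {
  'gc.auto': '0',
  'user.name': 'git-en-boite',
}

// Returns only the settings that differ from the current repo config,
// i.e. the writes we actually need to perform.
function configChanges(
  current: Config,
  desired: Config = desiredConfig
): Config {
  const changes: Config = {}
  for (const [key, value] of Object.entries(desired)) {
    if (current[key] !== value) changes[key] = value
  }
  return changes
}
```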