SmartBear / git-en-boite

Web service facade for interacting with git repos on various providers - GitHub/BitBucket/GitLab etc.

Concurrent git commands on the same repo fail #211

Open · mattwynne opened 4 years ago

mattwynne commented 4 years ago

Related to #210, if you run a manual fetch immediately after a connect, you'll see a failure because the background fetch triggered by the domain rule seems to be clashing with the manual fetch.

The actual error I see is this:

[server]   <-- POST /repos/smoke-test-423955FC-68A3-4A72-B4ED-B9FCBEED12D8
[server] (node:7496) UnhandledPromiseRejectionWarning: Error: Git command `config gc.pruneExpire never` returned exit code 255:
[server] error: could not lock config file config: File exists
[server]
[server]     at GitDirectory.exec (/Users/matt/projects/git-en-boite/packages/local-git/src/git_directory.ts:27:13)
[server]     at processTicksAndRejections (internal/process/task_queues.js:97:5)
[server]     at exports.handleInit (/Users/matt/projects/git-en-boite/packages/local-git/src/handlers/handleInit.ts:9:3)
[server]     at RepoFactory.open (/Users/matt/projects/git-en-boite/packages/local-git/src/repo_factory.ts:49:5)
[server]     at Function.openGitRepo (/Users/matt/projects/git-en-boite/packages/local-git/src/dugite_git_repo.ts:17:22)
[server]     at BackgroundGitRepos.openGitRepo (/Users/matt/projects/git-en-boite/packages/local-git/src/background_git_repos.ts:64:21)
[server]     at DiskRepoIndex.find (/Users/matt/projects/git-en-boite/packages/repo-index/src/disk_repo_index.ts:20:29)
[server]     at LaBoîte.fetchFromRemote (/Users/matt/projects/git-en-boite/packages/app/src/la_boîte.ts:43:18)

So at the moment this looks like it's just caused by the config commands we run in handleInit, which we blindly run each time you open a repo folder. We can probably avoid that, but there might be other problems with other git commands once we get past this. Needs more investigation.
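
To illustrate, a minimal sketch of the avoidance: only run the one-off config commands when the repo directory hasn't been initialised yet. The handleInit/exec names echo the stack trace above, but the HEAD-file guard is an assumption, not the project's actual code.

import { promises as fs } from "fs"
import { join } from "path"

// Hypothetical guard: skip the one-off config commands if the repo directory
// already contains a HEAD file, i.e. it has been initialised before.
async function handleInit(repoPath: string, exec: (cmd: string) => Promise<void>): Promise<void> {
  const alreadyInitialised = await fs
    .access(join(repoPath, "HEAD"))
    .then(() => true)
    .catch(() => false)
  if (alreadyInitialised) return // nothing to do on subsequent opens
  await exec("init")
  await exec("config gc.pruneExpire never")
}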

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed if no further activity occurs. Thank you for your contributions 🙏.

mattwynne commented 3 years ago

We saw this again today when testing #545

mattwynne commented 3 years ago

OK I've caught this today with a new acceptance test. You can reproduce it with:

show_logs=1 yarn acceptance test:wip

Here's the error we see:

fatal: Unable to create '/private/var/folders/l9/95tdbmtd3_s0jh3m9hjt6sk80000gn/T/tmp-84916-93ii628SIl64/7265706f2d38323034333935352d383137302d343661302d613234302d316536353434313064626237/shallow.lock': File exists.

Another git process seems to be running in this repository, e.g.
an editor opened by 'git commit'. Please make sure all processes
are terminated then try again. If it still fails, a git process
may have crashed in this repository earlier:
remove the file manually to continue.

at GitDirectory.exec (/Users/matt/projects/git-en-boite/packages/local-clones/src/git_directory.ts:24:30)

mattwynne commented 3 years ago

It's interesting that it's shallow.lock in this instance. Most of what I can find about git's lock files talks about index.lock, though even for that I can't find much documentation.

mattwynne commented 3 years ago

Moved this into a branch: 211-support-concurrent-git-commands so we don't have too many WIP scenarios in the main branch.

mattwynne commented 3 years ago

I've resolved this for now by returning a 503 error with a 60-second Retry-After header. This is better than a dumb 500, and much quicker than trying to handle the concurrency issue properly. I think we should monitor it and see how much of a problem it becomes.
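
For illustration, a minimal sketch of this approach, assuming an Express-style HTTP layer (the actual server code may differ); isLockError and its pattern are assumptions, not the project's code:

import express, { NextFunction, Request, Response } from "express"

const app = express()

// Hypothetical check: treat git lock-file failures as "repo busy".
const isLockError = (err: Error): boolean =>
  /File exists|\.lock/.test(err.message)

// Error-handling middleware (registered after the routes): answer 503 with a
// Retry-After header rather than letting the error surface as a plain 500.
app.use((err: Error, req: Request, res: Response, next: NextFunction) => {
  if (isLockError(err)) {
    res.set("Retry-After", "60")
    res.status(503).send("The repo is busy; please retry")
    return
  }
  next(err)
})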

We could also do more proactive exploratory testing on it. I can imagine git throwing other errors than the ones we currently handle if we try doing several things at the same time, such as making a commit during a fetch, or fetching during a push.

mattwynne commented 3 years ago

I've had a think about the options for resolving this "properly". Largely it comes down to which layer we want to handle the problem at. Is it something for the local-clones git adapter to handle, or should the domain be aware that a repo could be busy? Or should the application itself be responsible for queuing commands for each repo, so that only one can run at a time?

One thing I'm pretty sure of is that we should avoid trying to solve this by bending the bullmq infrastructure to use one queue per repo. We tried this before (see ADR 0011) and it didn't work out well.

My hunch is that the problem could be solved by domain events, once we've implemented #242 so that any node can see all the events.

mattwynne commented 3 years ago

I wonder whether this is another instance of this same problem:

InvalidRepoUrl: warning: unable to access './config': Stale file handle
fatal: unable to access 'config': Stale file handle

    at handleRemoteError (/app/packages/local-clones/dist/handlers/handleValidateRemote.js:9:11)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at async handleValidateRemote (/app/packages/local-clones/dist/handlers/handleValidateRemote.js:12:5)
    at async handleConnect (/app/packages/local-clones/dist/handlers/handleConnect.js:6:5)
    at async Worker.processJob [as processFn] (/app/packages/local-clones/dist/background_worker_local_clones.js:123:28)
    at async Worker.processJob (/app/node_modules/bullmq/dist/classes/worker.js:234:28)
    at async Worker.run (/app/node_modules/bullmq/dist/classes/worker.js:99:34)

mattwynne commented 3 years ago

We're doing #572 first, which should cut down a lot of the problems by only doing fetches when absolutely necessary.

mattwynne commented 3 years ago

I have been thinking about this today, inspired by the GitHub Desktop blog post shared by @romaingweb on Slack.

If we want to get an exclusive lock on a Repo before working with it, I think the right place to do that would be in the InventoryOfRepos, which is where we get hold of a Repo domain model instance before calling methods to do work through our local clones.

We can change this interface to offer either an ExclusiveRepo or a ConcurrentRepo depending on which method you call, and put each of Repo's methods into one interface or the other, depending on whether we believe the operation in question needs to be done exclusively. For calls to the ExclusiveRepo we can drop a lock file on disk until the transaction has completed, and any other request for an ExclusiveRepo can block until that lock file has been removed.
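
A rough sketch of how that could look; all the names here (ConcurrentRepo, ExclusiveRepo, acquireLock, the .en-boite-lock file) are hypothetical, not taken from the codebase:

import { promises as fs } from "fs"
import { join } from "path"

// Hypothetical shapes for the two interfaces; which Repo methods belong where
// is a design decision, not something settled in this comment.
interface ConcurrentRepo {
  getRefs(): Promise<string[]>
}
interface ExclusiveRepo extends ConcurrentRepo {
  fetch(): Promise<void>
}

// Acquire the on-disk lock: the "wx" flag makes writeFile fail atomically if
// the file already exists, so concurrent callers poll until the lock is free.
async function acquireLock(repoPath: string, pollMs = 100): Promise<() => Promise<void>> {
  const lockPath = join(repoPath, ".en-boite-lock")
  for (;;) {
    try {
      await fs.writeFile(lockPath, String(process.pid), { flag: "wx" })
      // Return the release function, to be called when the transaction completes.
      return () => fs.unlink(lockPath)
    } catch (err) {
      if ((err as NodeJS.ErrnoException).code !== "EEXIST") throw err
      await new Promise(resolve => setTimeout(resolve, pollMs))
    }
  }
}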

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed if no further activity occurs. Thank you for your contributions 🙏.

mattwynne commented 3 years ago

We could use something like https://www.npmjs.com/package/redis-semaphore to hold a mutex on the repo while working on it.
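
For example, roughly like this (Mutex, acquire and release are from redis-semaphore's documented API; the withRepoMutex wrapper and the key format are hypothetical):

import Redis from "ioredis"
import { Mutex } from "redis-semaphore"

async function withRepoMutex(repoId: string, work: () => Promise<void>): Promise<void> {
  const redis = new Redis() // connects to localhost:6379 by default
  const mutex = new Mutex(redis, `repo-lock:${repoId}`, {
    lockTimeout: 60_000,    // the lock auto-expires, so a crashed worker can't wedge the repo forever
    acquireTimeout: 60_000, // give up waiting after a minute
  })
  await mutex.acquire()
  try {
    await work()
  } finally {
    await mutex.release()
  }
}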

cbohiptest commented 3 years ago

The concurrency problems no longer happen since only one container is used for workers. With the deployment on Rancher, the number of containers has been reduced to 1, which prevents parallel calls on the same repo.

The problem will need to be resolved when more containers are needed.

An ADR has been written in https://github.com/SmartBear/git-en-boite/commit/382a2d005f5ec2a87a4a9ea35e66422db0848f4a.