SoftInstigate / restheart

Rapid API Development with MongoDB
https://restheart.org
GNU Affero General Public License v3.0

Concurrent DELETE /GET of a file trigger Internal Server Error #304

Closed grossmueller-espirit closed 5 years ago

grossmueller-espirit commented 6 years ago

This issue is related to issue #297. The same setup applies as described there, i.e. multiple threads / clients try to DELETE /dbname/bucketname.files/fileid.

Expected Behavior

Depending on the existence of the file, 204 or 404

Current Behavior

HTTP 500 -> Internal Server Error

Context

Multiple threads send DELETE requests to /dbname/bucketname.files/fileid within 100 ms.

Environment

RESTHeart 3.4.2, MongoDB 3.4.10

Steps to Reproduce

Precondition: save a file using PUT /dbname/bucketname.files/fileid.

1. A client checks the existence of the file with GET /dbname/bucketname.files/fileid.
2. Upon a 200 response, it sends a DELETE request to /dbname/bucketname.files/fileid; otherwise it does nothing.
3. As soon as several clients do this in parallel, one client succeeds in deleting the file and receives a 204. All others receive a 500 response instead of a 404.
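For illustration, a minimal reproduction sketch (the base URL, thread count, and the absence of authentication are placeholders; adapt them to your setup):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ConcurrentDeleteRepro {
    public static void main(String[] args) {
        // Placeholder URL matching the resource described above.
        String url = "http://localhost:8080/dbname/bucketname.files/fileid";
        HttpClient client = HttpClient.newHttpClient();
        ExecutorService pool = Executors.newFixedThreadPool(10);

        // Fire several DELETE requests at (almost) the same time.
        for (int i = 0; i < 10; i++) {
            pool.submit(() -> {
                try {
                    HttpRequest req = HttpRequest.newBuilder(URI.create(url)).DELETE().build();
                    HttpResponse<String> res = client.send(req, HttpResponse.BodyHandlers.ofString());
                    // Expected: one 204 and the rest 404. Observed: 500 for the "losing" requests.
                    System.out.println(res.statusCode());
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
    }
}
```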

I have attached log files for details.

RH_Issue_DELETE.log RH_Issue_GET.log

johnnywiller commented 6 years ago

The exception MongoGridFSException is only thrown when the file doesn't exist (i.e. it has already been deleted by a previous call). Maybe we can just surround the delete with a try/catch, ignore the exception, and return 204 (SC_NO_CONTENT) or 404 (SC_NOT_FOUND)?
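Something along these lines (just a sketch; the class and method names are illustrative, not the actual GridFsDAO code):

```java
import com.mongodb.MongoGridFSException;
import com.mongodb.client.gridfs.GridFSBucket;
import org.bson.BsonValue;

// Illustrative sketch of the proposed try/catch approach; names are hypothetical.
public class DeleteFileSketch {
    public static int deleteFileStatus(GridFSBucket bucket, BsonValue fileId) {
        try {
            bucket.delete(fileId);
            return 204; // SC_NO_CONTENT: the file existed and has been deleted
        } catch (MongoGridFSException ex) {
            // Thrown when no file matches fileId,
            // e.g. a concurrent request already deleted it.
            return 404; // SC_NOT_FOUND: the file no longer exists
        }
    }
}
```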

mkjsix commented 6 years ago

Hi @johnnywiller

I merged your pull request, but I also experimentally added a lock around the delete operation (method GridFsDAO.deleteFile) on the master branch.

Additionally, I added a log in case of successful gridFSBucket.delete(fileId), and a warning if it raises a MongoGridFSException.

This lock should greatly reduce the chance of concurrent deletions of the same file, even if it could slow things down a bit under a massive volume of deletion requests. Of course, it won't prevent concurrent access in the case of clustered RESTHeart instances, as the lock can only be local.
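For reference, the idea looks roughly like this (a simplified sketch; the class and method names are illustrative, not the actual GridFsDAO code on master):

```java
import java.util.concurrent.locks.ReentrantLock;
import com.mongodb.MongoGridFSException;
import com.mongodb.client.gridfs.GridFSBucket;
import org.bson.BsonValue;

// Simplified sketch of a local lock around the GridFS delete; not the actual implementation.
public class LockedDeleteSketch {
    // One lock per RESTHeart instance: only a single thread on this node
    // can run the delete at a time.
    private static final ReentrantLock DELETE_LOCK = new ReentrantLock();

    public static int deleteFile(GridFSBucket bucket, BsonValue fileId) {
        DELETE_LOCK.lock();
        try {
            bucket.delete(fileId);
            // log the successful deletion
            System.out.println("deleted GridFS file " + fileId);
            return 204; // SC_NO_CONTENT
        } catch (MongoGridFSException ex) {
            // warn: the file was not found, e.g. already deleted by a concurrent request
            System.err.println("GridFS file " + fileId + " not found: " + ex.getMessage());
            return 404; // SC_NOT_FOUND
        } finally {
            DELETE_LOCK.unlock();
        }
    }
}
```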

Would you mind testing it a bit and letting us know?

mkjsix commented 6 years ago

@grossmueller-espirit could you please also test this?

lenalebt commented 6 years ago

Tanja is currently on vacation, but I can at least give a bit of feedback: We're using clustered instances, so this fix will most probably not help for our use-case.

mkjsix commented 6 years ago

@lenalebt IMO that depends mostly on how your LB is configured. Even without sticky sessions enabled, it's usually very likely that a stream of requests for the same resource within such a short timeframe (the issue mentions "within 100ms") is sent by the LB to the same node in the cluster, so the fix should work properly except for some very rare edge cases. To fully rule out the possibility we would have to introduce a distributed synchronization mechanism among the cluster nodes, which looks a bit like overkill to me.

Please let me know if your tests show anything.

grossmueller-espirit commented 6 years ago

@mkjsix Back from vacation :-) I will look at this asap.

mkjsix commented 5 years ago

Closing this, please re-open in case of news.