CenterForOpenScience / osf.io

Facilitating Open Science
https://osf.io
Apache License 2.0
665 stars 327 forks source link

Concurrent access to a Google Drive folder via the API may result in only 502 Bad Gateway being returned #10661

Open yacchin1205 opened 1 month ago

yacchin1205 commented 1 month ago

Accessing a Google Drive folder via the API at the same time may cause it to only return a 502 Bad Gateway afterwards.

Reproduction steps:

  1. Create a project in osf.io.
  2. Attach Google Drive to the created project.
  3. Create a folder (in this case MultipleObjectTest) under Google Drive.
  4. Acquire a Personal Access Token from the user's Settings.
  5. Execute the following commands from the user's shell.
$ export OSF_TOKEN=(Personal Access Token obtained.)
$ export NODE_ID=(GUID of the created project)
# Sending requests to the API in three parallel jobs.
$ seq 5 | xargs -P 3 -I{} curl -H "Authorization: Bearer ${OSF_TOKEN}" https://api.osf.io/v2/nodes/${NODE_ID}/files/googledrive/MultipleObjectTest/

After a few runs, it only returns a 502 Bad Gateway.

$ seq 5 | xargs -P 3 -I{} curl -H "Authorization: Bearer ${OSF_TOKEN}" https://api.osf.io/v2/nodes/${NODE_ID}/files/googledrive/MultipleObjectTest/
<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx</center>
</body>
</html>
<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx</center>
</body>
</html>
...

From that point afterwards, a 502 Bad Gateway will be returned for a single request.

When I tried the same test for a container launched by docker-compose as a development environment, the following logs were output.

...
  File "./api/nodes/views.py", line 1199, in get_default_queryset
    file_ids = [f.id for f in self.bulk_get_file_nodes_from_wb_resp(files_list)]
  File "./api/base/views.py", line 642, in bulk_get_file_nodes_from_wb_resp
    file_obj = base_class.objects.get(target_object_id=node.id, target_content_type=content_type, _path='/' + attrs['path'].lstrip('/'))
  File "/usr/lib/python3.6/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/django/db/models/query.py", line 384, in get
    (self.model._meta.object_name, num)
addons.googledrive.models.MultipleObjectsReturned: get() returned more than one GoogleDriveFile -- it returned 2!

It appears that multiple BaseFileNodes have been created, which would normally be expected to be one for a path.

yacchin1205 commented 1 month ago

The BaseFileNode is supposed to be created by get_or_create, but this process may create several BaseFileNodes with the same parameters if they are accessed concurrently.

To avoid creating duplicate objects, it is necessary to add a UNIQUE constraint. In the past, there was a UNIQUE INDEX for BaseFileNode, but it is not found in the current code...

I am also aware that some Storage Providers may have files that can have the same name... For example, Dataverse https://github.com/CenterForOpenScience/osf.io/pull/9959 Similarly, it should be considered that Google Drive can have different file nodes with the same name. https://github.com/CenterForOpenScience/osf.io/pull/8835 <= Is this Pull Request not reflected?

I will attempt to fix it... So, what should be the approach to attempting to fix it?