WaterButler is a Python web application for interacting with various file storage services via a single RESTful API, developed at Center for Open Science.
[ENG-330] [OATHPIT] Throw figshare into the Oath Pit #391
Purpose
Enable and update the figshare provider for aiohttp3.
Changes
aiohttp 0.18 -> 3
The figshare provider directly calls aiohttp.request() in aiohttp 0.18 for some downloads, which no longer works with aiohttp-3.5. Instead, use the updated super().make_request(). However, this only partially fixes the download issue.
Authorization Header
To fully fix the download issue, the figshare auth header is dropped for published / public files.
As for the figshare provider: when a file has been published, the figshare download request returns a redirection URL which downloads the file directly from its S3 backend storage. The new request does not need any auth at all. More importantly, the default figshare auth header breaks the download, since the S3 API does not understand figshare's auth token. Thus, the provider must pass no_auth_header=True to inform super().make_request() to drop the header.
As for the core provider: instead of modifying how headers are built with build_headers(), simply update make_request() to drop the Authorization header from the headers it has built.
Broken Celery Task (Existing Issue on staging)
Move / copy fails for the figshare provider since __new__() in the FigshareProvider class does not accept extra arguments such as is_celery_task, which was recently added for celery tasks. The fix is simply adding the missing **kwargs. Here is the related commit: add flag to BaseProvider to mark if it's running in celery.
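A minimal sketch of the fix, assuming FigshareProvider is a factory that dispatches in __new__() to a concrete provider. The dispatch condition, the settings key, and the stand-in classes are assumptions; only __new__(), **kwargs, and is_celery_task come from the PR.

```python
class FigshareProjectProvider:
    """Stand-in for the project-level provider."""

    def __init__(self, auth, credentials, settings, **kwargs):
        self.is_celery_task = kwargs.get('is_celery_task', False)


class FigshareArticleProvider:
    """Stand-in for the article-level provider."""

    def __init__(self, auth, credentials, settings, **kwargs):
        self.is_celery_task = kwargs.get('is_celery_task', False)


class FigshareProvider:
    """Factory that picks the concrete provider in __new__()."""

    def __new__(cls, auth, credentials, settings, **kwargs):
        # Before the fix the signature had no **kwargs, so the extra
        # is_celery_task argument raised a TypeError during move / copy.
        if settings.get('container_type') == 'project':
            return FigshareProjectProvider(auth, credentials, settings, **kwargs)
        return FigshareArticleProvider(auth, credentials, settings, **kwargs)
```

Because __new__() returns an instance of a different class, Python never calls FigshareProvider's own __init__(), so forwarding **kwargs here is the only change needed.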
Download Stream Size (Existing Issue on staging)
Fixed the missing stream size when copying / moving private files for figshare.
Copying / moving private files / datasets from figshare to OSFStorage, or to any working provider including figshare itself, fails. The problem is that the downloaded stream from the source provider has a None size during copy / move, which breaks the upload-to-destination step that requires the size.
The root cause: the download response for private files via the figshare API does not provide the Content-Length header, which is used to build the response stream and to set the stream size. In contrast, the download response for published / public files via figshare's S3 backend does provide the header.
The fix is to retrieve and save the file size in advance during the metadata request, and then create the response stream using this size if the Content-Length header is not provided.
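The fallback can be sketched like this. ResponseStreamReader is a hypothetical stand-in for WB's response stream wrapper, and the helper name is made up; the sketch only illustrates preferring Content-Length and falling back to the size saved at metadata time.

```python
class ResponseStreamReader:
    """Hypothetical stand-in for WB's response stream wrapper."""

    def __init__(self, content_length, size=None):
        # Prefer the header; fall back to the size saved at metadata time.
        self.size = int(content_length) if content_length is not None else size


def build_download_stream(response_headers, metadata_size):
    # For private figshare files the API omits Content-Length, so the size
    # recorded during the earlier metadata request is used instead.
    content_length = response_headers.get('Content-Length')
    return ResponseStreamReader(content_length, size=metadata_size)
```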
Side effects
Fixed move / copy, which was broken on both staging and prod.
QA Notes
About figshare and WB-figshare
- figshare is very different from other providers due to the "article type" concept.
  - From current WB's perspective, a figshare article can be either a folder (type 3 Dataset and the deprecated type 4 fileset) or a file (all other types).
  - From figshare's perspective, any article type is a set of files.
  - Articles created directly on figshare must have a type.
  - Files created via WB become figshare articles without a type.
  - Folders created via WB become figshare articles of type 3 (Dataset).
  - For a folder article, WB shows all of its contents.
  - For a file article, WB shows only the first file of the article.
- WB supports figshare project and dataset as the OSF project root, but not figshare collection. Both cases have been tested during local regression tests. Please note that a dataset can belong to a project or stand on its own.
  - With a figshare project as the OSF project root, only one level of folders is allowed, namely articles of type 3 (Dataset).
  - With a figshare dataset as the OSF project root, only files are allowed. A WB-figshare folder / figshare dataset cannot contain another folder / dataset.
- Public / published figshare articles (folder and file) can only be read, not modified.
- Uploading a file (no type) or creating a folder (type 3 Dataset) creates a private article of the respective type, even if the file is uploaded to a published folder.
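The folder-vs-file classification above can be sketched as follows. The numeric type codes and the helper name are assumptions for illustration (in particular, the deprecated fileset type is assumed to be 4).

```python
# Assumed figshare type codes: 3 = dataset, 4 = deprecated fileset.
FOLDER_TYPES = {3, 4}


def is_wb_folder(article):
    """Return True if WB should treat this figshare article as a folder."""
    # Articles created via WB as files carry no type at all, so a missing
    # defined_type also classifies as a file.
    return article.get('defined_type') in FOLDER_TYPES
```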
Dev Tests
As usual, no comment indicates a PASS. Please test both figshare project as root and dataset as root when eligible.
- Getting metadata for a file / folder: tested along with folder listing and file rendering
- Downloading
  - Public files are downloaded using figshare's S3 API without auth
  - Private files are downloaded using the figshare API with auth
- Uploading
  - One file
  - Multiple files
  - Contiguous (< 10 MB)
  - Chunked (>= 10 MB)
- DAZ
  - Private dataset
  - Public dataset
- Deleting (not available for published / public folders and files)
  - One file
  - Multiple files
  - One folder
  - Multiple folders
- Folder
  - Creation (only with figshare project as root)
  - Upload: tested along with Uploading
  - Deletion: tested along with Deleting
- Rename files and folders: N / A, disabled by the front-end
- Verifying non-root folder access for id-based folders: not necessary but tested anyway
- Intra move / copy: N / A
- Inter move / copy (light testing only)
  - One and multiple files
  - One and multiple folders
  - From OSFStorage to figshare
  - From figshare to OSFStorage: must test both private and published
  - Within figshare (intra disabled, thus test with inter): must test both private and published
  - Trying to move a published file or folder ends up being a copy
- Comments persist with moves (light testing only)
- If enabled, test revisions: only seeing the latest, which is as expected
- Project root is storage root vs. a subfolder: not necessary but tested
- Updating a file
Extra Notes for QA Testing
At the time of writing the QA notes, prod OSF does not have the figshare article type fix, while staging1 and staging2 have just been fixed. In short:
- prod figshare may be more broken than staging1 and staging2
- feature/oathpit (i.e. staging3 after merge) figshare works almost perfectly
Coverage increased (+6.8%) to 76.117% when pulling 4f38829c8250937709bcfd0215acd2b88f5e245d on cslzchen:feature/oathpit-figshare into be3b81f989cdb44de0b213635bd9073351297cf5 on CenterForOpenScience:feature/oathpit.
Ticket
https://openscience.atlassian.net/browse/ENG-330
Deployment Notes
No