Refactor batch download to avoid using local disk

jsjiang commented 10 months ago

Refactor batch download to avoid using local disk. Possible solutions:

store the file on S3 and generate a URL for download
generate a downloadable file on the fly

Tasks

[x] Obtain S3 access from EZID instances
[x] Create S3 buckets for batch download
[x] Modify EZID API to use S3 for batch download
[x] Modify EZID UI to use S3 for batch download
[x] Investigate if the "Download report in CSV format" in the Dashboard page requires refactoring
[x] Code review
[x] Test
[x] Deploy

jsjiang commented 10 months ago

Related to ticket #99

jsjiang commented 8 months ago

Current workflow:

the user sends a download request via the download_request API
the download_request API adds a request to the download queue
the download_request API returns to the user a URL pointing to the downloadable file (not yet created)
the asynchronous download queue generates the report and puts it in a location that matches the downloadable URL
the user retrieves the report through the downloadable URL
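The workflow above can be sketched in a few lines to show its key property: the URL is handed back before the report exists. This is only an illustrative model, not the EZID code. The in-memory queue, the `storage` dict, the placeholder host, and all function names are assumptions made up for the example.

```python
import queue
import secrets

BASE_URL = "https://ezid.example.org/download"  # placeholder host, not the real service

download_queue = queue.Queue()  # stands in for the asynchronous download queue
storage = {}                    # stands in for wherever finished reports are kept


def download_request(user, fmt="csv"):
    """Enqueue a report job and return the URL where the report will appear."""
    filename = f"{secrets.token_urlsafe(12)}.{fmt}.gz"
    download_queue.put({"user": user, "filename": filename})
    return f"{BASE_URL}/{filename}"


def process_one():
    """Worker step: build the report for the next job and store it under its filename."""
    job = download_queue.get()
    storage[job["filename"]] = f"report for {job['user']}".encode()
    download_queue.task_done()


def retrieve(url):
    """Return the stored report for a URL, or None if it is not ready yet."""
    return storage.get(url.rsplit("/", 1)[-1])
```

Because the filename is chosen at request time, the storage location behind the URL can change (local disk, S3) without changing what the user receives.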

jsjiang commented 8 months ago

Solution option 1, "store the file on S3 and generate a URL for download", requires minimal code changes and is a good candidate for the first round of refactoring, since it leaves the queue services unchanged.
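A minimal sketch of the "store file on S3" half of option 1, assuming the report can be built in memory. It is not the actual EZID change. The function names, bucket, and key are hypothetical; `s3_client` is assumed to be a boto3 S3 client (for example `boto3.client("s3")`), whose `upload_fileobj(Fileobj, Bucket, Key)` method accepts a file-like object. Generating the download URL is a separate step not shown here.

```python
import csv
import gzip
import io


def build_csv_gz(header, rows):
    """Return a gzip-compressed CSV as bytes, built entirely in memory."""
    text = io.StringIO()
    writer = csv.writer(text)
    writer.writerow(header)
    writer.writerows(rows)
    return gzip.compress(text.getvalue().encode("utf-8"))


def upload_report(s3_client, bucket, key, header, rows):
    """Upload the report to S3 without writing a local file."""
    body = io.BytesIO(build_csv_gz(header, rows))
    # upload_fileobj streams a binary file-like object to the given bucket/key.
    s3_client.upload_fileobj(body, bucket, key)
    return key
```

Passing the client in as a parameter keeps the function easy to test with a stand-in object. For very large reports, an in-memory buffer may not be appropriate and a streaming approach would be needed instead.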

jsjiang commented 8 months ago

subtasks:

552

jsjiang commented 8 months ago

Investigate if the "Download report in CSV format" in the Dashboard page requires refactoring.

jsjiang commented 8 months ago

The "Download report in CSV format" link calls the "impl.ui_admin.csvStats" function, which uses "io.StringIO" and "django.http.HttpResponse". This workflow does not save a file on local disk and does not require refactoring for the S3 implementation.

    f = io.StringIO()
    w = csv.writer(f)

    return impl.ui_common.csvResponse(f.getvalue(), fn)

def csvResponse(message, filename):
    r = django.http.HttpResponse(message, content_type="text/csv")
    r["Content-Disposition"] = 'attachment; filename="' + filename + '.csv"'
    return r

jsjiang commented 8 months ago

Batch download API:

url="https://ezid.cdlib.org/download_request" => impl.api.batchDownloadRequest => impl.download.enqueueRequest(user, request)

Batch download UI function:

url="https://ezid.cdlib.org/manage/download_confirm" => impl.ui_manage.download => impl.download.enqueueRequest(user, request)

jsjiang commented 8 months ago

Email notification after request has been submitted:

Thank you for using EZID to easily create and manage your identifiers. The batch download you requested is available at:

https://ezid-stg.cdlib.org/download/Cn8XypqbcfvFRHdr.csv.gz

The download will be deleted in 1 week.

Best,
EZID Team

This is an automated email. Please do not reply.

jsjiang commented 8 months ago

The downloadable file, for example https://ezid-stg.cdlib.org/download/Cn8XypqbcfvFRHdr.csv.gz, is managed by the web server, which gives public access to the downloaded file.

One S3 solution option is to create a new API that resolves this link to a presigned URL: https://ezid-stg.cdlib.org/download/Cn8XypqbcfvFRHdr.csv.gz => define a presigned URL for the file (Cn8XypqbcfvFRHdr.csv.gz) saved in S3 => return the URL for the user to download the file
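A rough sketch of such a resolver, under stated assumptions and not taken from the EZID code base. The function name, the filename pattern (inferred only from the example link), and the one-hour expiry are all hypothetical. `s3_client` is assumed to be a boto3 S3 client, whose `generate_presigned_url` method takes the client method name, `Params`, and `ExpiresIn`.

```python
import re

# Assumed filename shape, based only on the example link: alphanumeric token + ".csv.gz".
FILENAME_RE = re.compile(r"[A-Za-z0-9]+\.csv\.gz")


def resolve_download(path, s3_client, bucket, expires_in=3600):
    """Map a /download/<filename> path to a presigned S3 URL, or None if invalid."""
    filename = path.rsplit("/", 1)[-1]
    if not FILENAME_RE.fullmatch(filename):
        return None  # reject anything that does not look like a report filename
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": filename},
        ExpiresIn=expires_in,
    )
```

In a Django view, the returned URL could be wrapped in `django.http.HttpResponseRedirect`, so the link sent in the notification email stays the same while the file itself is served from S3.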

jsjiang commented 7 months ago

Created release tag V3.2.5
Deployed and tested on ezid-stg
Deployed and verified on ezid-prd (2/22)

CDLUC3 / ezid

Refactor batch download to avoid using local disk #521
