DataverseNO / dataverse

Open source research data repository software
http://dataverse.org

Direct Upload and Download Support #1

Open philippconzett opened 3 years ago

philippconzett commented 3 years ago

We want DataverseNO to support direct upload and download of (larger) files.

Preferably by using Dataverse's support for S3 Direct Upload and Download as described in the Dataverse Developer Guide.

If S3-compliant support is not possible or does not work as efficiently as with AWS or Google, we might consider developing similar support for MS Azure Blob Storage.

See also notes from 2021-07-27 Dataverse Project Community Call about Large File Support through S3.
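For context, Dataverse's S3 direct upload/download works by handing the client pre-signed URLs, so file bytes flow straight between the client and the object store instead of through the Dataverse application server. A minimal sketch of that pre-signed-URL mechanism with boto3 (bucket name and key below are hypothetical; in Dataverse the server issues such URLs itself):

```python
import boto3

# Hypothetical bucket and object key for illustration only.
s3 = boto3.client("s3")
bucket, key = "dataverseno-files", "10.18710/XXXXX/mydata.zip"

# Pre-signed PUT URL: the client uploads directly to the object store.
upload_url = s3.generate_presigned_url(
    "put_object", Params={"Bucket": bucket, "Key": key}, ExpiresIn=3600
)

# Pre-signed GET URL: the client downloads directly from the object store.
download_url = s3.generate_presigned_url(
    "get_object", Params={"Bucket": bucket, "Key": key}, ExpiresIn=3600
)
print(upload_url, download_url, sep="\n")
```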

philippconzett commented 3 years ago

MS Azure offers three access tiers:

  1. Hot access tier, optimized for storing data that is accessed frequently, for example, images for your website.
  2. Cool access tier, optimized for data that is infrequently accessed and stored for at least 30 days, for example, invoices for your customers.
  3. Archive access tier, appropriate for data that is rarely accessed and stored for at least 180 days with flexible latency requirements, for example, long term backups.

I guess for the purpose of DataverseNO, where files may be accessed quite often, only option 1 is appropriate?

Louis-wr commented 3 years ago

Currently, DataverseNO only has hot storage. Cool and Archive have significant data retrieval costs, so it would be unwise to use them for publicly accessible data. Also, Archive involves a 15-hour delay to retrieve the data.
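For illustration, the access tier is simply a property set per blob (or as the account default). A small sketch with the azure-storage-blob Python SDK, assuming a hypothetical account URL, container, and blob name:

```python
from azure.storage.blob import BlobClient, StandardBlobTier

# Hypothetical account URL, container, blob name, and credential.
blob = BlobClient(
    account_url="https://dataversenostore.blob.core.windows.net",
    container_name="dataverseno-files",
    blob_name="dataset1/file.csv",
    credential="<account-key>",
)

# Keep publicly accessible data in the Hot tier; Cool/Archive add retrieval cost and latency.
blob.set_standard_blob_tier(StandardBlobTier.Hot)
```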

philippconzett commented 2 years ago

@Louis-wr @rolfa015 What is the "thing" we are going to use in Azure to store the DataverseNO data files: a blob container, a block blob, a page blob, or something different? And what is the maximum size we'll be able to store in that "thing"? Cf. https://docs.microsoft.com/en-us/azure/storage/blobs/scalability-targets.

rolfande commented 2 years ago

@philippconzett We will be using Azure Blob Storage, and data are stored as block blobs. An Azure Storage Account (think of it as the storage server) can consist of several blob containers (think of them as volumes), and a blob container can consist of several blobs (think of them as files and directories). The limitations are:

  1. Each block blob can be a maximum of 50,000 × 4000 MiB (approximately 190.7 TiB).
  2. A blob container can be at most the same size as the storage account.
  3. A storage account can be a maximum of 5 PB.

I understand that we will enable NFS on the storage service, so these issues are also relevant: https://docs.microsoft.com/en-us/azure/storage/blobs/network-file-system-protocol-known-issues
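To make the account → container → blob hierarchy concrete, here is a sketch with the azure-storage-blob SDK (account URL, credential, container name, and blob name are hypothetical). `upload_blob` creates a block blob by default, and slashes in blob names let clients present them as directories:

```python
from azure.storage.blob import BlobServiceClient

# Storage account ~ the "server"; containers ~ "volumes"; blobs ~ "files".
# The account URL and credential below are hypothetical placeholders.
service = BlobServiceClient(
    account_url="https://dataversenostore.blob.core.windows.net",
    credential="<account-key>",
)

container = service.get_container_client("dataverseno-files")
container.create_container()  # one of potentially many containers in the account

# Blob names may contain "/" so tools can show them as a folder hierarchy.
with open("file.csv", "rb") as data:
    container.upload_blob(name="10.18710/XXXXX/data/file.csv", data=data)
```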

philippconzett commented 2 years ago

@rolfa015 Thanks! Do we need the NFS functionality given that MinIO makes our storage S3-compliant?

rolfande commented 2 years ago

@philippconzett If we stick to S3 also for "normal" storage, then we don't need NFS. But we may need it for the storage connected to the Postgres DB, if we don't go for Postgres DB as a service (outside the Docker VM).
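For reference, S3-compliance via MinIO simply means any standard S3 client can talk to our storage by pointing it at the MinIO endpoint. A minimal sketch with boto3 (endpoint and credentials are hypothetical placeholders):

```python
import boto3
from botocore.client import Config

# Hypothetical MinIO endpoint and credentials for the DataverseNO storage.
s3 = boto3.client(
    "s3",
    endpoint_url="https://minio.dataverse.no",
    aws_access_key_id="<access-key>",
    aws_secret_access_key="<secret-key>",
    config=Config(signature_version="s3v4"),
)
print([b["Name"] for b in s3.list_buckets()["Buckets"]])
```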

philippconzett commented 2 years ago

Update: We will be using UiO's Cloudian storage for the entire DataverseNO. Question: Should we create separate storage buckets for each collection?

Update 2: DataverseNO will hire GDCC to develop support for preserving folder structure also with direct upload to S3. This is estimated to be in place by the end of September.
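For context on "preserving folder structure": S3-style stores (including Cloudian) have no real directories, so keeping a dataset's folder layout means encoding the relative path in the object key. A hedged sketch with boto3 (bucket name, DOI prefix, and local paths are hypothetical):

```python
import boto3

s3 = boto3.client("s3")
bucket = "dataverseno-collection-x"  # hypothetical per-collection bucket

# The "folder" structure lives entirely in the object key.
s3.upload_file(
    "data/raw/measurements.csv",
    bucket,
    "doi-10-18710-XXXXX/data/raw/measurements.csv",
)

# Listing by prefix recovers the directory view.
resp = s3.list_objects_v2(Bucket=bucket, Prefix="doi-10-18710-XXXXX/data/")
for obj in resp.get("Contents", []):
    print(obj["Key"])
```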