
colossus: add s3 compatible object storage backend #4981

mnaamani opened 1 year ago

mnaamani commented 1 year ago

Background

With the explosive growth in demand on the storage infrastructure, and as suggested on multiple occasions by the Storage lead, the storage capacity of the storage node can be scaled by storing data objects on an object store built on top of clustering technology, which allows the capacity to grow dynamically with limited disruption, e.g. AWS S3, a Ceph cluster with object storage, or S3-compatible object stores from other cloud providers.

Proposal: add support for storing data objects in an object store.
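
A minimal sketch of what a pluggable backend could look like in TypeScript (all names here are hypothetical, not from the Colossus codebase; local disk and any S3-compatible store would both implement the same interface):

```ts
// Hypothetical backend abstraction for the proposal; illustration only.
import { Readable } from 'stream'

export interface ObjectStorageBackend {
  // Durably store a data object under its on-chain object ID.
  put(objectId: string, data: Readable): Promise<void>
  // Stream an object back, e.g. to serve GET /storage-api/v1/assets/:id.
  get(objectId: string): Promise<Readable>
  // Cheap existence/size check, e.g. to answer HEAD requests.
  head(objectId: string): Promise<{ size: number } | null>
}
```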

Notes

zeeshanakram3 commented 1 year ago

Pick an S3 client package and test against multiple cloud providers. Do we want to support more than one object store at a time?

@mnaamani there is an NPM package, https://github.com/pkgcloud/pkgcloud#storage, that provides a unified interface to all/most of the object storage cloud services. Maybe we can look into this and see if it meets the requirements.
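
For reference, a rough sketch of an upload through pkgcloud's unified storage interface, following its README (the container name and credentials are placeholders, and typings may require @types/pkgcloud; untested here):

```ts
import fs from 'fs'
import pkgcloud from 'pkgcloud'

// Swapping the provider string is how pkgcloud abstracts over clouds.
const client = pkgcloud.storage.createClient({
  provider: 'amazon', // or 'google', 'azure', 'openstack', ...
  keyId: process.env.AWS_ACCESS_KEY_ID!,
  key: process.env.AWS_SECRET_ACCESS_KEY!,
  region: 'us-east-1',
})

// upload() returns a writable stream; pipe the local file into it.
const upload = client.upload({ container: 'joystream-storage', remote: 'some-object-id' })
upload.on('error', (err) => console.error('upload failed', err))
upload.on('success', (file) => console.log('stored', file.name))
fs.createReadStream('/data/some-object-id').pipe(upload)
```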

mnaamani commented 12 months ago

Adding links that might be useful for testing/development:

https://ytykhonchuk.medium.com/mock-amazon-s3-bucket-for-local-development-889440f9618e
https://github.com/localstack/localstack
https://dev.to/arifszn/minio-mock-s3-in-local-development-4ke6
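
For example, with LocalStack running locally, the AWS SDK v3 client can be pointed at the emulator; bucket name and credentials below are placeholders:

```ts
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3'

// LocalStack exposes all services on a single edge port (4566 by default).
const s3 = new S3Client({
  endpoint: 'http://localhost:4566',
  region: 'us-east-1',
  credentials: { accessKeyId: 'test', secretAccessKey: 'test' }, // dummy creds
  forcePathStyle: true, // path-style URLs; virtual-host style breaks on localhost
})

await s3.send(
  new PutObjectCommand({ Bucket: 'test-bucket', Key: 'hello.txt', Body: 'hello' })
)
```

The same client configuration (endpoint + forcePathStyle) also works against MinIO, which is one reason the official SDK covers the local-development story well.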

ignazio-bovo commented 8 months ago

I have rewritten your points, @mnaamani, to make sure I understand what you are saying.

Rationale

Colossus storage usage is reaching levels that are challenging to manage with standard retail bare-metal storage options, primarily due to the excessive storage capacity demands. The proposal suggests leveraging a cloud storage provider for hosting the joystream-storage volume, enabling operators or the Lead to set a maximum storage capacity requirement on a Colossus server.

Object Request Flow

Below is a diagram illustrating the flow for a GET /storage-api/v1/assets/X request:

```mermaid
graph LR;
    Argus[Argus] --> Colossus[(Colossus)] --StorageAPI--> CloudStorage[(CloudStorage)];
```
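
In code, that flow could look roughly like the following Express handler, with Colossus streaming the object through from the cloud store (the route shape matches the existing assets API; the bucket name and region are placeholders):

```ts
import express from 'express'
import { Readable } from 'stream'
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3'

const app = express()
const s3 = new S3Client({ region: 'us-east-1' })
const Bucket = 'joystream-storage' // placeholder bucket name

app.get('/storage-api/v1/assets/:id', async (req, res) => {
  try {
    const obj = await s3.send(new GetObjectCommand({ Bucket, Key: req.params.id }))
    if (obj.ContentLength !== undefined) res.setHeader('Content-Length', obj.ContentLength)
    if (obj.ContentType) res.setHeader('Content-Type', obj.ContentType)
    ;(obj.Body as Readable).pipe(res) // stream straight through to the client
  } catch {
    res.sendStatus(404) // assumes a missing object; real code should inspect the error
  }
})

app.listen(3333)
```

An alternative worth weighing is responding with a redirect to a presigned URL (via @aws-sdk/s3-request-presigner), which offloads egress bandwidth from Colossus at the cost of exposing the bucket endpoint and an extra round trip for the client.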

Decision Points

Caching Policy

Choice of Storage API

Storage Bucket Concept: A storage bucket is a primary container for data, files, and objects in cloud storage services.

Bucket Access for Colossus Nodes

Open Questions

kdembler commented 8 months ago

Quick thoughts:

  1. As mentioned on Discord, I think it's best to start small and only introduce a cloud archive mode first. By that I mean a node that doesn't accept uploads directly, but only syncs objects from other storage providers and then stores them in S3. I think that is the most immediate need, as it would allow us to safely reduce the replication rate, possibly greatly reducing storage costs. Then we could iterate on the full version that can further reduce cost. Unless there's significant overhead in doing those separately.
  2. For the library, I think we should be fine with the official AWS S3 SDK. There are other providers that offer S3-compatible storage, and the lib is surely well maintained and documented.
  3. We need to handle HEAD for all assets. Maybe for remote objects they could be resolved just by querying the squid, without accessing the actual file (see the sketch after this list).
  4. Something to keep in mind is minimizing the number of operations executed against S3, because each one is billed.
  5. S3 is object storage and Glacier is a tier of S3 storage. It's designed as archival storage, for objects you need to access very infrequently, with price rates adjusted for cheap long-term storage. https://aws.amazon.com/s3/storage-classes/glacier/ For that reason a node using Glacier may not want to be synced from at all, operating in a sort of write-only mode (see the sketch after this list).
  6. Regarding /data/temp - it's a store for pending uploads. Once a file is fully uploaded, it's moved to the pending folder until it's accepted on-chain. I don't think it's a mechanism for stopping multiple uploads of the same file.
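
Minimal sketches for points 3 and 5 above, using the AWS SDK v3 (bucket and function names are placeholders, not proposed API):

```ts
import { S3Client, HeadObjectCommand, PutObjectCommand } from '@aws-sdk/client-s3'

const s3 = new S3Client({ region: 'us-east-1' })
const Bucket = 'joystream-storage' // placeholder

// Point 3: answer a HEAD request for a remote object without downloading it.
// (If the squid already indexes object sizes, even this billed call could be skipped.)
export async function headAsset(objectId: string) {
  const res = await s3.send(new HeadObjectCommand({ Bucket, Key: objectId }))
  return { size: res.ContentLength, etag: res.ETag }
}

// Point 5: write directly into the Glacier storage class for archival nodes.
export async function archiveAsset(objectId: string, body: Buffer) {
  await s3.send(
    new PutObjectCommand({ Bucket, Key: objectId, Body: body, StorageClass: 'GLACIER' })
  )
}
```
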
ignazio-bovo commented 8 months ago

> By that I mean a node that doesn't accept uploads directly, but only syncs objects from other storage providers and then stores them in S3.

OK, so this means that during the syncing process, instead of downloading the assets locally, the Colossus node stores them in S3, right? And this process should cost the operator as little as possible in terms of AWS billing.

kdembler commented 8 months ago

I think the objects still need to be downloaded locally first so their hash can be verified.
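
A sketch of that sync step, using a plain SHA-256 as a stand-in for whatever content hash Colossus actually verifies (the real check is against the on-chain IPFS-style content ID); names and paths are hypothetical:

```ts
import { createHash } from 'crypto'
import { createReadStream, createWriteStream } from 'fs'
import { Readable } from 'stream'
import { pipeline } from 'stream/promises'
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3'

// SHA-256 stand-in; Colossus actually verifies an IPFS-style content hash.
function hashFile(path: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const hash = createHash('sha256')
    createReadStream(path)
      .on('error', reject)
      .on('data', (chunk) => hash.update(chunk))
      .on('end', () => resolve(hash.digest('hex')))
  })
}

// Download from a peer node into the temp dir, verify, then push to S3.
export async function syncObject(
  sourceUrl: string,
  objectId: string,
  expectedHash: string,
  s3: S3Client,
  bucket: string
): Promise<void> {
  const tmpPath = `/data/temp/${objectId}` // pending-download area
  const res = await fetch(sourceUrl)
  if (!res.ok || !res.body) throw new Error(`download failed: ${res.status}`)
  await pipeline(Readable.fromWeb(res.body as any), createWriteStream(tmpPath))
  if ((await hashFile(tmpPath)) !== expectedHash) {
    throw new Error(`hash mismatch for ${objectId}`)
  }
  await s3.send(
    new PutObjectCommand({ Bucket: bucket, Key: objectId, Body: createReadStream(tmpPath) })
  )
}
```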

ignazio-bovo commented 8 months ago

The AWS SDK is also available in TypeScript: https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/#Usage_with_TypeScript. So this means that we are first rolling out an initial version where we just sync to Glacier storage, right? @kdembler

Questions

  1. Would this also mean that we are supporting just S3 for the moment?
  2. Can this feature be optional? Let's say I am an operator and I decide not to provide an S3 bucket; then the sync just happens on the Orion local storage, right?

mnaamani commented 8 months ago

On caching policy: I'd say there shouldn't need to be any caching done in Colossus.

That said, for a current operator transitioning to S3, there may be a period where it serves objects from its current local store if they have not yet been moved to S3.
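
A sketch of that transition behaviour, with the local store checked first and S3 as the fallback (paths and names are assumptions, not the actual Colossus layout):

```ts
import fs from 'fs'
import path from 'path'

const LOCAL_STORE = '/data' // stand-in for the node's joystream-storage volume

// Returns the local path if the object hasn't been migrated to S3 yet.
export function localPathIfPresent(objectId: string): string | null {
  const p = path.join(LOCAL_STORE, objectId)
  return fs.existsSync(p) ? p : null
}

// In the GET handler: serve the local file if present, otherwise proxy the
// object from S3 as sketched earlier in the thread.
```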