HumanCellAtlas / data-store

Design specs and prototypes for the HCA Data Storage System (DSS, "blue box")
https://dss.staging.data.humancellatlas.org/

Put file into DSS from another cloud location without copying it into DSS Spike! #912

Closed. mikebaumann closed this issue 5 years ago.

mikebaumann commented 6 years ago

As a user of the DSS in a project other than the HCA, I need to put files into the DSS without copying them into the DSS. I have access to a tremendous volume of data at stable cloud locations that I would like to make available through the DSS, yet I cannot afford the ongoing storage cost of duplicating all that data in the DSS. My access to the source data is read-only.

The primary use case is large data files; being able to load metadata files in the same way may be useful in some cases, but it is a lower priority. A given bundle may consist of both files loaded by reference and metadata files that are copied into the DSS.

Once this data has been loaded by reference into the DSS, I would like access to the data through the DSS to be as consistent/transparent as possible. This includes get file, get bundle (with directurls), indexing, search, subscription/notification, and checkout.

Ideally the files loaded by reference would be available from all replicas, but the syncing required to achieve this probably warrants further discussion and design.

briandoconnor commented 6 years ago

I think there's a lot to think about here (specifically the sync issues this would cause). In many ways you're asking for the DSS equivalent of a symlink. It may seem like a strange use case, but for projects like TOPMed that want to control and pay for their own cloud storage, this functionality would allow us to represent the data via DSS APIs without duplicating it. We can assume the DSS service account has read access to the bucket that hosts the data.

hannes-ucsc commented 6 years ago

I don't think this should work at the file level but rather at the blob level. I would call this feature Foreign Blobs.

mikebaumann commented 6 years ago

For controlled access data, it will also be important to consider which users/processes have access to the data, and which do not.

mikebaumann commented 6 years ago

Yes, Foreign Blobs is a good internal term for the feature. This story was intended to be written from a user perspective, and users put files, not blobs.

hannes-ucsc commented 6 years ago

How do we guarantee immutability without making a copy?

I tend to think that we can't avoid at least one copy.

briandoconnor commented 6 years ago

Hey @hannes-ucsc, in the specific use case we're thinking of, it is cost-prohibitive to make any copies. TOPMed has 180K WGS (5.4 petabytes) of data that they've mirrored between buckets on AWS and GCP, so they're paying about $200K per month to provide this. They're going to give read-only IAM access to about half a dozen groups so each can "onboard" this data into their respective systems and make it available to their users. In our case that's a blue box instance with an interactive green box provided by the Broad (FireCloud).
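For scale, a rough back-of-envelope check of those numbers (the per-GB price below is an assumption about standard object-storage list pricing, not a figure quoted in this thread) lands in the same ballpark as the ~$200K/month figure, and copying the data into DSS-owned replicas would roughly double it again:

```python
# Back-of-envelope storage cost for the mirrored TOPMed dataset described above.
# The per-GB price is an assumption (approximate standard object-storage list price),
# not a figure taken from this thread.
DATASET_GB = 5_400_000        # 5.4 PB expressed in GB (decimal units)
PRICE_PER_GB_MONTH = 0.021    # assumed rough blend of S3/GCS standard storage prices

per_replica = DATASET_GB * PRICE_PER_GB_MONTH   # ~= $113,400 per month
mirrored = 2 * per_replica                      # AWS + GCP mirror ~= $226,800 per month
print(f"per replica: ${per_replica:,.0f}/mo, mirrored: ${mirrored:,.0f}/mo")
```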

Think about how you could make this work without copies, but in a generic and flexible way you all don't hate. For example, you could create files in data bundles that actually just contain the URL (and potentially other info, like checksums, to at least identify inconsistencies) and indicate to the blue box that these are "links" via a specific tag on the file. When a user requests such a file, the blue box understands links and knows how to resolve them correctly as a signed URL or the native, original path. I'm totally making this up, so devs should discuss and decide on the best approach. But this somewhat strange feature would actually be a very powerful way to get blue boxes in front of existing datasets for quite a few NIH projects (a lot of them will follow this same pattern). We're likely to see this for Anvil, HTAN, Commons, etc.
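To make the "link file" idea concrete, here is a purely illustrative sketch; the stub fields, the tag name, and the helper function are all invented for discussion, not an agreed design or the actual DSS schema:

```python
import boto3
from urllib.parse import urlparse

# Hypothetical content of a "file-by-reference" stub stored inside a bundle.
# All field names here are invented for illustration; nothing is a settled schema.
FILE_BY_REFERENCE = {
    "schema_type": "file_by_reference",  # tag telling the blue box this is a link, not a blob
    "url": "s3://topmed-source-bucket/NWD123456.cram",
    "size": 123456789,
    "checksums": {"sha256": "<sha256-of-source-object>"},  # to detect drift at the source
}

def resolve_reference(ref: dict, expires: int = 3600) -> str:
    """Resolve a file-by-reference stub into something a client can fetch,
    e.g. a presigned URL for s3:// sources (other schemes returned as-is)."""
    parsed = urlparse(ref["url"])
    if parsed.scheme == "s3":
        s3 = boto3.client("s3")
        return s3.generate_presigned_url(
            "get_object",
            Params={"Bucket": parsed.netloc, "Key": parsed.path.lstrip("/")},
            ExpiresIn=expires,
        )
    return ref["url"]  # e.g. a gs:// path could be signed similarly or returned natively
```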

hannes-ucsc commented 6 years ago

I see. I agree that it is a very useful use case.

If storage cost is the driving factor, doesn't that preclude replication of foreign blobs to other replicas?

kislyuk commented 6 years ago

Support for this feature introduces fundamental conflicts with the existing DSS architecture, its data model, data availability and integrity guarantees, and its feature roadmap.

At this point I don't think it's appropriate to develop this feature in the DSS mainline. I don't think we should prioritize this within HCA in the near term. We could speculatively consider a design that admits foreign blobs with the minimum amount of breakage, but currently I'm not sure where it fits within the scope of this project, and it would carry a serious added infrastructure and development complexity burden.

I think it would be far more fruitful to find a way to adapt DSS to other projects' needs, or find a way for other systems to present a compatible API that can be used with the same client side tools, or otherwise find a roadmap that converges the design of these other large scale data storage systems with DSS.

For these other external S3 or GS based data sources, what's the impediment to using them directly?

ttung commented 6 years ago

If this had to be built, the easiest way would be to build a new type of BlobStore/HCABlobStore that hides the remoteness of files and treats some copy operations as no-ops. Essentially nothing else would change.
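A minimal sketch of that idea, assuming a generic blobstore interface with get/copy methods; the class and method names below are illustrative and not the actual BlobStore/HCABlobStore API:

```python
# Illustrative only: a blobstore wrapper that knows some keys refer to objects in a
# foreign, read-only bucket and therefore skips copying them into the DSS bucket.
# Method names are loosely modeled on a generic blobstore, not the real API.

class ForeignAwareBlobStore:
    def __init__(self, local_store, foreign_store, foreign_prefix="ref/"):
        self.local = local_store        # DSS-owned storage
        self.foreign = foreign_store    # read-only external storage
        self.foreign_prefix = foreign_prefix

    def _is_foreign(self, key: str) -> bool:
        return key.startswith(self.foreign_prefix)

    def get(self, bucket: str, key: str) -> bytes:
        store = self.foreign if self._is_foreign(key) else self.local
        return store.get(bucket, key)

    def copy(self, src_bucket: str, src_key: str, dst_bucket: str, dst_key: str) -> None:
        # Copies whose source is a foreign blob become no-ops: the bytes stay where
        # they are and only a reference is recorded on the DSS side.
        if self._is_foreign(src_key):
            return
        self.local.copy(src_bucket, src_key, dst_bucket, dst_key)
```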

Having said that, I share @akislyuk's concerns.

bkmartinjr commented 6 years ago

@mikebaumann - can you elaborate on which aspects of the DSS you hope to leverage (vs. some other storage system, e.g., "plain old S3")? As I read the thread, you are looking to take advantage of some DSS features but not others (principally, no replication, no data integrity/versioning, etc.). Please say more about why the DSS is attractive, as opposed to simply using a bucket API. Thanks!

briandoconnor commented 6 years ago

Some unstructured ideas on why the blue box vs. just a bucket; not comprehensive, just a brain dump. I'll add more as I think of them:

mikebaumann commented 6 years ago

@bkmartinjr I view the DSS as a general-purpose, bundle-oriented storage system with versioning. The bundles provide uniquely identified and versioned sets of data and metadata, and a set of features for finding and accessing them. The features the DSS currently provides are applicable and useful to a wide variety of projects needing this type of storage. My original statement:

Once this data has been loaded by reference into the DSS, I would like access to the data through the DSS to be as consistent/transparent as possible.

was meant to convey the desire that the full set of DSS features, to the extent technically possible, would be available for bundles that include foreign reference files, just as they are for any other bundle. The features I listed were intended as representative examples of that full set, not as specific priorities or as excluding other features/capabilities.

We briefly discussed some of the issues and implications of this request after the HCA DSS stand-up today, and will follow-up with additional discussions, design options, etc. starting in a week or two.

cricketsloan commented 5 years ago

In case it is useful for you, here is the DataCommons use case for this type of feature:

Purpose: sometimes we want to provide access to data files and index the metadata that lives in other people's buckets rather than keep copies for ourselves.

Concern: the links will break.

Objective: APIs that make the interface seamless between the "real" files and the "stub" load-by-reference files.
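As a sketch of how the broken-link concern could be monitored, assuming the illustrative stub fields from earlier in this thread (none of this is an agreed API):

```python
import boto3
from botocore.exceptions import ClientError
from urllib.parse import urlparse

def verify_reference(ref: dict) -> bool:
    """Return True if a by-reference file still resolves and still matches the size
    recorded at load time. The 'url' and 'size' fields follow the illustrative stub
    sketched earlier, not a settled schema."""
    parsed = urlparse(ref["url"])
    s3 = boto3.client("s3")
    try:
        head = s3.head_object(Bucket=parsed.netloc, Key=parsed.path.lstrip("/"))
    except ClientError:
        return False  # object deleted or access revoked: the link is broken
    return head["ContentLength"] == ref["size"]
```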

kozbo commented 5 years ago

DoD: RFC with a list of stories