ga4gh / data-repository-service-schemas

A repository for the schemas used for the Data Repository Service.

Support for data in cold storage #395

Open ianfore opened 1 year ago

ianfore commented 1 year ago

It has been suggested that we work on support for cold storage in the DRS specification. Ahead of submitting a pull request for this, it seems worth laying out some of the considerations for supporting cold storage. Not all of them need be accommodated, but it helps to have an idea of the overall landscape.

'Cold' storage is shorthand for the situation where an object is not immediately available to a 'get a URL' DRS request (a request of the form /objects/{object_id}/access/{access_id}). Hot/cold as framed here is binary. The different storage tiers offered by many providers have more subtle gradations of availability. However, for this discussion we will assume that hot/cold is sufficient, unless someone suggests it needs to be more complex.

For example, something like the following pseudo-specification is likely needed. The first and third access methods are hot; the second is cold.

{
   "access_methods": [
      {
         "access_id": "e93724",
         "region": "ncbi",
         "type": "https"
      },
      {
         "access_id": "fbd466",
         "region": "gs.US",
         "type": "https",
         "storage": "cold"
      },
      {
         "access_id": "0da151",
         "region": "s3.us-east-1",
         "type": "https"
      }
   ],
   "checksums": [
      {
         "checksum": "044e759c2e430c3db049392b181f6f5a",
         "type": "md5"
      }
   ],
   "created_time": "2022-06-03T14:07:27Z",
   "id": "044e759c2e430c3db049392b181f6f5a",
   "name": "SRR000066.lite",
   "self_url": "drs://locate.ncbi.nlm.nih.gov/044e759c2e430c3db049392b181f6f5a",
   "size": 118588310
}
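
For context, the 'get a URL' step resolves one of these access methods via GET /ga4gh/drs/v1/objects/{object_id}/access/{access_id}. For a hot method this returns a standard DRS AccessURL, sketched below with a placeholder URL; the open question is what that same call should return for the cold method fbd466.

{
   "url": "https://example.org/sra/SRR000066.lite",
   "headers": [
      "Authorization: Basic Z2E0Z2g6ZHJz"
   ]
}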
mattions commented 1 year ago

Hi @ianfore, thanks for raising this.

I think it would be important to have it there, given that we already have use-cases where the data are in cold storage, and it would be a nice way to differentiate between hot and cold.

So maybe it would be good to approach this in two phases:

Phase 1 -- signal what is in cold storage versus what is in hot storage, as you proposed, and build a PR around this one.
Phase 2 -- move from cold to hot. This is the one I would slate for later.

For example, in our use-case with SRA, once we discover a DRS link is in cold storage, we need to use the SRA cold storage API, and that requires:

dbGaP does not want to bear the cost of having the data stored in hot format, therefore there will not be any automatic transformation from a cold to a hot state within the DRS.

That's why I think two phases. Phase 1 is useful and well scoped, and we can make a PR for 1.4, given that it is not breaking.

What do you think?

ianfore commented 1 year ago

For reference, the cold data storage mechanism used for SRA and dbGaP that @mattions is referring to is the Cloud Data Delivery Service (CDDS, https://www.ncbi.nlm.nih.gov/sra/docs/data-delivery/). It would help to bring the use cases and the reality of implementation to the discussion. That would establish the lay of the land for those interested in DRS. With everyone informed about that, we could address phasing.

The simple example I put above could indeed cover it, and would be an improvement on the status code used now when data is in cold storage. However, building some group understanding of the factors involved in requesting, and responding to requests for, thawing of data seems sensible.

Note: the discussion on the CDDS page has the context of using the SRA Run Selector for a discovery step. The API allows for a separation such that platforms and/or users can substitute discovery as appropriate to their working environment or use case. This approach is the essence of a Data Fabric.

MichaelLukowski commented 1 year ago

Hey @ianfore and @mattions I think that this is a great addition to the DRS spec.

I agree with @mattions that two phases might be a good idea, just to support cold storage in DRS with a simple spec change first. Moving from cold to hot, and how the spec would handle notifying a person that their data is ready, is another challenge.

I think a session at the plenary would be a good place to get thoughts and inputs from others who are actively thinking about cold storage.

ianfore commented 1 year ago

Thinking about specification vs implementation: could the sense of challenge presented by cold-to-hot be more in the implementation? The specification of the interface to request it, and of how to handle the response, is likely simpler. The discussion here is about the specification rather than the implementation.

We'd also like to make sure the interfaces we provide to the dbGaP implementation work for the community. Using a standard (DRS) helps with that.

mattions commented 1 year ago

I agree, but at the moment the major obstacle I see is how to handle the transfer from cold to hot, given the inherent variability.

AWS, for example, wants a certain amount of information, GCP wants other information, and while this may become more uniform down the line, I think it would be a great benefit to sort out what's cold and what's hot, and then let the implementers and the DRS providers figure out how to handle that part.

Once we have enough use-cases and we see a pattern, we can then maybe extend the spec and see how to get that in as well.

NavidZ commented 11 months ago

I'm confused by considering a "storage" to be just cold or hot. For example, in today's GCP offerings there are multiple storage classes, all of which are accessible with the same API with similar latencies; they just have different costs for storage space vs access to the data. So with the traditional view of cold storage, one might argue that GCP doesn't offer any cold storage at all. The same goes for AWS's S3 offering and its different storage classes. However, AWS also offers another way of accessing the data called Flexible Retrieval (in Glacier), which is more in line with CDDS in the sense that your job is batched and completes in a couple of hours or so. So speaking of these access methods: why aren't we looking at a storage solution that requires this "submit batch job" way of accessing the data as another protocol for accessing the data (similar to, say, ftp:// vs gs://), as opposed to an additional field on the access method that can only be hot or cold and also doesn't seem quite independent of the protocol?
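
To illustrate that alternative (purely hypothetical; the type value below is made up and not in the spec), a batched-retrieval store could be modelled as its own access method type rather than as a hot/cold flag on an https method:

{
   "access_methods": [
      {
         "access_id": "fbd466",
         "region": "us-east-1",
         "type": "s3+batch"   <-- hypothetical type: retrieval requires submitting a batch job and waiting
      }
   ]
}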

Also, on the question of a data provider footing the bill for possibly both hot and cold storage: one could argue that a request could be made to move the data from any particular storage class to any other (as in the example above), and not necessarily to the "hottest" (typically the most expensive for storage space and the cheapest/fastest for data access). I assume it would then be up to the provider to decide whether they accept the request, and whether they also provide this new copy to other users. So in the case of GCS, even if we consider all the storage classes hot (because they all have the same minimal latency for access), we could imagine a user still wanting to ask the provider to move the data to a storage class with lower data access cost, if the provider could offer that. But going back to the original confusion: putting it simply as cold vs hot seems a bit of an oversimplification of what different clouds actually provide.

mattions commented 11 months ago

@NavidZ I understand where you are coming from, but we do not want to map the cold storage system onto DRS.

For Phase 1, we just want to answer this question: can the user retrieve the file at the moment of the request?

If the user can do that, then the file is in "hot" storage and can be used right away. If the user cannot retrieve the file, then the file is simply not available: it is in some form of "cold" status.

Maybe, when we tackle this in phase 2, we can think about having a better way to provide details on the storage, the expected time for thawing, and all the other bits.
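
For example (a hypothetical sketch of what a phase 2 shape could look like; expected_thaw_time is an illustrative field, not proposed spec), the cold access method from the first example could carry those details:

{
   "access_id": "fbd466",
   "region": "gs.US",
   "type": "https",
   "storage": "cold",
   "expected_thaw_time": "PT12H"   <-- hypothetical field: ISO 8601 duration until the data is expected to be usable
}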

bcli4d commented 11 months ago

As I've previously stated, I'm concerned about the performance implications of reporting storage state in the access method. How can the DRS server know the storage state of an object except by querying the storage server on which the object resides? If the DRS server has to make such a request for each object in a bulk request, that has to be a significant hit.

ianfore commented 10 months ago

@bcli4d - Bill, in the NCBI case we know where we have stored the data and keep that in a database. The server is effectively querying that already, and performance seems fine. It's just that cold data is currently reported as an error rather than with the response above.

briandoconnor commented 6 months ago

Great to see this is moving forward!!

For the NHLBI BDC driver project, we would love to see the ability to say availability="immediate"|"delayed" in a DRS response.

The NHLBI group would like to store some data in cold storage and needs a way to provide that availability information to our workflow systems (Terra/SBG) so they can present possible delays to end users and handle them appropriately.
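
A minimal sketch of that ask, reusing the cold access method from the first example above (the availability field and its values are the suggestion here, not part of the current spec):

{
   "access_id": "fbd466",
   "region": "gs.US",
   "type": "https",
   "availability": "delayed"
}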

dglazer commented 6 months ago

Summarizing some discussion from today's Cloud WS call, where we realized that "cold" wasn't a precise enough term:

briandoconnor commented 6 months ago

See https://docs.google.com/document/d/1hayvWLIoymomPI9oXcaTZirn5YxFv1cYAs70zyvlvnA/edit#bookmark=id.stgpsb1mhzwn for a Cloud WS session where this issue was discussed in detail.

briandoconnor commented 5 months ago

Could this look like the following?

  "access_methods": [
  {
  "type": "s3",
  "access_url": {
  "url": "string",
  "headers": "Authorization: Basic Z2E0Z2g6ZHJz"
  },
  "access_id": "string",
  "region": "us-east-1",
  "accessibility": {
    "status": "available" or "unavailable" or "delayed",   <-- so this would let us so available now, not available at all (e.g. removed), or if this is unavailable instantly but is requestable 
    "delay": "Xms"  <-- this would let you specify how long the request might take 
  }
  "authorizations": {
  "drs_object_id": "string",
  "supported_types": [
  "None"
  ],
  "passport_auth_issuers": [
  "string"
  ],
  "bearer_auth_issuers": [
  "string"
  ]
  }
  }
  ],
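
On a later poll, once the data has been thawed, the accessibility block in the same response might then read (continuing the hypothetical sketch above):

  "accessibility": {
    "status": "available"
  }

A workflow system could surface the advertised delay to end users and retry the request after that interval, which is the Terra/SBG behaviour described above.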