ga4gh / data-repository-service-schemas

A repository for the schemas used for the Data Repository Service.
Apache License 2.0

locations constraints on DRS Pointer #400

Open mattions opened 9 months ago

mattions commented 9 months ago

In the CRDC Driver Project and also in BioDataCatalyst, we have a situation where the host of the data would like to provide guidance on how to use the data and where to use it.

In other words, they would like any platform downstream of the DRS server to compute on the data in certain cloud locations, which are usually the same locations where the data are hosted. The reasons for this request vary, from keeping egress costs down to keeping the data from leaving its security level.

Given that in the end DRS hands out a download URL, it would be pretty difficult to enforce this. I therefore suggest we go towards an approach where the host "suggests" the preferred way to access the data, and DRS clients accessing the data honor the request to the best of their ability.

Proposal

The proposal aims to enhance the GA4GH DRS (Data Repository Service) specification by introducing a new field that provides metadata regarding the intended usage and location constraints for data objects. This additional field will allow data providers to specify their preferences and requirements for how the data should be accessed and utilized. The proposed field will offer the following options:

  1. Cloud Exclusive (cloud_exclusive): the data object is intended for use exclusively within a cloud environment. Users are expected to access and process the data only within a cloud computing infrastructure and not outside of it; for example, the data should not be downloaded to somebody's laptop.

  2. Cloud Provider-Limited (cloud_provider_limited): the data object should not leave the cloud provider's ecosystem. Users are restricted from moving the data to external locations or platforms. It must remain within the boundaries of the specified cloud provider.

  3. Cloud Region-Limited (cloud_region_limited): the data object is restricted to a specific cloud region. Users are required to access and process the data within the designated region and are prohibited from transferring it to other geographic locations within the cloud provider's infrastructure.

By introducing this new field, data providers and administrators can communicate their data access and usage policies more effectively, ensuring that data is handled in accordance with their specific requirements. This addition not only enhances the flexibility of the DRS specification but also strengthens data governance and compliance for genomic and health-related data in cloud-based environments.
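To make the three options concrete, here is a minimal Python sketch of how a client might check whether its compute location honors a given option. The names `AccessType`, `ComputeLocation`, and `honors` are hypothetical and not part of DRS. The sketch also assumes that the region-limited option implies the same provider, which is one reading of the description above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AccessType(Enum):
    """The three options proposed above (values are assumptions)."""
    CLOUD_EXCLUSIVE = "cloud_exclusive"
    CLOUD_PROVIDER_LIMITED = "cloud_provider_limited"
    CLOUD_REGION_LIMITED = "cloud_region_limited"


@dataclass(frozen=True)
class ComputeLocation:
    """Where the client intends to process the data (hypothetical type)."""
    cloud_provider: Optional[str]  # None means "not in a cloud", e.g. a laptop
    cloud_region: Optional[str] = None


def honors(access_type: AccessType,
           wanted_provider: Optional[str],
           wanted_region: Optional[str],
           where: ComputeLocation) -> bool:
    """Return True if computing at `where` respects the provider's request."""
    if where.cloud_provider is None:
        # All three options expect the data to stay in a cloud environment.
        return False
    if access_type is AccessType.CLOUD_EXCLUSIVE:
        return True  # any cloud is acceptable
    same_provider = where.cloud_provider.lower() == (wanted_provider or "").lower()
    if access_type is AccessType.CLOUD_PROVIDER_LIMITED:
        return same_provider
    # CLOUD_REGION_LIMITED: assumed to require the same provider and region.
    return same_provider and where.cloud_region == wanted_region
```

Since the constraint is advisory, a client would most likely use a check like this to warn the user or to pick a compute location, not to block access.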

It could look like this:

{ 
  "id": "string", 
  "name": "string", 
  "self_uri": "drs://drs.example.org/314159", 
  "size": 1024, 
  "created_time": "2019-08-24T14:15:22Z", 
  "updated_time": "2019-08-24T14:15:22Z", 
  "version": "string", 
  "mime_type": "application/json", 
  "checksums": [ 
    { 
      "checksum": "string", 
      "type": "sha-256" 
    } 
  ], 
  "usage_constraints": { 
    "access_type": "cloud_exclusive", 
    "location_constraints": { 
      "cloud_provider": "AWS", 
      "cloud_region": "us-west-2" 
    } 
  } 
} 

In this structure:

This structured metadata allows data providers to clearly communicate their data access and usage policies, ensuring that users are aware of the intended constraints. It also enables data consumers to make informed decisions about how to handle and access the data. The specific values for access_type can be defined in the DRS specification, and they should correspond to the proposed usage policy options. This structure helps promote consistency and interoperability across different implementations of the DRS specification.
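As a rough illustration of the client side, the sketch below reads the proposed field from a trimmed copy of the example object above. The helper name `read_usage_constraints` is hypothetical, and the sketch assumes the field would be optional, so a missing field yields `None` values instead of an error.

```python
import json

# Trimmed copy of the example object above; only the proposed field matters here.
EXAMPLE_OBJECT = json.loads("""
{
  "id": "string",
  "self_uri": "drs://drs.example.org/314159",
  "usage_constraints": {
    "access_type": "cloud_exclusive",
    "location_constraints": {
      "cloud_provider": "AWS",
      "cloud_region": "us-west-2"
    }
  }
}
""")


def read_usage_constraints(drs_object):
    """Return the proposed constraint fields, with None for anything absent."""
    constraints = drs_object.get("usage_constraints") or {}
    location = constraints.get("location_constraints") or {}
    return {
        "access_type": constraints.get("access_type"),
        "cloud_provider": location.get("cloud_provider"),
        "cloud_region": location.get("cloud_region"),
    }
```

Treating the field as optional means objects from servers that never adopt it would still parse the same way.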

ianfore commented 9 months ago

The CRDC driven work in fasp-scripts had this use case in mind. The basic model was to use DRS to find out where the provider (CRDC, BDC, Anvil, etc) had made the data available and "go with the flow" of running compute there rather than downloading.

The guidance is in essence provided by the provider by having the DRS service tell the consumer where the data is available.

Some providers didn't enforce the expectation that the consumer would compute in place. They expected the consumer to "go with the flow". Others made their buckets "requester pays", which meant they weren't restricting where the compute happened, but the consumer would have to pay if they didn't go the provider's preferred route, which is to compute on the data in place.

If we need the addition proposed here, it would likely be better as an attribute of an access method, providing the constraints on usage in that particular location.
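A minimal sketch of what that could look like from the client side, assuming each access method carries the `usage_constraints` block from the original proposal. That per-method field is hypothetical, as are the function name and the sample `access_id` values; `type`, `region`, and `access_id` are existing access method fields.

```python
def methods_honoring_constraints(access_methods, client_provider, client_region):
    """Keep the access methods whose (hypothetical) constraints the client can honor.

    client_provider is None when the client is not running in a cloud.
    """
    usable = []
    for method in access_methods:
        constraints = method.get("usage_constraints")  # hypothetical per-method field
        if constraints is None:
            usable.append(method)  # provider stated no preference for this location
            continue
        location = constraints.get("location_constraints", {})
        kind = constraints.get("access_type")
        in_cloud = client_provider is not None
        same_provider = in_cloud and location.get("cloud_provider") == client_provider
        same_region = same_provider and location.get("cloud_region") == client_region
        if ((kind == "cloud_exclusive" and in_cloud)
                or (kind == "cloud_provider_limited" and same_provider)
                or (kind == "cloud_region_limited" and same_region)):
            usable.append(method)
    return usable


# Illustrative data only: two locations, each with its own constraint.
ACCESS_METHODS = [
    {"type": "s3", "region": "us-west-2", "access_id": "aws-west",
     "usage_constraints": {
         "access_type": "cloud_region_limited",
         "location_constraints": {"cloud_provider": "AWS", "cloud_region": "us-west-2"}}},
    {"type": "gs", "region": "us-central1", "access_id": "gcp-central",
     "usage_constraints": {
         "access_type": "cloud_provider_limited",
         "location_constraints": {"cloud_provider": "GCP"}}},
]
```

Putting the constraint on the access method lets one object be region-limited in one bucket and only provider-limited in another, which a single object-level field could not express.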

MichaelLukowski commented 9 months ago

I think that this is a valid concern for data that is being indexed by a DRS server; however, I am not sure that the GET /objects/{object_id} endpoint is the best location for the requested information. I tend to agree with @ianfore that this could be part of the access method flow. Perhaps this could be an optional field as part of OPTIONS /objects/{object_id}?
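For reference, a small sketch of how a client could build that OPTIONS request from a hostname-based DRS URI using only the Python standard library. It assumes the usual hostname-based mapping to `https://<host>/ga4gh/drs/v1/objects/<id>`, the helper name is hypothetical, and the request is only constructed, not sent. Whether the OPTIONS response would carry any usage constraint field is purely the suggestion above, so the sketch does not try to parse one.

```python
from urllib.parse import urlparse
from urllib.request import Request


def options_request_for(drs_uri: str) -> Request:
    """Build (but do not send) an OPTIONS request for a hostname-based DRS URI."""
    parsed = urlparse(drs_uri)
    if parsed.scheme != "drs" or not parsed.netloc:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri!r}")
    object_id = parsed.path.lstrip("/")
    # Assumed mapping from drs://<host>/<id> to the HTTPS endpoint.
    url = f"https://{parsed.netloc}/ga4gh/drs/v1/objects/{object_id}"
    return Request(url, method="OPTIONS")


request = options_request_for("drs://drs.example.org/314159")
```

Passing `request` to `urllib.request.urlopen` would actually send it; that step is left out because `drs.example.org` is only a placeholder host.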

briandoconnor commented 5 months ago

Trying to mock this as part of the access method, could this be informational in this way:

Some things I included here:

kanchana404 commented 5 months ago

{
  "id": "string",
  "name": "string",
  "self_uri": "drs://drs.example.org/314159",
  "size": 1024,
  "created_time": "2019-08-24T14:15:22Z",
  "updated_time": "2019-08-24T14:15:22Z",
  "version": "string",
  "mime_type": "application/json",
  "checksums": [
    {
      "checksum": "string",
      "type": "sha-256"
    }
  ],
  "usage_constraints": {
    "access_type": "cloud_exclusive",
    "location_constraints": {
      "cloud_provider": "AWS",
      "cloud_region": "us-west-2"
    }
  }
}

In this corrected version:

  1. The usage_constraints section contains the access_type and location_constraints fields.
  2. access_type specifies how the data should be accessed and used.
  3. location_constraints provides additional details such as the preferred cloud provider (cloud_provider) and desired cloud region (cloud_region).

briandoconnor commented 5 days ago

In the Cloud WS meeting on Aug 12th, 2024, we decided to simplify the feature described in Issue #400 for DRS release 1.5.

PR #407, intended for DRS release 1.5, simply adds a string "cloud" to the access response. We now include cloud, region, and type information only… there is no support for cloud or geo-location constraints, for example.

The fields we will include are:

After DRS 1.5 we can revisit how we express region, cloud, geo-location, etc. constraints in DRS, which is a much bigger issue.
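To illustrate how a client could use purely informational cloud, region, and type values, here is a hedged Python sketch that orders access methods so co-located ones come first. The key names `cloud`, `region`, and `type`, the sample values, and the function name are assumptions for illustration and are not taken from PR #407.

```python
def rank_access_methods(access_methods, client_cloud, client_region):
    """Order access methods so those closest to the client's compute come first.

    Nothing is filtered out: the information is advisory, not a constraint.
    """
    def closeness(method):
        same_cloud = method.get("cloud") == client_cloud
        same_region = same_cloud and method.get("region") == client_region
        return (same_region, same_cloud)

    # sorted() is stable, so methods that tie keep the server's original order.
    return sorted(access_methods, key=closeness, reverse=True)


# Illustrative data only.
ACCESS_METHODS = [
    {"type": "gs", "cloud": "gcp", "region": "us-central1"},
    {"type": "s3", "cloud": "aws", "region": "us-east-1"},
    {"type": "s3", "cloud": "aws", "region": "us-west-2"},
]

ranked = rank_access_methods(ACCESS_METHODS, "aws", "us-west-2")
```

A client running in AWS us-west-2 would try the us-west-2 copy first, then the other AWS copy, and only then the copy in another cloud.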