ga4gh-beacon / specification

GA4GH Beacon specification.
Apache License 2.0

Handover method for data access #114

Closed mbaudis closed 5 years ago

mbaudis commented 6 years ago

As discussed previously, and with a spot assigned on the Beacon roadmap, there is general consensus about the need to specify a handoff protocol. The arguments supporting this development can be summarized as:

As a starting point for discussions on the merits of this concept and how to implement it, we have prototyped a very basic version of a handoff concept (without a proper authentication procedure):

Beacon+ query => internal matched variants ==> internal retrieved callset ids ===> internal storage of callset ids in record in tmp database ====> external delivery of BeaconDatasetAlleleResponse + info.callset_access_handle

Data access => callset_access_handle is submitted to authentication system ==> authentication procedure + fwd of callset_access_handle ===> data retrieval options based on authentication status

As part of the Beacon specifications, it should probably be sufficient to define an attribute name/format for the access handle; authentication etc. would be for demonstrators, "discovery" product ... but probably out of scope for the Beacon protocol itself (?).
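The first chain above (query, matched callsets, temporary storage, handle in the response) could be sketched roughly as follows. This is only an illustration, not the Beacon+ code: `TMP_STORE` is a hypothetical in-memory stand-in for the temporary database, and `store_callset_ids` and `beacon_response` are made-up helper names.

```python
import uuid

# Hypothetical in-memory stand-in for the temporary database of the prototype.
TMP_STORE = {}


def store_callset_ids(callset_ids):
    """Store matched callset ids under a fresh handle and return the handle."""
    handle = str(uuid.uuid4())
    TMP_STORE[handle] = {"query_key": "id", "query_values": list(callset_ids)}
    return handle


def beacon_response(exists, callset_ids):
    """Build a minimal response that exposes only the handle, never the ids."""
    response = {"exists": exists, "info": {}}
    if callset_ids:
        response["info"]["callset_access_handle"] = store_callset_ids(callset_ids)
    return response


resp = beacon_response(True, ["PGX_AM_CS_GSM511473", "PGX_AM_CS_GSM188255"])
handle = resp["info"]["callset_access_handle"]
print(handle in TMP_STORE)
```

The point of the design is that the public response carries an opaque handle only; what can be done with it is decided later, at data access time.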

See Beacon+ => CNV example => handover in response table => ...; the current concept is detailed in these slides.

mbaudis commented 6 years ago

Following the comment/request #157 from @mfiume, we should work on specifying the Handover structure with e.g. the DOS use case.

Working assumptions for the structure of the Handover protocol extension now are that:

The Handover implementation proposal addresses #157 and is related to #107.

As a reminder, a simplified implementation has been prototyped for the Beacon+ resource and is conceptually documented here, though the Handover is now assumed to be an object rather than the plain callset_access_handle used in the demonstrator.

mfiume commented 6 years ago

@mbaudis what if the object lives on a different server from the one that is generating the response?

Here, is the access key used to ID the object or does it comprise the authentication information required to fetch it, or both?

What about having a url in the handover struct to point to the payload?

Can you provide an example of how the authentication procedure would be provided? I agree that this would be very helpful to encode as a hint, just wondering how you'd approach it.

mbaudis commented 6 years ago

@mfiume I don't think that this would be part of Beacon, but the general idea would be that the "handoff" key would point to whatever action is then executed. It doesn't really matter which server the data resides on; this is resolved from data_access_handle and selected "action". The Beacon itself could expose a vocabulary of actions, so that a distributed query could e.g. be run over many nodes.

Sure, the handover object could be a url; but the url should not expose ids or the like, it should just point to a resolver which can then work out which data objects are referenced. Basically the same as above, with

    url: "https://beacondeliver.mygenomecollection.org/handover/30822e80-8ef8-4ac9-af5d-304aa7f8c1dd"

instead of

    access_key: "30822e80-8ef8-4ac9-af5d-304aa7f8c1dd"

Authentication could be provided in OAuth etc., and the resolver would match credentials to access rights. This would allow the layered access of public beacon query + limited data retrieval.
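A resolver of this kind might look like the sketch below. Everything in it is an assumption for illustration: the access levels (`public`, `registered`, `controlled`), the mapping of actions to levels, and the `resolve` function are hypothetical and not part of Beacon or of our implementation. Real credential checking (OAuth etc.) is reduced to a plain `user_level` string.

```python
# Hypothetical access levels, ordered from least to most privileged.
LEVELS = ["public", "registered", "controlled"]

# Hypothetical action vocabulary with the minimum level each action needs.
ACTION_REQUIRES = {
    "histogram": "public",
    "biosamples": "registered",
    "variants": "controlled",
}


def allowed_actions(user_level):
    """Return the actions available at the given access level."""
    rank = LEVELS.index(user_level)
    return sorted(
        action
        for action, needed in ACTION_REQUIRES.items()
        if LEVELS.index(needed) <= rank
    )


def resolve(handle, action, user_level, store):
    """Look up the stored record for a handle if the action is permitted."""
    if handle not in store:
        raise KeyError("unknown or expired handle")
    if action not in allowed_actions(user_level):
        raise PermissionError(f"action '{action}' not allowed for '{user_level}'")
    return store[handle]


store = {"30822e80-8ef8-4ac9-af5d-304aa7f8c1dd": {"query_values": []}}
print(allowed_actions("public"))
```

The handle itself grants nothing; the same handle yields different retrieval options depending on who presents it.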

Our testbed implementation

In our current implementation, the callset_access_handle points to a temporary DB, where the document then holds the details:

Now data can be retrieved by creating different style queries from this.

1. Getting the callset ids:

    db.querybuffer.findOne({_id:'966fc3c2-5a11-11e8-bf6d-8f10af00a547'})

... would deliver the document shown. This has its own:
* database and collection to query 
    - `"query_db" : "arraymap_ga4gh"`
    - `"query_coll" : "callsets"`
* attribute name
    - `"query_key" : "id"`
* attribute values
    - `"query_values" : [ ... ]`

If you now follow the original GA4GH schema, you can retrieve e.g. all biosample ids by querying:

    db.callsets.find({id:{$in:["PGX_AM_CS_GSM511473","PGX_AM_CS_GSM188255"]}},{biosample_id:1})



... etc., and then get the biosample data; similarly for all variants from the matching callsets etc.

But this requires a standardised data structure in the `handover` delivery (here the GA4GH schema - which we use); or one starts to define other endpoints (and provides this with the Beacon response's handover info).
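To show how the stored document maps onto the second query, here is a small sketch that turns such a document into a MongoDB filter. The `build_query` helper is hypothetical, and the example document is assembled for illustration: the `query_values` are the two ids from the query above, not the actual contents of the stored record.

```python
def build_query(buffer_doc):
    """Turn a stored handover document into (database, collection, filter)."""
    query_filter = {
        buffer_doc["query_key"]: {"$in": list(buffer_doc["query_values"])}
    }
    return buffer_doc["query_db"], buffer_doc["query_coll"], query_filter


# Illustrative document; the real record's values are not shown in this thread.
doc = {
    "_id": "966fc3c2-5a11-11e8-bf6d-8f10af00a547",
    "query_db": "arraymap_ga4gh",
    "query_coll": "callsets",
    "query_key": "id",
    "query_values": ["PGX_AM_CS_GSM511473", "PGX_AM_CS_GSM188255"],
}

db_name, coll_name, query_filter = build_query(doc)
print(db_name, coll_name, query_filter)
```

With a driver such as pymongo, the result could presumably be used as `client[db_name][coll_name].find(query_filter, {"biosample_id": 1})`, which mirrors the shell query above.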

It is all rather trivial if one keeps to the basic principles of a schema that was developed over years, without enforcing some of its more esoteric "recapitulate VCF column format" ideas.

Oh well...

mbaudis commented 5 years ago

Updated scenario: providing a url + label handover list for direct access to the identified resources.

We have now implemented this scenario, for "one click" actions, based on the variants/callsets/samples identified in the Beacon query.

Example (this is the excerpt from the Beacon response):

"datasetAlleleResponses": [
  {
    "callCount": 163,
    "datasetId": "arraymap",
    "error": null,
    "exists": true,
    "externalUrl": "https://beacon.progenetix.org/beacon/info/",
    "frequency": 0.157,
    "handover": [
      {
        "action": "create CNV histogram from matched callsets",
        "label": "Histogram",
        "url": "/beaconplus-server/beacondeliver.cgi?do=histogram&accessid=2a0136df-dc49-11e8-a927-8d34da1c5bc0"
      },
      {
        "action": "export all biosample data of matched callsets",
        "label": "Biosamples",
        "url": "https://beacon.progenetix.org/beaconplus-server/beacondeliver.cgi?do=biosamples&accessid=2a0136df-dc49-11e8-a927-8d34da1c5bc0"
      },
      {
        "action": "export all variants of matched callsets",
        "label": "Callsets",
        "url": "/beaconplus-server/beacondeliver.cgi?do=variants&accessid=2a0136df-dc49-11e8-a927-8d34da1c5bc0"
      },
      {
        "action": "retrieve matching variants",
        "label": "Variants",
        "url": "/beaconplus-server/beacondeliver.cgi?do=variants&accessid=2a01d0bc-dc49-11e8-a927-a8c3673772cb"
      }
    ],
    "info": {
      "callset_access_handle": "2a0136df-dc49-11e8-a927-8d34da1c5bc0",
      "description": "The query was against database \"arraymap\", variant collection \"variants\". 163 matched callsets for 152 distinct variants. Out of 51820 biosamples in the database, 1038 matched the biosample query; of those, 163 had the variant.",
      "payload": null
    },
    "sampleCount": 163,
    "variantCount": 152
  }
],

This
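A client consuming such a response has to cope with both relative and absolute handover urls, as in the excerpt. A minimal sketch, assuming that relative urls are resolved against the host given in `externalUrl` (the thread does not define this; it is my assumption, as is the `absolute_links` helper):

```python
from urllib.parse import parse_qs, urljoin, urlparse

# Assumed base for resolving relative urls: the externalUrl of the response.
BASE = "https://beacon.progenetix.org/beacon/info/"

# Two of the handover entries from the excerpt above.
handover = [
    {
        "label": "Histogram",
        "url": "/beaconplus-server/beacondeliver.cgi?do=histogram&accessid=2a0136df-dc49-11e8-a927-8d34da1c5bc0",
    },
    {
        "label": "Biosamples",
        "url": "https://beacon.progenetix.org/beaconplus-server/beacondeliver.cgi?do=biosamples&accessid=2a0136df-dc49-11e8-a927-8d34da1c5bc0",
    },
]


def absolute_links(base, entries):
    """Map each handover label to an absolute url."""
    return {entry["label"]: urljoin(base, entry["url"]) for entry in entries}


links = absolute_links(BASE, handover)
for label, url in links.items():
    # Extract the requested action from the query string for display.
    action = parse_qs(urlparse(url).query)["do"][0]
    print(label, action, url)
```

`urljoin` leaves already absolute urls untouched and replaces the path of the base for urls starting with `/`.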

mbaudis commented 5 years ago

@sdelatorrep I would suggest also adding a label attribute to the handover object.

Reasoning:

Also, this would be an interesting scenario in which we have to decide if we should implement the general OntologyClass concept, which finds its way into other parts of GA4GH schemas. So, the schema could then look like:

Handover:
  type: object
  required:
    - type
    - url
  properties:
    type:
      type: object
      required:
        - id
      properties:
        id:
          type: string
          description: The use of an ontology term, in CURIE syntax, is strongly recommended. Use “CUSTOM” when no ontology is available.
          default: CUSTOM
        label:
          type: string
          description: A short label for the handover action. In the case of an ontology, this would be the "preferred Label".
    url:
      type: string
      description: URL endpoint to where the handover process could progress (in RFC 3986 format).
    note:
      type: string
      description: Additional human readable information or description about the handover.

(The type here is a bit confusing, both as attribute name and as keyword... Alas, this is just for discussion.)
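For discussion, a handover object could be checked against the draft above along these lines. This is a hand-written sketch of the required/optional fields only, not a schema validator, and `validate_handover` is a hypothetical name; the example type id `CUSTOM` is the default given in the draft.

```python
def validate_handover(obj):
    """Return a list of problems found when checking obj against the draft."""
    errors = []

    handover_type = obj.get("type")
    if not isinstance(handover_type, dict):
        errors.append("type: required object")
    else:
        if not isinstance(handover_type.get("id"), str):
            errors.append("type.id: required string")
        if "label" in handover_type and not isinstance(handover_type["label"], str):
            errors.append("type.label: must be a string")

    if not isinstance(obj.get("url"), str):
        errors.append("url: required string")
    if "note" in obj and not isinstance(obj["note"], str):
        errors.append("note: must be a string")

    return errors


good = {
    "type": {"id": "CUSTOM", "label": "Histogram"},
    "url": "/beaconplus-server/beacondeliver.cgi?do=histogram&accessid=2a0136df-dc49-11e8-a927-8d34da1c5bc0",
    "note": "create CNV histogram from matched callsets",
}
print(validate_handover(good))
print(validate_handover({"type": {"label": "Histogram"}}))
```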

sdelatorrep commented 5 years ago

Hi @mbaudis, looks good! Though we think it's not necessary to create an object for the type field. Please check our proposal in PR #230.

sdelatorrep commented 5 years ago

PR #230 merged.