ga4gh-discovery / ga4gh-case-discovery

A framework for searching genomic data sharing services
Apache License 2.0
8 stars 5 forks source link

Review RC5 feedback for common issues and propose solutions #78

Closed Relequestual closed 5 years ago

Relequestual commented 5 years ago

After feedback has been collected for RC5, review this feedback as a group, looking for common issues, and document possible solutions.

Blocked by #77

Relequestual commented 5 years ago

@jhkbg Your feedback document at https://github.com/ga4gh-discovery/ga4gh-case-discovery/blob/master/documents/RC5-feedback/CaseLog.md details that you added...

  1. Metadata on for the request to include contact information
  2. Additional metadata on responses to include information on the associated study or cohort
  3. For each query, include the total number of samples; or count of non-matching samples
  4. Return the counts of matching/non-matching samples grouped by phenotype

Regarding 1 and 2.

There is an existing schema for contact information in responses per record at https://github.com/ga4gh-discovery/ga4gh-case-discovery/blob/master/json_schema/schemas_source/components/record_meta/contact.yaml

Allowing contact information to be included in the request is a fairly trivial change.

Do you think it would be useful to be able to group results?

You could then apply contact information per group of result as opposed to per individual record. (For DECIPHER, we are likely to have different information per result, and few groupings like this, but it sounds like your use case expects to have a number of grouped responses, right?)


Regarding 3 and 4, would you be able provide samples of this data? Preferably, also a JSON Schema? (If not, I'm happy to help make one, but you'll need to confirm it supports the data you have.)

Relequestual commented 5 years ago

@colinveal Your feedback document at https://github.com/ga4gh-discovery/ga4gh-case-discovery/blob/master/documents/RC5-feedback/CafeVariome.md you detailed that you...

  1. Added authentication content to the payload
  2. Did phenotype matching using a similarity score
  3. Added an Entity-Attriute-Value query component for more powerful queries
  4. Returned multiple datasets
  5. Returned differing authentication responses for datasets, even if no records were returned

In terms of 1.

It looks like you've added a Cafe Variome specific security object to the payload.

My feeling was to initially provide a standardised approach to allow generic provision of a barer token to services. And at a later point, when there was more feedback on use cases and potential consensus on approach (Maybe also across other GA4GH services), to revisit authentication.

In the current specification...

This specification is expected to support:

Open data sources that do not require authentication
Protected data sources that do require authentication

In the situation where a server requires authentication for access, the client MUST provide an authentication token located in a header field X-Auth-Token.

The authentication token MAY be a predefined token generated by the server specifically for the client, which they MUST exchange via a secure method such as GPG encryption using public and private keys.

The authentication token MAY be an (O-Auth 2.0 Bearer Token)[https://oauth.net/2/bearer-tokens/] provided in response to an O-Auth exchange, although how to achieve this is not specified. O-Auth 2.0 is recommended by GA4GH for authentication and authorisation purposes.

https://github.com/ga4gh-discovery/ga4gh-case-discovery/blob/11b2bff9463fc3d48601014858ee567c1b32f75f/specification.md#security

I read this to understand that a client may pre-auth (assumingly using oauth or similar) to a specific service in order to attain the barer token, which can then be bundled in the HTTP headers for the specific service. This approach keeps the payload consistent rather than needing to make payload changes per queryable service (which is IMHO the preference).

It looks like as well as a bearer token, you have required a specific API key to access the query endpint. Is this the case?

GA4GH Security group propose OAuth 2 or OpenID connect.

Given a bearer token can be any string, you could encode your fields in a JWT, which would include the bearer token issued by OAuth 2.

Would you be amenable to a proposed change of specifying that the authentication token MAY be a JWT which encodes the OAuth 2 token with any additional tokens or metadata?

My feeling is we should not define any hard rule on this till more groups with authentication concerns are able to weigh in. Does that sound OK?


Regarding 2.

Hum. The Case Discovery API is not designed to do "matching" but to allow people to ask a specific question. This is explicitly different to how MME works, where the end user has zero control of the query mechanisms. That said, I had invisaged a search component later that allowed the client to specify a more fuzzy logic searching function. Is this what you've proposed here?

I'm curious, the current live version of Cafe Variome Central doesn't offer this sort of query method. Is this something you're adding?

Such functionality may be a lot to ask for a first version, but could be viable for a future version.

MME detailed a UI score approach, but left each system to their own implementation.

Has there been any analysis as to the most performant algorithms to use?


Regarding 3.

My feeling on this is we would not look to include this as part of the standard for a first version release. It hugley increases the barrier to entry, making it more complex for implementers. This is much simpler for you as it aligns with your existing implementation.

That being said, it would be fine for you to publish additional schemas for non-standard components you support.


Regarding 4.

It looks like being able to return multiple sets of data as opposed to just ONE set of data is a common requirement from implementers.

I'll assume the example response structure is your current API structure as by means of an example of how you'd like to return data inside the Case Discovery API response JSON structure.


Regarding 5.

My preference on this would be to define a structure in the response meta which expresses that additional data repositires are known but could not be queried, with provided reasons.

I would not want to see data sets which were inaccessable to be included in a response or records object. My feeling is in doing so, it would lessen the burden to implementers in terms of processing response records if they do not provide the ability for a client to perform further authentication.

Having said this, it might be this should remain a proprietary component defined by you till there are other services which wish to do something simlar (so there's some sort of consesus).

colinveal commented 5 years ago

@colinveal Your feedback document at https://github.com/ga4gh-discovery/ga4gh-case-discovery/blob/master/documents/RC5-feedback/CafeVariome.md you detailed that you...

1. Added authentication content to the payload

2. Did phenotype matching using a similarity score

3. Added an Entity-Attriute-Value query component for more powerful queries

4. Returned multiple datasets

5. Returned differing authentication responses for datasets, even if no records were returned

In terms of 1.

It looks like you've added a Cafe Variome specific security object to the payload.

My feeling was to initially provide a standardised approach to allow generic provision of a barer token to services. And at a later point, when there was more feedback on use cases and potential consensus on approach (Maybe also across other GA4GH services), to revisit authentication.

In the current specification...

This specification is expected to support:

Open data sources that do not require authentication
Protected data sources that do require authentication

In the situation where a server requires authentication for access, the client MUST provide an authentication token located in a header field X-Auth-Token. The authentication token MAY be a predefined token generated by the server specifically for the client, which they MUST exchange via a secure method such as GPG encryption using public and private keys. The authentication token MAY be an (O-Auth 2.0 Bearer Token)[https://oauth.net/2/bearer-tokens/] provided in response to an O-Auth exchange, although how to achieve this is not specified. O-Auth 2.0 is recommended by GA4GH for authentication and authorisation purposes.

https://github.com/ga4gh-discovery/ga4gh-case-discovery/blob/11b2bff9463fc3d48601014858ee567c1b32f75f/specification.md#security

I read this to understand that a client may pre-auth (assumingly using oauth or similar) to a specific service in order to attain the barer token, which can then be bundled in the HTTP headers for the specific service. This approach keeps the payload consistent rather than needing to make payload changes per queryable service (which is IMHO the preference).

It looks like as well as a bearer token, you have required a specific API key to access the query endpint. Is this the case?

GA4GH Security group propose OAuth 2 or OpenID connect.

Given a bearer token can be any string, you could encode your fields in a JWT, which would include the bearer token issued by OAuth 2.

Would you be amenable to a proposed change of specifying that the authentication token MAY be a JWT which encodes the OAuth 2 token with any additional tokens or metadata?

My feeling is we should not define any hard rule on this till more groups with authentication concerns are able to weigh in. Does that sound OK?

Regarding 2.

Hum. The Case Discovery API is not designed to do "matching" but to allow people to ask a specific question. This is explicitly different to how MME works, where the end user has zero control of the query mechanisms. That said, I had invisaged a search component later that allowed the client to specify a more fuzzy logic searching function. Is this what you've proposed here?

I'm curious, the current live version of Cafe Variome Central doesn't offer this sort of query method. Is this something you're adding?

Such functionality may be a lot to ask for a first version, but could be viable for a future version.

MME detailed a UI score approach, but left each system to their own implementation.

Has there been any analysis as to the most performant algorithms to use?

Regarding 3.

My feeling on this is we would not look to include this as part of the standard for a first version release. It hugley increases the barrier to entry, making it more complex for implementers. This is much simpler for you as it aligns with your existing implementation.

That being said, it would be fine for you to publish additional schemas for non-standard components you support.

Regarding 4.

It looks like being able to return multiple sets of data as opposed to just ONE set of data is a common requirement from implementers.

I'll assume the example response structure is your current API structure as by means of an example of how you'd like to return data inside the Case Discovery API response JSON structure.

Regarding 5.

My preference on this would be to define a structure in the response meta which expresses that additional data repositires are known but could not be queried, with provided reasons.

I would not want to see data sets which were inaccessable to be included in a response or records object. My feeling is in doing so, it would lessen the burden to implementers in terms of processing response records if they do not provide the ability for a client to perform further authentication.

Having said this, it might be this should remain a proprietary component defined by you till there are other services which wish to do something simlar (so there's some sort of consesus).

@Relequestual Regarding 1. this sounds fine.

Regarding 2. Sorry I wasn't clear about this, but this isn't a hands off matching process. The component allows the user to explicitly state which similarity measure and associated parameters, for example "resnick" and an MICA > 2, and can of course be logically combined with other 'non-matching' components. We have a number of customised Cafe Variomes for specific projects, one of which requires similarity matching and since we have adopted the api needed a way to transmit the queries. We currently have "Resnick", "Lin", "Relative", "Jian-Congrath" and good old "Jaccard" similarity measures available to users.

Regarding 3. In a way this is just a "catch all" for items that don't fit into other components, so as the component library is enriched the need for this would decrease.

Regarding 4. Not too worried about the format as long as there is a way to return multiple datasets, happy to try alternatives.

Regarding 5. Your preference to use reponse meta sounds good. Unless others are also interested and wanted to have a standard set of codes.

Relequestual commented 5 years ago

I've created two new issues, documented at https://github.com/ga4gh-discovery/ga4gh-case-discovery/issues/79