ga4gh-discovery / data-connect

Standard for describing and searching biomedical data developed by the Global Alliance for Genomics & Health.
Apache License 2.0

GA4GH Table Discovery API with limited search? #98

Closed ifokkema closed 3 years ago

ifokkema commented 3 years ago

Hi all, I'm aiming to add a GA4GH Table Discovery API on top of the LOVD3 software. However, I can't currently implement the GA4GH Search API for several reasons: it's too complex, it would take a lot of work to ensure LOVD3's authorization rules are always applied, and I don't have the hours available for that. Yet, I do need some simple filtering.

My initial idea was to provide support for:

I would have a variants.hg38 table as well. I'd split this into two tables because I'd need to paginate, and the most logical choice would be to paginate on genomic location (more on that below).

Simple filters that I'd want are:

  1. Filter on license attached to data,
  2. Filter on genomic location,
  3. Filter on gene symbol,
  4. Filter on new data only, so I won't have to reload all the data every time I query it.

Since filters are not part of the GA4GH Table Discovery API but of the GA4GH Search API, which I can't implement, I need to find a way around this without breaking the specs. I could achieve 1 and 3 by providing more tables, although that is of course not what "tables" are meant for. I can achieve 2 through pagination, which allows me to point to any URL, so I can embed the chromosomal-location filters there (a sketch follows below). Finally, I'm lost on how to implement 4 without introducing a filter that would break the GA4GH Table Discovery specs.
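For illustration only, here is a minimal sketch of that pagination trick, expressed as a Python dict (the table name, URL, and query parameter names are all hypothetical): the data response's next_page_url simply carries the location filter as ordinary query parameters, so a client following the link never needs to know a filter is involved.

# Hypothetical /table/variants.hg38/data response. The "next_page_url"
# smuggles a chromosomal-location filter in as plain query parameters;
# clients just follow the link as they would for any other page.
page = {
    "data_model": {"$ref": "https://example.org/table/variants.hg38/info"},
    "data": [
        {"id": "0000123", "chromosome": "17", "position": 1234567,
         "DNA": "c.100C>T"},
    ],
    "pagination": {
        "next_page_url": "https://example.org/table/variants.hg38/data"
                         "?chromosome=17&position_after=1234567"
    },
}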

Any thoughts on this?

jfuerth commented 3 years ago

Thanks for these questions, and thanks for helping us understand your use case!

First I should ask, have you considered htsget? It might be a closer match for your requirements.

Either way, you've posed some interesting questions about how much filtering, caching, and optimization can be shoehorned into the static Tables side of the Search API.

Purpose of the Tables API

Up to this point, as you've observed, the Tables side of the API just provides information about which tables are available, what their rows look like, what their attributes mean, and an unfiltered dump of all the data. It was a design goal to make this as easy as possible to implement, so we left out anything approaching dynamic behaviour. For example, these implementations of the Search API are just carefully named JSON files in cloud buckets: the Collections of Phenopackets and the Personal Genome Project of Canada. The SQL querying is provided via a Presto (now [Trino](https://trino.io/)) server, and the performance is adequate because the data is stored in the same cloud and region as the Presto instance. This setup is roughly equivalent to how Amazon Athena works. Given enough parallel processing, it can be fast without indexes.
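As a rough sketch of how a consumer sees that static setup (the bucket URL is hypothetical; the paths follow the Tables API's /tables, /table/{name}/info and /table/{name}/data pattern):

import requests

BASE = "https://storage.example.org/my-dataset"  # hypothetical bucket

# /tables is one pre-rendered JSON file listing the available tables
tables = requests.get(f"{BASE}/tables").json()["tables"]

for t in tables:
    name = t["name"]
    # /table/{name}/info is another static file describing the data model
    info = requests.get(f"{BASE}/table/{name}/info").json()
    # /table/{name}/data is a chain of static pages linked by next_page_url
    url = f"{BASE}/table/{name}/data"
    while url:
        page = requests.get(url).json()
        for row in page.get("data", []):
            pass  # consume rows here
        url = page.get("pagination", {}).get("next_page_url")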

Using pages as partitions

Natural partitions in the data (for example, by chromosome, as you've described) could certainly help a Trino connector ignore pages that wouldn't have any matches for the current filter, and could also help other consumers who are aware of the partition structure.

If you're up for it, we could explore an extension to the pagination system together which would let consumers understand what partitions exist in the data, which attributes the data is already sorted by, and even how to jump directly to the pages of interest. We've experimented with something very basic along these lines with the index attribute here.

Imagine something like this on a table, but also advertising information about partitions:

{
  "name": "hpo_phenopackets",
  "description": "Table / directory containing Phenopacket JSON files for hpo_phenopackets https://zenodo.org/record/3905420#.X3Sd2pNKj6h",
  "data_model": {
    "$id": "https://storage.googleapis.com/ga4gh-phenopackets-example/phenopacket-with-id",
    "$schema": "http://json-schema.org/draft-07/schema#",
    "description": "Phenopacket JSON data model",
    "properties": {
      "id": {
        "type": "string",
        "description": "An identifier specific for this phenopacket"
      },
      "phenopacket": {
        "$ref": "https://schemablocks.org/schemas/sb-phenopackets/current/Phenopacket.json"
      }
    }
  },
  "index": {
    "ordered_by": [ "id" ],
    "pages": [
      {
        "url": "https://storage.googleapis.com/ga4gh-phenopackets-example/flat/table/hpo_phenopackets/data_1",
        "partitions": {
          "id": {
            "min": "PMID:27435956-Naz_Villalba-2016-NLRP3-proband",
            "max": "PMID:29174093-Szczałuba-2018-GNB1-proband"
          },
          "subject.sex": {
            "min": "FEMALE",
            "max": "FEMALE"
          }
        }
      },
      {
        "url": "https://storage.googleapis.com/ga4gh-phenopackets-example/flat/table/hpo_phenopackets/data_2",
        "partitions": {
          "id": {
            "min": "PMID:26833330-Jansen-2016-TMEM199-F1-II2",
            "max": "PMID:27974811-Haliloglu-2017-PIEZO2-Patient"
          },
          "subject.sex": {
            "min": "MALE",
            "max": "MALE"
          }
        }
      }
    ]
  }
}

This would indicate that the data is partitioned by id and subject.sex, and that there are two pages the data consumer could jump to directly.

I'm just making this up as an illustration of the idea; don't take the exact format too seriously.
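To make the intent concrete, here is a consumer-side sketch in Python (with the index shape taken from the example above) of how a client could skip pages whose advertised range can't contain the value it is looking for:

def candidate_pages(table_info, field, value):
    """Return the URLs of pages whose advertised partition range for
    `field` could contain `value`. Pages that advertise no range for
    that field can't be ruled out, so they are kept."""
    urls = []
    for page in table_info.get("index", {}).get("pages", []):
        rng = page.get("partitions", {}).get(field)
        if rng is None or rng["min"] <= value <= rng["max"]:
            urls.append(page["url"])
    return urls

# With the example above:
#   candidate_pages(table, "subject.sex", "FEMALE") -> only data_1's URL
#   candidate_pages(table, "subject.sex", "MALE")   -> only data_2's URL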

Caching the partitions

We could add mentions in the spec of conditional HTTP requests (ETag/If-None-Match and Last-Modified/If-Modified-Since) to help clients take advantage of the built-in caching behaviour of HTTP. You would be able to cache partitions locally using any cache-aware HTTP client library, and it would automatically perform conditional requests to avoid re-fetching pages that haven't changed. This is also compatible with cloud storage APIs.
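A rough sketch of what that looks like client-side (a cache-aware HTTP library would normally do this transparently; this just spells out the revalidation step):

import requests

etags = {}   # url -> ETag from the last successful fetch
cached = {}  # url -> parsed body from the last successful fetch

def fetch_page(url):
    headers = {}
    if url in etags:
        headers["If-None-Match"] = etags[url]   # conditional request
    resp = requests.get(url, headers=headers)
    if resp.status_code == 304:                 # Not Modified: reuse cache
        return cached[url]
    resp.raise_for_status()
    if "ETag" in resp.headers:
        etags[url] = resp.headers["ETag"]
    cached[url] = resp.json()
    return cached[url]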

Making the data transfer smaller

You can also pre-gzip the pages of JSON and tell your web server to serve them with the Content-Encoding: gzip header. Any server should support this, including the major cloud storage APIs.
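For example, pre-compressing a page at publish time could look like the sketch below; how the Content-Encoding: gzip header is then attached to the object depends on your web server or storage provider.

import gzip, json

page = {"data": [{"id": "example-row"}], "pagination": {}}

# Write the page once, already gzip-compressed; serve the resulting bytes
# as-is with "Content-Encoding: gzip" so clients decompress transparently.
with gzip.open("data_1.json.gz", "wt", encoding="utf-8") as f:
    json.dump(page, f)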

Enforcing authorization rules

Moving away from the static tables for a moment, let's think about a SQL-based implementation.

With our Trino-based implementation, we have done a few proofs of concept.

  1. Respond to a query request with an HTTP auth challenge specifying that the client needs to supply a credential to the underlying storage and retry the request (regular www-authenticate stuff, but with more information about which resource we need the keys for). This way, the server only has access to the data the requesting user has access to.
  2. Use more traditional OAuth with JWT bearer tokens that contain scopes, and implement the Trino SystemAccessControl interface. In this setup, the server already has access to all the files, but it is capable of making access decisions at every level of granularity.
    • This is quite straightforward to implement at the catalog, schema, and table level
    • This interface also supports expression-based row and column filters. You basically supply SQL snippets that get inserted into the WHERE clause.

Even if you don't have time to set up a production-ready instance, it might be worthwhile considering a Trino-based Search implementation. We've provided an open source service that wraps Trino to provide the Search API, but honestly it's not that much work to make such an adapter from scratch in your preferred programming language.
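As a very rough illustration of the "adapter from scratch" idea, a single /search endpoint could forward submitted SQL to Trino and re-shape the result as a table data response. Everything here is an assumption for the sketch: the Flask and trino client libraries, the host/port, and the omission of the auth checks discussed above.

from flask import Flask, jsonify, request
import trino  # pip install trino

app = Flask(__name__)

@app.post("/search")
def search():
    # Forward the submitted SQL to Trino and re-shape rows as objects.
    # (A real adapter would also enforce authorization and return a
    # data_model alongside the rows.)
    query = request.get_json()["query"]
    conn = trino.dbapi.connect(host="localhost", port=8080, user="lovd")
    cur = conn.cursor()
    cur.execute(query)
    columns = [d[0] for d in cur.description]
    rows = [dict(zip(columns, r)) for r in cur.fetchall()]
    return jsonify({"data": rows, "pagination": {}})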

ifokkema commented 3 years ago

Thanks a lot for the detailed response!

First I should ask, have you considered htsget? It might be a closer match for your requirements.

Thanks for the suggestion! Unfortunately, it doesn't match because it's VCF-based while LOVD is HGVS-based, and I have more complex data objects (classifications, effects on RNA and protein, references, and phenotype information) to share.

For example, these implementations of the Search API are just carefully named JSON files in cloud buckets:

Indeed, I saw that in the docs when I was wondering about the /tables vs. /table/... URLs; I think it's quite brilliant.

If you're up for it, we could explore an extension to the pagination system together which would let consumers understand what partitions exist in the data, which attributes the data is already sorted by, and even how to jump directly to the pages of interest.

Sounds good, I'm in!

We've experimented with something very basic along these lines with the index attribute here.

Imagine something like this on a table, but also advertising information about partitions:

This looks good! I really like the idea. Some questions. First, would it be possible to advertise more than one index on a table, for example one ordered by id and another by position, like this?

{
    "index": [
        {
            "ordered_by": [ "id" ],
            "pages": [{...}]
        },
        {
            "ordered_by": [ "position" ],
            "pages": [{...}]
        }
    ]
}

Second, the meaning of a partition's min/max values for fields the pages are not sorted on isn't clear to me. For the field the pages are sorted on, I understand it allows the client to determine which page to access for a given id, but for subject.sex it doesn't provide any relevant information, does it?

Implementation-wise, I realize this means the entire database should already be pre-sorted and its pages/partitions predefined. This may be an issue in a live database with constant data manipulation going on, but I can solve it in my case by just providing the chromosomes here. Not having the pages numbered means I can provide starting points without promising that there are no further pages until the next chromosome.
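In other words, something like this sketch (hypothetical URLs; one advertised entry point per chromosome, with further pages chained by next_page_url rather than enumerated up front):

# Hypothetical index for a live variants.hg38 table: each advertised page
# is only a starting point for one chromosome; the server may still chain
# additional pages via next_page_url before the next chromosome begins.
variants_index = {
    "ordered_by": ["chromosome", "position"],
    "pages": [
        {
            "url": "https://example.org/table/variants.hg38/data?chromosome=1",
            "partitions": {"chromosome": {"min": "1", "max": "1"}},
        },
        {
            "url": "https://example.org/table/variants.hg38/data?chromosome=2",
            "partitions": {"chromosome": {"min": "2", "max": "2"}},
        },
        # ... and so on, one entry per chromosome
    ],
}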

We could add mentions in the spec about conditional HTTP requests (eTag/If-Match and Date/If-Modified-Since) to help clients take advantage of the built-in caching behaviour of HTTP.

Great idea! Somehow I didn't think about using HTTP request headers to specify filters.

  1. Respond to a query request with an HTTP auth challenge specifying that the client needs to supply a credential to the underlying storage and retry the request (regular www-authenticate stuff, but with more information about which resource we need the keys for). This way, the server only has access to the data the requesting user has access to.

This would imply, in my case, that MySQL credentials would need to be passed. Creating a separate (read-only) MySQL account for every LOVD user is unfortunately not feasible, but for other implementations this could work very well.

  2. Use more traditional OAuth with JWT bearer tokens that contain scopes, and implement the Trino SystemAccessControl interface. In this setup, the server already has access to all the files, but it is capable of making access decisions at every level of granularity.

I need to think about this a bit more. As LOVD is distributed software, I'd like a solution that I can simply ship with LOVD and that won't require additional work from the sysadmin of the remote server. This would greatly benefit the adoption of new LOVD releases by the community, and thus the uptake of the new API. Trino might make that hard, although I like the idea of not having to spend too much time on the API and using existing tools where possible.

    • This interface also supports expression-based row and column filters. You basically supply SQL snippets that get inserted into the WHERE clause.

Nice! LOVD's authorization is complex and needs additional tables to be joined depending on the user's level, but I'm sure there's a solution for that, too.

jfuerth commented 3 years ago

Thanks for your feedback. I've split out the new feature requests we've generated here. I look forward to continuing the discussion on those issues. Please open more if I have missed something!

ifokkema commented 3 years ago

Thanks for your feedback. I've split out the new feature requests we've generated here. I look forward to continuing the discussion on those issues. Please open more if I have missed something!

Excellent, thank you!