ga4gh / ga4gh-server

Reference implementation of the APIs defined in ga4gh-schemas. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

Move datamodel to separate packages #1586

Open david4096 opened 7 years ago

david4096 commented 7 years ago

The layer of code that uses pysam to answer GA4GH methods could be made more portable.

We might package up the datamodel classes that require pysam, since it is probably our heaviest dependency, and then isolate the tests that exercise just those paths.

We could then run the datamodel and data-driven tests, which load up various VCF/BAM files, etc., in an upstream repository to guarantee we are translating properly from the data to our datamodel. This reduces the testing surface of the server and better demonstrates usage of the dependencies.

Then, when adding via the registry, one would specify the classname in the datamodel being used to access the resource. Managing the available class names would amount to adding them to the proper namespace in the underlying package and exposing the methods by name (e.g. ga4gh.datamodel.HtsLibVariantSet provides search_variants and search_call_sets).
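
As a rough illustration of the kind of interface I have in mind (a sketch only, not the current code; the method signatures are simplified assumptions), a pysam-backed datamodel class living in its own package might look like:

# Sketch only: a minimal datamodel interface that a pysam-backed class
# could implement from a separate ga4gh-pysam package. Signatures are
# simplified assumptions, not the server's actual ones.
import pysam

class AbstractVariantSet(object):
    """The interface the server backend dispatches to by method name."""

    def search_variants(self, reference_name, start, end):
        raise NotImplementedError()

    def search_call_sets(self, name=None):
        raise NotImplementedError()

    def get_variant_by_id(self, local_id):
        raise NotImplementedError()

class HtsLibVariantSet(AbstractVariantSet):
    """Answers variant queries from an indexed VCF via pysam."""

    def __init__(self, vcf_path):
        self._vcf = pysam.VariantFile(vcf_path)

    def search_variants(self, reference_name, start, end):
        # Walk the indexed region and translate each pysam record into
        # its GA4GH protocol representation (translation elided here).
        for record in self._vcf.fetch(reference_name, start, end):
            yield record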

I provide this as a point of discussion and a place to offer experimental implementations. @andrewjesaitis has split out the variants datamodel in the past when prototyping a Python gRPC implementation.

To do this properly we may need to decouple the identifier scheme to provide local identifiers within a set (get_variant_by_id). The assumption that a dataset owns its data will need to be removed, but some mechanism for randomly accessing variants will still need to be provided.

The server itself will have dependencies for managing its registry, performing authorization/authentication, and de/serialization. The datamodel classes provide the controllers for databases and files, and will be imported during testing. When running full-stack tests, all of these modules will be put through their paces. But in the server we won't need to run the unit tests on the datamodel or the data-driven tests, as these will have been run elsewhere and guaranteed passing for any master-branch software.

This is an approach to solving https://github.com/ga4gh/server/issues/1102. In principle we can provide the registry interface for adding classes by name without moving the datamodel code out. But without a nice developer interface for others wanting to create controllers that can satisfy API requests, it will be difficult to explain.

ga4gh_repo add-variantset registry.db dataset1 myvcf.vcf.gz
# Please install `ga4gh-pysam` to be able to access variant data in VCF.
ga4gh_repo add-rnaquantificationset registry.db dataset1 postgres://1.1.1.1/rsem
# Please install `ga4gh-postgresql` to access data stored in postgres.

We can do our best to reason about the file types, etc. But the real power comes in for someone wanting to add a GA4GH API to their database. They create a package called ga4gh-mydatabase, add it to the datamodel namespace, and create a class called MyDatabase that exposes the controller functions used by the backend.
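
As a hedged sketch of what such a connector might look like (the module path, table layout, and use of SQLAlchemy here are all illustrative assumptions, not existing code):

# Hypothetical ga4gh-mydatabase package exposing a connector class.
# Everything here is illustrative: the module path, the table layout,
# and the use of SQLAlchemy are assumptions for the sake of the sketch.
import sqlalchemy

class MyDatabase(object):
    """Answers variant queries from a relational table."""

    def __init__(self, db_url, table_name):
        self._engine = sqlalchemy.create_engine(db_url)
        self._table = table_name

    def search_variants(self, reference_name, start, end):
        query = sqlalchemy.text(
            "SELECT * FROM {} WHERE reference_name = :ref "
            "AND start < :end AND stop > :start".format(self._table))
        with self._engine.connect() as conn:
            result = conn.execute(
                query, {"ref": reference_name, "start": start, "end": end})
            for row in result:
                yield row  # translate the row into a GA4GH Variant here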

Then when adding to the registry they call `ga4gh_repo add-variantset registry.db dataset1 myTableName -C ga4gh.datamodel.MyDatabase`. The classname is stored in the registry so that when a request comes in to access variants, the underlying method is run. This model becomes complicated if one considers an adapter that allows one to add multiple variant sets; supporting that feature would require rethinking how the registry is organized, as multiple database calls would need to be merged to answer a SearchVariantSetsRequest.
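
On the dispatch side, the registry row would only need to hold the dotted classname; something like the following (a sketch, with an assumed row layout) could turn it back into a live object when a request arrives:

# Sketch: resolve a classname stored in the registry into a live object.
# The registry row layout (class_name, data_url, table_name) is assumed.
import importlib

def load_datamodel_class(dotted_name):
    """Turn 'ga4gh.datamodel.MyDatabase' into the class object."""
    module_name, class_name = dotted_name.rsplit(".", 1)
    return getattr(importlib.import_module(module_name), class_name)

def variant_set_for_row(row):
    cls = load_datamodel_class(row["class_name"])
    return cls(row["data_url"], row["table_name"])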

Other than the RPC methods, I believe that "plugins" only need to expose a method like populateFromFile, which tries to create a sensible registry entry for the container from the file's metadata.
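
For the pysam case that hook might do little more than read the header; a sketch (the field names are guesses at what a registry entry would want, not an existing format):

# Sketch: derive a plausible registry entry from a VCF's own metadata.
import pysam

def populate_from_file(vcf_path):
    vcf = pysam.VariantFile(vcf_path)
    return {
        "class_name": "ga4gh.datamodel.HtsLibVariantSet",
        "data_url": vcf_path,
        "reference_names": list(vcf.header.contigs),
        "sample_names": list(vcf.header.samples),
    }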

As a final note, I believe that registry management and ETL processes should be clearly separated. If there is an ETL process to create a ga4gh-sqlite RNAseq dataset, then that should be attached to the datamodel class, not the registry editor. If you are convinced that splitting out the datamodel will be worth the effort, you might also be convinced the registry editor could be packaged and versioned separately.

kozbo commented 7 years ago

There are several ideas here and I think it might help to separate them out. Let me try to summarize and make some comments:

  1. Creating an abstract class layer at the data model with the characteristic that Pysam dependencies are encapsulated

    • This ties right in with the idea of modularizing our server implementation. Keeping our Pysam dependencies encapsulated in the lower layer makes a lot of sense to me. We will not need Pysam if we add a data model class to access a VCF database. I will remind readers of a document I started to track the larger design discussion. Please feel free to contribute and comment on this document.
  2. Providing a mechanism to specify the data model class to use with the repo manager

    • This is an interesting idea but I think it has a couple of problems:
      a. It further complicates an already too complicated repo manager command-line structure.
      b. There are other ways to specify which class to use that should also be considered; using a configuration file, for instance.
  3. Attaching ETL processes to the relevant data model class

    • Yes, absolutely. And this is one of the big selling points of this architecture.
  4. Creating a testing mechanism to cover the different data model implementations

    • Yes, as you say, the tests must be designed for the layers. So we will have a set of tests that validates that the data model works and another set of tests that are specific to that class's implementation.

andrewjesaitis commented 7 years ago

I'm all for this. I think the rub, as @kozbo correctly identified, is the registry management. One idea there would be to move to (or add) a more configuration-centric model of registry control. I'd imagine that in settings.py|json|yml you would basically specify the file or directory path of the data being served, the appropriate backend, and any necessary metadata. Then on startup, the server would verify that the structure specified matches the registry. If not, it would create/remove the required entries in the registry. An additional benefit is a portable "registry" (i.e. the config file) that could be packaged with the data.
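
For concreteness, a settings.py fragment could look something like this (the keys and structure are just a guess, not an existing format):

# Hypothetical settings.py fragment for configuration-centric registry
# control. Keys and layout are illustrative only.
DATASETS = {
    "dataset1": {
        "variant_sets": [
            {
                "name": "1kg-phase3",
                "data_url": "/data/vcf/",  # file or directory being served
                "backend": "ga4gh.datamodel.HtsLibVariantSet",
                "metadata": {"reference_set_name": "GRCh37"},
            },
        ],
    },
}
# On startup the server would diff this structure against registry.db,
# creating or removing entries until the two agree.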

david4096 commented 7 years ago

Thanks for the feedback! Ideally, the datamodel isolates the metadata-gathering portion of verifying the structure. Then the resulting configuration files could be easily generated.

It seems like you both don't like the idea of having to specify the classname when adding via the registry. This is a bit confusing, and perhaps we could structure the plugin architecture around URL schemes instead, so that each connector registers a protocol. That way, instead of having to keep track of classnames, the plugin author creates a pysam:// URL scheme that can be interpreted at registry-add time.

Then, when you go to follow someone's directions online for setting up your server and they show a URL pattern, we can go and download the connector for them.

ga4gh_repo add-variantset registry.db pysam://myvcf.vcf.gz
# The pysam connector requires `ga4gh-pysam`, downloading...
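
Resolving the connector from the scheme could be as small as this (a sketch; the registration mechanism is an assumption):

# Sketch: map URL schemes registered by connector packages to classes.
from urllib.parse import urlparse

CONNECTORS = {}  # scheme -> datamodel class, filled in by installed plugins

def register_connector(scheme, cls):
    CONNECTORS[scheme] = cls

def connector_for(url):
    scheme = urlparse(url).scheme  # "pysam" for pysam://myvcf.vcf.gz
    if scheme not in CONNECTORS:
        raise KeyError(
            "No connector registered for %r; please install it" % scheme)
    return CONNECTORS[scheme]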

I think that providing a better interface to modify the prepopulated values will be a huge win; unfortunately, I believe that the requirements for referential integrity make setting relational metadata difficult (without some automation).

Other APIs solve this by creating a unified write layer, so that one can create an application that manages the curation and lookup of related data. I think you're both moving towards standardizing the document-ingestion interface, which I think will be a good step towards that. Having an endpoint that I can POST that document to will open up a lot of features. I think it is a more sensible repo manager interface than the CLI we provide. No one ever wants to write JSON on the command line.

Maybe our process would improve if we first used the datamodel to generate a document that the registry manager knows how to handle, instead of just putting those values into the class model. Then the repo manager takes a configuration file generated by the data model, and no other flags. The process would be more like "prepare variant set", which generates a header file you can edit to change the metadata, followed by "add container" to the registry. The repo manager would just have to add/remove containers and read in the configuration to reason about what type of container was being added.
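
Roughly, and reusing the populate_from_file idea sketched above (the header format and the registry call here are assumptions, not existing interfaces):

# Sketch of the two-step flow: "prepare" writes an editable header
# document, "add container" ingests it with no other flags.
import json

def prepare_variant_set(vcf_path):
    entry = populate_from_file(vcf_path)  # datamodel gathers the metadata
    header_path = vcf_path + ".ga4gh.json"
    with open(header_path, "w") as fh:
        json.dump(entry, fh, indent=2)  # edit this file to adjust metadata
    return header_path

def add_container(registry, header_path):
    with open(header_path) as fh:
        entry = json.load(fh)
    registry.insert(entry)  # repo manager only adds/removes containers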