ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

Technical requirements for API definition and implementation framework #323

Open diekhans opened 9 years ago

diekhans commented 9 years ago

The purpose of this ticket is to gather input for defining the missing technical requirements for the API schema definition and implementation (schema, methods, IDL, tool chain, etc). Please add your thoughts and identify the strength of each requirement as MUST, SHOULD, or NICE.

dglazer commented 9 years ago

Here's a first cut at requirements, based largely on what I remember us thinking the first time around. I don't expect the MUST's to be controversial; I do expect discussion on the content and priority of the SHOULDs, based on what we've learned to date.

Interface definition:

  • MUST: allows precise expression of the API interface, including objects and methods
  • SHOULD: does so in a machine-parseable way (i.e. the fewer comments like "this is optional" needed the better)

Wire format definition:

  • MUST: allows precise specification of one or more wire formats (e.g. JSON and some optimized-binary)
  • SHOULD: does so in a machine-parseable way, so implementors can automatically generate and parse network traffic

Implementation support:

  • MUST: doesn't require a particular implementation technology, on either the server or the client
  • SHOULD: includes optional supporting software to help with schema validation, wire handling, client libraries, etc.

General:

  • SHOULD: is widely-enough adopted that know-how (e.g. on StackOverflow and GitHub) is available
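As one concrete illustration of the supporting-software SHOULD: a schema published in a machine-parseable form can drive validation directly. Here is a minimal sketch using Python's jsonschema package; the read schema and payload are made up for illustration:

# Minimal sketch: validating a wire payload against a machine-parseable
# schema. The read schema and payload below are made up for illustration.
from jsonschema import ValidationError, validate

READ_SCHEMA = {
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "sequence": {"type": "string"},
        "position": {"type": "integer", "minimum": 0},
    },
    "required": ["id", "sequence"],
}

payload = {"id": "read-1", "sequence": "ACGT", "position": 1000}

try:
    validate(instance=payload, schema=READ_SCHEMA)
    print("payload conforms to schema")
except ValidationError as err:
    print("invalid payload:", err.message)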

haussler commented 9 years ago

This is very clean and on target.


ekg commented 9 years ago

While I appreciate the importance of a precise API and data format definition that is machine readable and testable, I worry that this is premature. Perhaps we can resolve this problem like so:

We MUST have at least one working implementation of a given feature (data format or API endpoint) before it is accepted into the public spec.

There SHOULD be at least two working implementations. In some cases this may not be practical, and a single implementation can be accepted.

It would be NICE if we didn't strive to make a single integrated schema for everything we are trying to do. In time one may evolve. But I think we need to simplify the process as much as possible or we are going to spend another year discussing the presumed benefits of various organizations of modules, objects, and interfaces.

It would be NICE if we focused our efforts around providing public solutions to clearly-defined problems. The test regions are a start, and I hope we continue in that direction.

Also NICE: solving only problems that have actually presented themselves. For example, if we propose an entirely novel data model for representing pan-genomes, there had better be real evidence that existing models don't work - meaning we tried to do things with them and found them impossible.

jeromekelleher commented 9 years ago

This is an excellent idea @diekhans, thanks for this. I agree with @dglazer's points.

I think we need to be more specific about several low-level things if we are going to have true interoperability:

  1. Schema versioning: we MUST define and document how the client and the server negotiate a shared version.
  2. Content type negotiation: similarly, we MUST formalise the wire formats we support and the mechanisms for clients and servers to negotiate a shared wire protocol.
  3. Error handling: we MUST formalise the documentation of error conditions and how they are communicated back to the client, including the appropriate HTTP status code. It would be NICE if this was done in a machine-parseable way, so that the expected error conditions for each method could be derived from the schema definitions. (A sketch of all three mechanisms follows below.)
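To make these three concrete, here is a minimal sketch of how negotiation and machine-parseable errors might look; the header name (GA4GH-API-Version), version numbers, and error codes are assumptions for illustration, not proposals:

# Minimal sketch of the three mechanisms above, using HTTP headers.
# The header names, versions, and error body are assumptions, not a spec.
SUPPORTED_VERSIONS = {"0.5.1", "0.6.0"}          # hypothetical schema versions
SUPPORTED_FORMATS = {"application/json", "application/x-protobuf"}

def negotiate(request_headers):
    """Return a dict with an HTTP status plus either the agreed
    (version, format) pair or a machine-parseable error."""
    version = request_headers.get("GA4GH-API-Version", "0.6.0")
    if version not in SUPPORTED_VERSIONS:
        return {"status": 406, "errorCode": "UNSUPPORTED_VERSION",
                "message": f"server supports {sorted(SUPPORTED_VERSIONS)}"}
    accept = request_headers.get("Accept", "application/json")
    fmt = next((f for f in SUPPORTED_FORMATS if f in accept), None)
    if fmt is None:
        return {"status": 406, "errorCode": "UNSUPPORTED_FORMAT",
                "message": f"server supports {sorted(SUPPORTED_FORMATS)}"}
    return {"status": 200, "version": version, "format": fmt}

print(negotiate({"GA4GH-API-Version": "0.6.0", "Accept": "application/json"}))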

bartgrantham commented 9 years ago

Is REST the API model here? Or are we shooting for an old-school RPC style, like Avro specifies? Something in between?

I think that the API model should be (re)asserted before we go any further, since it's the context in which an IDL should be evaluated. To put it in terms of the thread:

The "SHOULD" above is a pressure release valve for mismatches between IDLs and API models. For example, Avro is explicitly non-REST but its schema language is pretty nice, and the serialization format is very nice. There's no perfect IDL and it's important to be pragmatic, so I think an important product of this discussion is a concrete explanation how the IDL is and isn't expected to be used. So maybe:

(BTW, props to @dglazer for finding the discussion where this was originally hashed out)

diekhans commented 9 years ago

@bartgrantham one problem with asserting a ReST API is that there is no agreed-upon definition of ReSTful. It goes from simply 'stateless' to '1-to-1 correspondence with HTTP verbs'. We need to be very clear on what we are defining.

dglazer commented 9 years ago

There are several interesting topics getting mixed together here, imo -- sorry I can't be there in person this week to help tease them out. One high-level observation -- I believe we're having this conversation now because we have more than one team actively doing server development, which is a very good thing. It's finding places where we had differing implicit assumptions, and forcing us to make them explicit, which in turn will let more people down the road participate more easily. That's the whole point of standards and interoperability -- yay!

A few thoughts on the details:

diekhans commented 9 years ago

MUST - A compliant client or server must be fully implementable given only the GA4GH API specification and other referenced public, stable specifications.

pgrosu commented 9 years ago

I agree with all of the above, and it is great to see us push new boundaries such as gRPC, as recently described by @calbach in his white paper. But we need to proceed a little more systematically when we develop an API - especially for the cloud - as we also want it to be evolvable without breaking changes. I will go through each step in order, where each is labeled MUST, SHOULD, or NICE/OPTIONAL:

Defining Features and Goals [MUST HAVE]

Before considering the design and support architecture of an API, we want to list all the required features that the API provides - the more detailed the better. We should also list the goals for how it will be used and how those features fit together to provide the intended capabilities. This can include things such as:

1) The API should be scalable, with the ability to provide a throughput of 1 billion reads/second if necessary.
2) The API should contain validators.
3) The API should enable traceability.
4) The API should be configurable.
5) The API should enable pipelining, where one can pipe data from one API component into another.
6) The API should define data-structures for optimal querying from samples as inputs, with associated datasets as outputs.

Ideally this list should be as open to input and as flexible as possible, matching as closely as we can what GA4GH wants to accomplish via the API. Basically, we should not only define how the API should operate but also its capabilities, such as:

1) The API should provide the ability to request ad-hoc aggregation of ReadGroups into a ReadGroupSet for on-the-fly alignments.
2) The API should provide the ability to request all datasets where there is a high chromosome deletion.
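For illustration, the two capabilities above might surface as request bodies along these lines; every method and field name here is invented, not part of any agreed schema:

# Hypothetical request bodies for the two capabilities above; every
# method and field name is invented for illustration.
aggregate_request = {
    "method": "aggregateReadGroups",            # ad-hoc ReadGroupSet assembly
    "readGroupIds": ["rg-1", "rg-7", "rg-12"],
    "align": True,                              # request on-the-fly alignment
}

search_request = {
    "method": "searchDatasets",
    "filter": {"variantType": "DEL", "minLength": 1000000},  # large deletions
}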

Acceptance Criteria via Behaviour-Driven Development [MUST HAVE]

Before designing the data models for the API, it is important to consider the behavior that we would expect for each of these features. Here we should try to use a standard such as Gherkin syntax, where we define the scenarios and what we would expect:

Feature: Retrieving Reads
  Scenario: Retrieving reads using a dataset id
    Given a dataset id
    When it is retrieved
    Then a '200 OK' status is returned
    And the dataset is returned
    And it should have a dataset id
    And it should have a list of reads
    And it should be presented in JSON format using schema X (lines 12-19)
    And it should list the dataset id first
    And it should nest the reads as in JSON schema X (lines 42-49)
    ...

All of these behaviours have to be defined first before proceeding to the next sections.
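One way to keep such behaviours executable is to bind each Gherkin step to code. Here is a minimal sketch using the Python behave library (one possible BDD runner); get_dataset() is a hypothetical stand-in for whatever client the spec provides:

# Minimal sketch of binding the steps above to code with behave.
from behave import given, when, then

def get_dataset(dataset_id):
    """Hypothetical client call; a real binding would hit the API."""
    raise NotImplementedError

@given("a dataset id")
def step_given_dataset_id(context):
    context.dataset_id = "example-dataset-1"  # illustrative id

@when("it is retrieved")
def step_retrieve(context):
    context.response = get_dataset(context.dataset_id)

@then("a '200 OK' status is returned")
def step_check_status(context):
    assert context.response.status_code == 200

@then("it should have a list of reads")
def step_check_reads(context):
    assert isinstance(context.response.json()["reads"], list)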

Information Flow Model and Underlying Data Models [MUST HAVE]

Based on the above features, goals, and behaviours, we next define the Information Flow Models that would support those features. These can be in the form of UML diagrams, and through them we want to understand the flow of information and how the data models would connect to each other. We want to be sure that we do not have to make architectural changes later to support something we forgot but turn out to need. Ideally we limit ourselves to MUST/SHOULD. Any dependent domains and resources required by any of these models would need to be associated here and specified in their implementation.

For each component of the schema, a test has to be provided with the expected results. These would be just like unit tests using actual data - which can be referenced - and should include any additional definitions/descriptions regarding processing and the output. For example, say I define a read in the schema; then that component would include a test like this:

[READ TEST 1]
TEST DESCRIPTION: This test will take a BAM file and produce the JSON format 
                  as defined in schema X lines 12-19.
INPUT:   BAM file 
           [LOCATION: http://.../some_file.bam]
PROGRAM: test_read2json.py -input INPUT -output formatted_reads.json 
           [LOCATION: http://.../test_read2json.py]
OUTPUT:  formatted_reads.json 
           [LOCATION: http://.../formatted_reads.json]
DOCUMENTATION: [LOCATION: http://.../test_read2json.html]
EXAMPLES:      [LOCATION: http://.../test_read2json.html]
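For concreteness, test_read2json.py might look something like the sketch below, assuming the pysam library for BAM access; the output field names are illustrative only:

# Sketch of what test_read2json.py might look like; pysam is assumed
# for BAM access, and the output fields are illustrative only.
import argparse
import json

import pysam

def main():
    parser = argparse.ArgumentParser(description="Convert BAM reads to JSON.")
    parser.add_argument("-input", required=True, help="input BAM file")
    parser.add_argument("-output", required=True, help="output JSON file")
    args = parser.parse_args()

    reads = []
    with pysam.AlignmentFile(args.input, "rb") as bam:
        for rec in bam.fetch(until_eof=True):
            reads.append({
                "name": rec.query_name,
                "sequence": rec.query_sequence,
                "position": rec.reference_start,
            })
    with open(args.output, "w") as out:
        json.dump({"reads": reads}, out, indent=2)

if __name__ == "__main__":
    main()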

Here the decisions for Thrift, Protobuf, Avro, etc. have to be made to match the schema requirements and other necessary definitions (i.e. algorithms, etc.) to fully define and document the intent of the APIs. It is critical that any additions/enhancements to the schemas require at least one new test, and that all previous tests pass - including any security-based tests.

Architecture and Workflow
[NICE TO HAVE, or MUST if specifically requested in the Features/Goals section above]

Here it would be good to describe - in detail if possible - the components of the framework of the server architecture and that of the client, including their interaction. This should include information on the security and authentication scheme, covering resource and scope access. I'm very inclined to push the authentication scheme to a MUST, since this is ubiquitous among such APIs.

Error handling would explicitly be defined here. [MUST]

Resource Models [MUST HAVE]

Based on the above behaviours (BDD), we would need to define the required resource models which are:

Root Resource

These would be the points of contact to which clients would connect.

Discovery of Resources

These are the available points from which resource discovery would take place, and through which the API capabilities would be accessed. Each of these API calls will be requesting something to be processed (i.e. query, RPC, etc.).

Resources

These can be defined by the functionality they perform (i.e. search resource, item retrieval resource, collection group resource, etc). A detailed description for each would need to be provided, including the inputs and outputs. These would be like the test contracts above. Here the requirements for content-type negotiation and versioning would be explicitly defined.
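A root/discovery document could then take a shape like the sketch below; every path and field name is invented to illustrate the idea, not a proposed GA4GH layout:

# Hypothetical root/discovery document; all paths and fields invented.
import json

DISCOVERY_DOCUMENT = {
    "apiVersion": "0.6.0",
    "resources": {
        "reads":    {"search": "/reads/search",    "get": "/reads/{id}"},
        "variants": {"search": "/variants/search", "get": "/variants/{id}"},
        "datasets": {"search": "/datasets/search", "get": "/datasets/{id}"},
    },
    "wireFormats": ["application/json", "application/x-protobuf"],
}

print(json.dumps(DISCOVERY_DOCUMENT, indent=2))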

API Resource Style

For each API capability above, the following would need to be defined:

1) Contract Definition

a) What does the resource represent?
b) How does it fit with the Information Model?

2) API Classification (borrowed from the Richardson Maturity Model)

What type of resource is it and how is it classified? Below are the four levels of API classification:

Level 0: RPC oriented
Level 1: Resource oriented
Level 2: HTTP verbs
Level 3: Hypermedia

We can mix these levels if we find such a mix most optimal, though it would have to be justified and documented, as for any API component.
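To make the levels concrete, here are illustrative request shapes at three of them; the paths and bodies are invented for illustration:

# Illustrative request shapes at three of the levels above; all paths
# and bodies are invented for illustration.

# Level 0 (RPC oriented): a single endpoint, the verb lives in the body.
level0 = ("POST /api", {"method": "getRead", "params": {"id": "read-1"}})

# Level 2 (HTTP verbs): resources addressed by URL, semantics in the verb.
level2 = ("GET /reads/read-1", None)

# Level 3 (Hypermedia): the response links to related resources.
level3_response = {
    "id": "read-1",
    "links": {"self": "/reads/read-1", "readGroup": "/readgroups/rg-1"},
}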

3) Implementation Details and Wire Format

These would be the detailed representation of how it would function and the type of wire format it would utilize. Here would be the details and definitions of how it interacts and communicates among different components of the infrastructure, as well as within the information model. This would include how it would be transmitted with the specific media type being implemented (i.e. binary JSON, etc.), including the associated frameworks/protocols (i.e. HTTP/2, etc.), with reasoning on why it is most optimal. This should include information on why it was chosen, with implementation, throughput, and timing data provided subsequently (e.g. is the wire format self-describing, embedding metadata into the payload?). Each API - and its associated data model(s) - would need at least one test with the required sample data, as well as the acceptance criteria of the test. The API should be built with extensibility and evolvability in mind.
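The self-describing question in particular is easy to illustrate: the same record carries its field names in JSON but needs an out-of-band schema once packed into a binary. A minimal sketch:

# Sketch of the self-describing trade-off: the same record as JSON
# (field names travel in the payload) and as a packed binary that is
# unreadable without the external schema.
import json
import struct

record = {"position": 1000, "mapq": 60}

self_describing = json.dumps(record).encode()    # field names included
schema_dependent = struct.pack("<IB", 1000, 60)  # schema known out of band

print(len(self_describing), "bytes as JSON")            # ~30 bytes
print(len(schema_dependent), "bytes as packed binary")  # 5 bytes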

Testability [MUST]

This would include all the required testing criteria for unit testing and other types of tests of all the APIs. As noted above, any additions/enhancements to the schemas require at least one new test, and all previous tests must pass - including any security-based tests.

Hope this helps drive the discussion, Paul