ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

Technical requirements for API definition and implementation framework #323

Open diekhans opened 9 years ago

diekhans commented 9 years ago

The purpose of this ticket is to gather input for defining the missing technical requirements for the API schema definition and implementation (schema, methods, IDL, tool chain, etc). Please add your thoughts and identify the strength of each requirement as MUST, SHOULD, or NICE.

dglazer commented 9 years ago

Here's a first cut at requirements, based largely on what I remember us thinking the first time around. I don't expect the MUST's to be controversial; I do expect discussion on the content and priority of the SHOULDs, based on what we've learned to date.

Interface definition:

  • MUST: allows precise expression of the API interface, including objects and methods
  • SHOULD: does so in a machine-parseable way (i.e. the fewer comments like "this is optional" needed the better)

Wire format definition:

  • MUST: allows precise specification of one or more wire formats (e.g. JSON and some optimized-binary)
  • SHOULD: does so in a machine-parseable way, so implementors can automatically generate and parse network traffic

Implementation support:

  • MUST: doesn't require a particular implementation technology, on either the server or the client
  • SHOULD: includes optional supporting software to help with schema validation, wire handling, client libraries, etc.

General:

  • SHOULD: is widely-enough adopted that know-how (e.g. on StackOverflow and GitHub) is available
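As one concrete illustration of the supporting-software SHOULD: a schema published in a machine-parseable form can drive validation directly. Here is a minimal sketch using Python's jsonschema package; the read schema and payload are made up for illustration:

# Minimal sketch: validating a wire payload against a machine-parseable
# schema. The read schema and payload below are made up for illustration.
from jsonschema import ValidationError, validate

READ_SCHEMA = {
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "sequence": {"type": "string"},
        "position": {"type": "integer", "minimum": 0},
    },
    "required": ["id", "sequence"],
}

payload = {"id": "read-1", "sequence": "ACGT", "position": 1000}

try:
    validate(instance=payload, schema=READ_SCHEMA)
    print("payload conforms to schema")
except ValidationError as err:
    print("invalid payload:", err.message)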

haussler commented 9 years ago

This is very clean and on target.


ekg commented 9 years ago

While I appreciate the importance of a precise API and data format definition that is machine readable and testable, I worry that this is premature. Perhaps we can resolve this problem like so:

We MUST have at least one working implementation of a given feature (data format or API endpoint) before it is accepted into the public spec.

There SHOULD be at least two working implementations. In some cases this may not be practical, and a single implementation can be accepted.

It would be NICE if we didn't strive to make a single integrated schema for everything we are trying to do. In time one may evolve. But I think we need to simplify the process as much as possible or we are going to spend another year discussing the presumed benefits of various organizations of modules, objects, and interfaces.

It would be NICE if we focused our efforts around providing public solutions to clearly-defined problems. The test regions are a start, and I hope we continue in that direction.

Also NICE: solving only problems that have actually presented themselves. For example, if we propose an entirely novel data model for representing pan-genomes, there had better be real evidence that existing models don't work - meaning we tried to do things with them and found them impossible.

jeromekelleher commented 9 years ago

This is an excellent idea @diekhans, thanks for this. I agree with @dglazer's points.

I think we need to be more specific about several low-level things if we are going to have true interoperability:

  1. Schema versioning: we MUST define and document how the client and the server negotiate a shared version.
  2. Content type negotiation: similarly, we MUST formalise the wire formats we support and the mechanisms for clients and servers to negotiate a shared wire protocol.
  3. Error handling: we MUST formalise the documentation of error conditions and how they are communicated back to the client, including the appropriate HTTP status code. It would be NICE if this was done in a machine-parseable way, so that the expected error conditions for each method could be derived from the schema definitions. (A sketch of all three mechanisms follows below.)
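To make these three concrete, here is a minimal sketch of how negotiation and machine-parseable errors might look; the header name (GA4GH-API-Version), version numbers, and error codes are assumptions for illustration, not proposals:

# Minimal sketch of the three mechanisms above, using HTTP headers.
# The header names, versions, and error body are assumptions, not a spec.
SUPPORTED_VERSIONS = {"0.5.1", "0.6.0"}          # hypothetical schema versions
SUPPORTED_FORMATS = {"application/json", "application/x-protobuf"}

def negotiate(request_headers):
    """Return a dict with an HTTP status plus either the agreed
    (version, format) pair or a machine-parseable error."""
    version = request_headers.get("GA4GH-API-Version", "0.6.0")
    if version not in SUPPORTED_VERSIONS:
        return {"status": 406, "errorCode": "UNSUPPORTED_VERSION",
                "message": f"server supports {sorted(SUPPORTED_VERSIONS)}"}
    accept = request_headers.get("Accept", "application/json")
    fmt = next((f for f in SUPPORTED_FORMATS if f in accept), None)
    if fmt is None:
        return {"status": 406, "errorCode": "UNSUPPORTED_FORMAT",
                "message": f"server supports {sorted(SUPPORTED_FORMATS)}"}
    return {"status": 200, "version": version, "format": fmt}

print(negotiate({"GA4GH-API-Version": "0.6.0", "Accept": "application/json"}))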

bartgrantham commented 9 years ago

Is REST the API model here? Or are we shooting for an old-school RPC style, like Avro specifies? Something in between?

I think that the API model should be (re)asserted before we go any further, since it's the context in which an IDL should be evaluated. To put it in terms of the thread:

The "SHOULD" above is a pressure release valve for mismatches between IDLs and API models. For example, Avro is explicitly non-REST but its schema language is pretty nice, and the serialization format is very nice. There's no perfect IDL and it's important to be pragmatic, so I think an important product of this discussion is a concrete explanation how the IDL is and isn't expected to be used. So maybe:

(BTW, props to @dglazer for finding the discussion where this was originally hashed out)

diekhans commented 9 years ago

@bartgrantham one problem with asserting a ReST API is that there is no agreed-upon definition of ReSTful. It goes from simply 'stateless' to '1-to-1 correspondence with HTTP verbs'. We need to be very clear on what we are defining.

dglazer commented 9 years ago

There are several interesting topics getting mixed together here, imo -- sorry I can't be there in person this week to help tease them out. One high-level observation -- I believe we're having this conversation now because we have more than one team actively doing server development, which is a very good thing. It's finding places where we had differing implicit assumptions, and forcing us to make them explicit, which in turn will let more people down the road participate more easily. That's the whole point of standards and interoperability -- yay!

A few thoughts on the details:

diekhans commented 9 years ago

MUST - A compliant client or server must be fully implementable given only the GA4GH API specification and other referenced public, stable specifications.

pgrosu commented 9 years ago

I agree with all of the above, and it is great to see us push new boundaries such as gRPC, as recently described by @calbach in his white paper. But we need to proceed a little more systematically when we develop an API - especially for the cloud - as we also want it to be evolvable without breaking changes. I will go through each step in order, where each is labeled MUST, SHOULD, or NICE/OPTIONAL:

Defining Features and Goals [MUST HAVE]

Before considering the design and support architecture of an API, we want to list all the required features that the API provides - the more detailed the better. We should also list the goals for how it will be used and how those features fit together to provide the intended capabilities. This can include things such as:

1) The API should be scalable, with the ability to provide a throughput of 1 billion reads/second if necessary.
2) The API should contain validators.
3) The API should enable traceability.
4) The API should be configurable.
5) The API should enable pipelining, where one can pipe data from one API component into another.
6) The API should define data-structures for optimal querying from samples as inputs, with associated datasets as outputs.

Ideally this list should be as open to input and as flexible as possible, matching as closely as we can what GA4GH wants to accomplish via the API. Basically, we should not only define how the API should operate but also its capabilities, such as:

1) The API should provide the ability to request ad-hoc aggregation of ReadGroups into a ReadGroupSet for on-the-fly alignments.
2) The API should provide the ability to request all datasets where there is a high chromosome deletion.
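For illustration, the two capabilities above might surface as request bodies along these lines; every method and field name here is invented, not part of any agreed schema:

# Hypothetical request bodies for the two capabilities above; every
# method and field name is invented for illustration.
aggregate_request = {
    "method": "aggregateReadGroups",            # ad-hoc ReadGroupSet assembly
    "readGroupIds": ["rg-1", "rg-7", "rg-12"],
    "align": True,                              # request on-the-fly alignment
}

search_request = {
    "method": "searchDatasets",
    "filter": {"variantType": "DEL", "minLength": 1000000},  # large deletions
}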

Acceptance Criteria via Behaviour-Driven Development [MUST HAVE]

Before designing the data models for the API, it is important to consider the behavior that we would expect for each of these features. Here we should try to use a standard such as Gherkin syntax, where we define the scenarios and what we would expect:

Feature: Retrieving Reads
  Scenario: Retrieving reads using a dataset id
    Given a dataset id
    When it is retrieved
    Then a '200 OK' status is returned
    And the dataset is returned
    And it should have a dataset id
    And it should have a list of reads
    And it should be presented in JSON format using schema X (lines 12-19)
    And it should list the dataset id first
    And it should nest the reads as in JSON schema X (lines 42-49)
    ...

All of these behaviours have to be defined first before proceeding to the next sections.
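One way to keep such behaviours executable is to bind each Gherkin step to code. Here is a minimal sketch using the Python behave library (one possible BDD runner); get_dataset() is a hypothetical stand-in for whatever client the spec provides:

# Minimal sketch of binding the steps above to code with behave.
from behave import given, when, then

def get_dataset(dataset_id):
    """Hypothetical client call; a real binding would hit the API."""
    raise NotImplementedError

@given("a dataset id")
def step_given_dataset_id(context):
    context.dataset_id = "example-dataset-1"  # illustrative id

@when("it is retrieved")
def step_retrieve(context):
    context.response = get_dataset(context.dataset_id)

@then("a '200 OK' status is returned")
def step_check_status(context):
    assert context.response.status_code == 200

@then("it should have a list of reads")
def step_check_reads(context):
    assert isinstance(context.response.json()["reads"], list)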

Information Flow Model and Underlying Data Models [MUST HAVE]

Based on the above features, goals, and behaviours, we next define the Information Flow Models that would support those features. These can be in the form of UML diagrams, and through them we want to understand the flow of information and how the data models would connect to each other. We want to be sure that we do not have to make architectural changes later to support something we forgot but turn out to need. Ideally we limit ourselves to MUST/SHOULD. Any dependent domains and resources required by any of these models would need to be associated here and specified in their implementation.

For each component of the schema, a test has to be provided with the expected results. These would be just like unit tests using actual data - which can be referenced - and should include any additional definitions/descriptions regarding processing and the output. For example, say I define a read in the schema; then that component would include a test like this:

[READ TEST 1]
TEST DESCRIPTION: This test will take a BAM file and produce the JSON format 
                  as defined in schema X lines 12-19.
INPUT:   BAM file 
           [LOCATION: http://.../some_file.bam]
PROGRAM: test_read2json.py -input INPUT -output formatted_reads.json 
           [LOCATION: http://.../test_read2json.py]
OUTPUT:  formatted_reads.json 
           [LOCATION: http://.../formatted_reads.json]
DOCUMENTATION: [LOCATION: http://.../test_read2json.html]
EXAMPLES:      [LOCATION: http://.../test_read2json.html]
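For concreteness, test_read2json.py might look something like the sketch below, assuming the pysam library for BAM access; the output field names are illustrative only:

# Sketch of what test_read2json.py might look like; pysam is assumed
# for BAM access, and the output fields are illustrative only.
import argparse
import json

import pysam

def main():
    parser = argparse.ArgumentParser(description="Convert BAM reads to JSON.")
    parser.add_argument("-input", required=True, help="input BAM file")
    parser.add_argument("-output", required=True, help="output JSON file")
    args = parser.parse_args()

    reads = []
    with pysam.AlignmentFile(args.input, "rb") as bam:
        for rec in bam.fetch(until_eof=True):
            reads.append({
                "name": rec.query_name,
                "sequence": rec.query_sequence,
                "position": rec.reference_start,
            })
    with open(args.output, "w") as out:
        json.dump({"reads": reads}, out, indent=2)

if __name__ == "__main__":
    main()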

Here the decisions for Thrift, Protobuf, Avro, etc. have to be made to match the schema requirements and other necessary definitions (i.e. algorithms, etc.) to fully define and document the intent of the APIs. It is critical that any additions/enhancements to the schemas require at least one new test, and that all previous tests pass - including any security-based tests.

Architecture and Workflow
[NICE TO HAVE, or MUST if specifically requested in the Features/Goals section above]

Here it would be good to describe - in detail if possible - the components of the framework of the server architecture and that of the client, including their interaction. This should include information on the security and authentication scheme, covering resource and scope access. I'm very inclined to push the authentication scheme to a MUST, since this is ubiquitous among such APIs.

Error handling would explicitly be defined here. [MUST]

Resource Models [MUST HAVE]

Based on the above behaviours (BDD), we would need to define the required resource models which are:

Root Resource

These would be the points of contact to which clients would connect.

Discovery of Resources

These are the available points from which resource discovery would take place, and through which the API capabilities would be accessed. Each of these API calls will be requesting something to be processed (i.e. query, RPC, etc.).

Resources

These can be defined by the functionality they perform (i.e. search resource, item retrieval resource, collection group resource, etc). A detailed description for each would need to be provided, including the inputs and outputs. These would be like the test contracts above. Here the requirements for content-type negotiation and versioning would be explicitly defined.
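A root/discovery document could then take a shape like the sketch below; every path and field name is invented to illustrate the idea, not a proposed GA4GH layout:

# Hypothetical root/discovery document; all paths and fields invented.
import json

DISCOVERY_DOCUMENT = {
    "apiVersion": "0.6.0",
    "resources": {
        "reads":    {"search": "/reads/search",    "get": "/reads/{id}"},
        "variants": {"search": "/variants/search", "get": "/variants/{id}"},
        "datasets": {"search": "/datasets/search", "get": "/datasets/{id}"},
    },
    "wireFormats": ["application/json", "application/x-protobuf"],
}

print(json.dumps(DISCOVERY_DOCUMENT, indent=2))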

API Resource Style

For each API capability above, the following would need to be defined:

1) Contract Definition

a) What does the resource represent?
b) How does it fit with the Information Model?

2) API Classification (borrowed from the Richardson Maturity Model)

What type of resource is it and how is it classified? Below are the four levels of API classification:

Level 0: RPC oriented
Level 1: Resource oriented
Level 2: HTTP verbs
Level 3: Hypermedia

We can mix these levels if we find such a mix most optimal, though it would have to be justified and documented, as for any API component.
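To make the levels concrete, here are illustrative request shapes at three of them; the paths and bodies are invented for illustration:

# Illustrative request shapes at three of the levels above; all paths
# and bodies are invented for illustration.

# Level 0 (RPC oriented): a single endpoint, the verb lives in the body.
level0 = ("POST /api", {"method": "getRead", "params": {"id": "read-1"}})

# Level 2 (HTTP verbs): resources addressed by URL, semantics in the verb.
level2 = ("GET /reads/read-1", None)

# Level 3 (Hypermedia): the response links to related resources.
level3_response = {
    "id": "read-1",
    "links": {"self": "/reads/read-1", "readGroup": "/readgroups/rg-1"},
}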

3) Implementation Details and Wire Format

These would be the detailed representation of how it would function and the type of wire format it would utilize. Here would be the details and definitions of how it interacts and communicates among different components of the infrastructure, as well as within the information model. This would include how it would be transmitted with the specific media type being implemented (i.e. binary JSON, etc.), including the associated frameworks/protocols (i.e. HTTP/2, etc.), with reasoning on why it is most optimal. This should include information on why it was chosen, with implementation, throughput, and timing data provided subsequently (e.g. is the wire format self-describing, embedding metadata into the payload?). Each API - and its associated data model(s) - would need at least one test with the required sample data, as well as the acceptance criteria of the test. The API should be built with extensibility and evolvability in mind.
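The self-describing question in particular is easy to illustrate: the same record carries its field names in JSON but needs an out-of-band schema once packed into a binary. A minimal sketch:

# Sketch of the self-describing trade-off: the same record as JSON
# (field names travel in the payload) and as a packed binary that is
# unreadable without the external schema.
import json
import struct

record = {"position": 1000, "mapq": 60}

self_describing = json.dumps(record).encode()    # field names included
schema_dependent = struct.pack("<IB", 1000, 60)  # schema known out of band

print(len(self_describing), "bytes as JSON")            # ~30 bytes
print(len(schema_dependent), "bytes as packed binary")  # 5 bytes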

Testability [MUST]

This would include all the required testing criteria for unit testing and other types of tests of all the APIs. As noted above, any additions/enhancements to the schemas require at least one new test, and all previous tests must pass - including any security-based tests.

Hope this helps drive the discussion, Paul