diekhans opened this issue 9 years ago
Here's a first cut at requirements, based largely on what I remember us thinking the first time around. I don't expect the MUST's to be controversial; I do expect discussion on the content and priority of the SHOULDs, based on what we've learned to date.
Interface definition:
- MUST: allows precise expression of the API interface, including objects and methods
- SHOULD: does so in a machine-parseable way (i.e. the fewer comments like "this is optional" needed the better)
Wire format definition:
- MUST: allows precise specification of one or more wire formats (e.g. JSON and some optimized-binary)
- SHOULD: does so in a machine-parseable way, so implementors can automatically generate and parse network traffic
Implementation support:
- MUST: doesn't require a particular implementation technology, on either the server or the client
- SHOULD: includes optional supporting software to help with schema validation, wire handling, client libraries, etc.
General:
- SHOULD: is widely-enough adopted that knowhow (e.g. on StackOverflow and Github) is available
This is very clean and on target.
On Sun, Jun 7, 2015 at 6:03 AM, David Glazer notifications@github.com wrote:
Here's a first cut at requirements, based largely on what I remember us thinking the first time around. I don't expect the MUST's to be controversial; I do expect discussion on the content and priority of the SHOULDs, based on what we've learned to date.
Interface definition:
- MUST: allows precise expression of the API interface, including objects and methods
- SHOULD: does so in a machine-parseable way (i.e. the fewer comments like "this is optional" needed the better)
Wire format definition:
- MUST: allows precise specification of one or more wire formats (e.g. JSON and some optimized-binary)
- SHOULD: does so in a machine-parseable way, so implementors can automatically generate and parse network traffic
Implementation support:
- MUST: doesn't require a particular implementation technology, on either the server or the client
- SHOULD: includes optional supporting software to help with schema validation, wire handling, client libraries, etc.
General:
- SHOULD: is widely-enough adopted that knowhow (e.g. on StackOverflow and Github) is available
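To make the "machine-parseable" SHOULD concrete: when the interface definition is expressed as data rather than as comments, message validation can be automated. A minimal sketch in Python; the schema shape and field names here are hypothetical illustrations, not taken from any actual GA4GH schema:

```python
# Minimal illustration of machine-parseable interface definitions:
# required/optional fields are data, so a generic validator can check
# any message. Field names (id, sampleId, description) are hypothetical.

READ_GROUP_SCHEMA = {
    "required": {"id", "sampleId"},
    "optional": {"description"},
}

def validate(message, schema):
    """Return a list of validation errors (empty list means valid)."""
    errors = []
    missing = schema["required"] - message.keys()
    if missing:
        errors.append("missing required fields: %s" % sorted(missing))
    unknown = message.keys() - schema["required"] - schema["optional"]
    if unknown:
        errors.append("unknown fields: %s" % sorted(unknown))
    return errors

print(validate({"id": "rg1", "sampleId": "s1"}, READ_GROUP_SCHEMA))  # []
print(validate({"id": "rg1", "bogus": 1}, READ_GROUP_SCHEMA))
```

The same idea is what tooling around Avro, Protobuf, or JSON Schema provides off the shelf; the point of the SHOULD is that the schema, not prose comments, drives the check.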
While I appreciate the importance of a precise API and data format definition that is machine readable and testable, I worry that this is premature. Perhaps we can resolve this problem like so:
We MUST have at least one working implementation of a given feature (data format or API endpoint) before it is accepted into the public spec.
There SHOULD be at least two working implementations. In some cases this may not be practical, and a single implementation can be accepted.
It would be NICE if we didn't strive to make a single integrated schema for everything we are trying to do. In time one may evolve. But I think we need to simplify the process as much as possible or we are going to spend another year discussing the presumed benefits of various organizations of modules, objects, and interfaces.
It would be NICE if we focused our efforts around providing public solutions to clearly-defined problems. The test regions are a start, and I hope we continue in that direction.
Also NICE: solving only problems that have presented themselves. For example, if we propose an entirely novel data model for representing pan-genomes, there better be real evidence that existing models don't work. That would mean we tried to do things with them and found they are impossible.
This is an excellent idea @diekhans, thanks for this. I agree with @dglazer's points.
I think we need to be more specific about several low-level things if we are going to have true interoperability:
Is REST the API model here? Or are we shooting for an old-school RPC style, like Avro specifies? Something in between?
I think that the API model should be (re)asserted before we go any further, since it's the context in which an IDL should be evaluated. To put it in terms of the thread:
The "SHOULD" above is a pressure release valve for mismatches between IDLs and API models. For example, Avro is explicitly non-REST but its schema language is pretty nice, and the serialization format is very nice. There's no perfect IDL and it's important to be pragmatic, so I think an important product of this discussion is a concrete explanation of how the IDL is and isn't expected to be used. So maybe:
(BTW, props to @dglazer for finding the discussion where this was originally hashed out)
@bartgrantham one problem with asserting a ReST API is that there is no agreed-upon definition of ReSTful. It goes from 'simply stateless' to '1-to-1 correspondence to HTTP verbs'. We need to be very clear on what we are defining.
There are several interesting topics getting mixed together here, imo -- sorry I can't be there in person this week to help tease them out. One high-level observation -- I believe we're having this conversation now because we have more than one team actively doing server development, which is a very good thing. It's finding places we had differing implicit assumptions, and forcing us to make them explicit, which in turn will let more people down the road participate more easily. That's the whole point of standards and interoperability -- yay!
A few thoughts on the details:
MUST - A compliant client or server must be fully implementable given only the GA4GH API specification and other referenced public, stable specifications.
I agree with all of the above, and it is great to see us push new boundaries such as gRPC, as recently written up by @calbach in his white paper. But we need to progress a little more systematically when we develop an API - especially for the cloud - as we also want it to be evolvable without causing breaking changes. I will go through each step in order, where each is labeled MUST, SHOULD, or NICE/OPTIONAL:
Before considering the design and support architecture of an API, we want to list all the required features that this API provides - the more detailed the better. We also should list the goals of how it will be used and how those features fit in to provide these intended opportunities. This can include things such as:
1) The API should be scalable, with the ability to provide a throughput of 1 billion reads/second if necessary.
2) The API should contain validators.
3) The API should enable traceability.
4) The API should be configurable.
5) The API should enable pipelining, where one can pipe data from one API component into another.
6) The API should define data structures for optimal querying from samples as inputs, with associated datasets as outputs.
Ideally this should be as open to input and as flexible as possible, matching as closely as we can what GA4GH wants to accomplish in its goals for the API. Basically we should define not only how the API should operate but also its capabilities, such as:
1) The API should provide the ability to request ad-hoc aggregation of ReadGroups into a ReadGroupSet for on-the-fly alignments.
2) The API should provide the ability to request all datasets where there is a high chromosome deletion.
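One way to keep such capability statements concrete is to write each one down as the request a client would actually issue, before any data model is fixed. A hedged sketch for the first capability; the endpoint path, parameter names, and request shape are all hypothetical:

```python
# Hypothetical client-side request builder for the ad-hoc ReadGroup
# aggregation capability. Nothing here is a real GA4GH endpoint; the
# point is that a capability should be expressible as a concrete call.
import json

def build_aggregation_request(read_group_ids, base_url="http://example.org/v1"):
    """Build a (hypothetical) request aggregating ReadGroups into a ReadGroupSet."""
    return {
        "method": "POST",
        "url": base_url + "/readgroupsets/aggregate",
        "body": json.dumps({"readGroupIds": list(read_group_ids)}),
    }

req = build_aggregation_request(["rg1", "rg2"])
print(req["method"], req["url"])
```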
Before designing the data models for the API, it is important to consider the behavior that we would expect for each of these features. Here we should try to use a standard such as Gherkin syntax, where we define the scenarios and what we would expect:
Feature: Retrieving Reads
Scenario: Retrieving reads using a dataset id
Given a dataset id
When it is retrieved
Then a '200 OK' status is returned
Then it is returned
It should have a dataset id
It should have a list of reads
It should be presented in JSON format using schema X (lines 12-19)
It should list the dataset id first
It should nest the reads as in JSON schema X (lines 42-49)
...
All of these behaviours have to be defined first before proceeding to the next sections.
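A Gherkin runner such as behave would normally bind those steps to code, but even a plain test shows how the scenario above becomes executable. In this sketch the server is faked with a local function; in a real suite the "retrieve" step would issue an HTTP request, and all names are illustrative:

```python
# Executable sketch of the "Retrieving reads using a dataset id" scenario.
# fake_server_get_dataset stands in for GET /datasets/<id>; real step
# definitions would call the server over the network.
import json

def fake_server_get_dataset(dataset_id):
    """Stand-in for the server; returns (status, JSON body)."""
    body = {"datasetId": dataset_id, "reads": [{"id": "r1"}, {"id": "r2"}]}
    return 200, json.dumps(body)

def test_retrieving_reads_by_dataset_id():
    status, raw = fake_server_get_dataset("ds1")   # Given a dataset id / When retrieved
    assert status == 200                           # Then a '200 OK' status is returned
    payload = json.loads(raw)                      # It should be presented in JSON
    assert payload["datasetId"] == "ds1"           # It should have a dataset id
    assert isinstance(payload["reads"], list)      # It should have a list of reads

test_retrieving_reads_by_dataset_id()
print("scenario passed")
```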
Based on the above features, goals, and behaviours, we next define the Information Flow Models that would support those features. These can be in the form of UML diagrams, and through them we want to understand the flow of information and how the data models would be connected to each other. We want to be sure that we do not have to make architectural changes later to support something we forgot but then realized we needed. Ideally we limit ourselves to MUST/SHOULD. Any dependent domains and resources required by any of these models would need to be associated here and specified in their implementation.
For each component of schema, a test has to be provided with the expected results. These would be just like unit tests using actual data - which can be referenced - and should include any additional definition/descriptions regarding processing and the output. For example, let's say I define a read in the schema, then that component would include a test like this:
[READ TEST 1]
TEST DESCRIPTION: This test will take a BAM file and produce the JSON format
as defined in schema X lines 12-19.
INPUT: BAM file
[LOCATION: http://.../some_file.bam]
PROGRAM: test_read2json.py -input INPUT -output formatted_reads.json
[LOCATION: http://.../test_read2json.py]
OUTPUT: formatted_reads.json
[LOCATION: http://.../formatted_reads.json]
DOCUMENTATION: [LOCATION: http://.../test_read2json.html]
EXAMPLES: [LOCATION: http://.../test_read2json.html]
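A simplified stand-in for the READ TEST contract above. Parsing a real BAM file would require a library such as pysam, so here a plain record stands in for a BAM read, and the target JSON layout is invented for illustration (the actual layout would be whatever "schema X lines 12-19" specifies):

```python
# Toy version of test_read2json: convert one read record to the JSON
# wire format. The field names (readName, alignedSequence, position)
# are hypothetical; a real test would use the agreed schema.
import json

def read_to_json(name, sequence, position):
    """Serialize one (hypothetical) read record as a JSON string."""
    return json.dumps(
        {"readName": name, "alignedSequence": sequence, "position": position},
        sort_keys=True,
    )

out = read_to_json("read/1", "ACGT", 1000)
print(out)
```

The value of the contract is less the code than the fixed input/output pair: any schema change that breaks `formatted_reads.json` breaks the test.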
Here the appropriate decisions for Thrift, Protobuf, Avro, etc. have to be made to match the schema requirements and other necessary definitions (e.g. algorithms) to fully define and document the intent of the APIs. It is critical that any additions/enhancements to the schemas require at least one test to be added, and all previous tests need to pass - including any security-based tests.
Here it would be good to describe - in detail if possible - the components of the framework of the server architecture and that of the client, including their interaction. This should include information regarding the security and authentication scheme governing resource and scope access. I'm very inclined to push the authentication scheme to a MUST, since this is ubiquitous among such APIs.
Error handling would explicitly be defined here. [MUST]
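As one possible shape for that MUST, assuming a JSON wire format: every error pairs an HTTP status with a uniform, machine-readable body. The field names (`errorCode`, `message`) are hypothetical, not from any agreed schema:

```python
# Sketch of an explicit error envelope so clients never have to parse
# free-form error text. Field names here are illustrative only.
import json

def error_response(http_status, error_code, message):
    """Build a uniform error body paired with an HTTP status code."""
    body = {"errorCode": error_code, "message": message}
    return http_status, json.dumps(body)

status, body = error_response(404, "datasetNotFound", "no dataset with id ds9")
print(status, body)
```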
Based on the above behaviours (BDD), we would need to define the required resource models which are:
Root Resource
These would be the point of contact to which the clients would connect.
Discovery of Resources
These are the available points from which resource discovery would take place, and which would be available for accessing the API capabilities. Each of these API calls will be requesting something to be processed (e.g. a query, an RPC call, etc.).
Resources
These can be defined by the functionality they perform (e.g. search resource, item retrieval resource, collection group resource, etc.). A detailed description for each would need to be provided, including the inputs and outputs. These would be like the test contracts above. Here the requirements for content-type negotiation and versioning would be explicitly defined.
API Resource Style
With each API capability above the following would need to be defined:
1) Contract Definition
a) What does the resource represent? b) How does it fit with the Information Model?
2) API Classification (borrowed from the Richardson Maturity Model)
What type of resource is it and how is it classified? Below are the four types of levels for API classification:
Level 0: RPC oriented
Level 1: Resource oriented
Level 2: HTTP verbs
Level 3: Hypermedia
We can mix levels where we find such a mix most optimal, though it would have to be justified and documented, as for any API component.
3) Implementation Details and Wire Format
These would be the detailed representation of how it would function and the type of wire format it would utilize. Here would be the details and definitions of how it interacts and communicates among the different components of the infrastructure, as well as within the information model. This would include how it would be transmitted with the specific media type being implemented (e.g. binary JSON), including the associated frameworks/protocols (e.g. HTTP/2), with reasoning on why it is most optimal. This should include information regarding why it was chosen, with implementation, throughput, and timing provided subsequently (e.g. is the wire format self-describing, embedding metadata into the payload?). Each API - and its associated data model(s) - would need to be associated with at least one test with the required sample data, as well as the acceptance criteria of the test. The API should be built with extensibility and evolvability in mind.
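The self-describing question can be made tangible with a rough size comparison: JSON embeds the field names in every message, while a fixed binary layout (sketched here with the stdlib `struct` module) does not, at the cost of needing the schema out of band. The record below is an arbitrary example, not any GA4GH format:

```python
# Rough sketch of the self-describing vs. compact wire-format trade-off.
# The JSON encoding carries its own field names; the struct encoding is
# a fixed layout (8-byte little-endian int + 1 unsigned byte) that only
# makes sense alongside an external schema.
import json
import struct

record = {"position": 123456789, "mappingQuality": 60}

json_bytes = json.dumps(record).encode("utf-8")        # self-describing
binary_bytes = struct.pack("<qB", record["position"],  # schema out of band
                           record["mappingQuality"])

print(len(json_bytes), len(binary_bytes))
```

Formats like Avro split the difference by shipping the schema once rather than per message, which is part of why its serialization came up earlier in the thread.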
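Stepping back to the API Classification levels above, the difference between the levels is easiest to see on one operation expressed two ways. A sketch for a hypothetical "delete dataset" operation; paths and payloads are invented:

```python
# The same operation at two Richardson Maturity Model levels.
# All paths and body shapes are hypothetical.

def level0_request(dataset_id):
    # Level 0 (RPC oriented): a single endpoint, operation named in the body.
    return {"method": "POST", "path": "/api",
            "body": {"operation": "deleteDataset", "id": dataset_id}}

def level2_request(dataset_id):
    # Level 2 (HTTP verbs): the verb carries the operation, the URL the resource.
    return {"method": "DELETE", "path": "/datasets/%s" % dataset_id, "body": None}

print(level0_request("ds1")["method"], level2_request("ds1")["method"])
```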
This would include all the required testing criteria for unit testing and other types of tests of all the APIs. It is critical that any additions/enhancements to the schemas would require at least one test to be added and all previous tests would need to pass - including any security-based tests.
Hope it helps drive the discussion, Paul
The purpose of this ticket is to gather input for defining the missing technical requirements for the API schema definition and implementation (schema, methods, IDL, tool chain, etc). Please add your thoughts and identify the strength of each requirement as MUST, SHOULD, or NICE.