ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

It's not clear that the avro schema files are not required to be used in implementations ( / Is AVRO the right tool for the job?) #287

Closed Relequestual closed 7 years ago

Relequestual commented 9 years ago

I joined the Matchmaker Exchange group late last year. After reading up on the work done by GA4GH, I found the Avro schema files confusing.

After several working group calls and many emails, I managed to establish that the Avro files themselves were not intended to be used in actual implementations (Cassie set me straight). One sort of exception to this is that the Avro files were used to generate classes for the GA4GH server reference implementation.

(I don't want to get into a debate here over what we're using to document the schemas; Avro is better at describing a whole protocol, but I'm not convinced it's best for defining and validating JSON.)

After discussions with others implementing the MME API, there's some confusion over the intention of the Avro schema files. We should make it clear that they are for documentation, and that you do not NEED to use them to develop your implementation of the GA4GH APIs.
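For illustration, here's a minimal sketch of what an implementation that ignores the Avro files entirely could look like - plain JSON over HTTP, no Avro tooling involved. The endpoint URL and field names below are hypothetical, not taken from the spec:

import json
import urllib.request

# Hypothetical GA4GH-style search request expressed as plain JSON.
payload = {
    "variantSetIds": ["example-variant-set"],
    "referenceName": "1",
    "start": 10000,
    "end": 20000,
}
req = urllib.request.Request(
    "https://example.org/ga4gh/variants/search",  # hypothetical endpoint
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    # The response is ordinary JSON; no Avro library is needed to consume it.
    results = json.load(resp)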

delagoya commented 9 years ago

@Relequestual how would you want to promote that message (without losing the fact that the Avro schemas define the specification)? Would you be willing to propose a PR with documentation changes?

diekhans commented 9 years ago

We have a bioinformatics contractor for six weeks, starting today, who is going to be working on improving the GA4GH schema documentation. The main goal is to produce overview documentation that will be linked to the documentation in the Avro files.

If there are particular documentation issues that need to be considered, please write an issue and add the Documentation tag.

bartgrantham commented 9 years ago

I feel your pain, @Relequestual. I did a deep dive into Avro recently only to later discover that it's largely being used as a schema definition language, not so much for its data serialization or (thankfully!) RPC. This long thread has some discussion about the use of Avro and its limitations.

Relequestual commented 9 years ago

@delagoya I know it's sometimes considered bad form to report an issue without a proposed solution, but I don't really have any concrete ideas here. I felt the issue still needed to be recognised, though.

I'm using JSON Schema to define the "over the wire" JSON format for the MME project, but I'm reluctant to push for this when support for the nicer features (the newer draft of JSON Schema) isn't there for Perl yet. I'd like to spend some time updating the JSON Schema module (it's not mine, but I now have access to update it), but it's not a simple task, and other things demand my time currently.
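To give a concrete flavour of the JSON Schema approach - this is not the actual MME schema, just a sketch with made-up field names, using the Python jsonschema package to validate an incoming request:

from jsonschema import ValidationError, validate

# Illustrative schema only; field names are invented for the example.
patient_schema = {
    "type": "object",
    "required": ["id", "contact"],
    "properties": {
        "id": {"type": "string"},
        "contact": {
            "type": "object",
            "required": ["name", "href"],
            "properties": {
                "name": {"type": "string"},
                "href": {"type": "string"},
            },
        },
    },
}

request_body = {"id": "P0001", "contact": {"name": "A. Clinician", "href": "mailto:a@example.org"}}
try:
    validate(request_body, patient_schema)  # raises if the JSON doesn't match the schema
except ValidationError as err:
    print("request rejected:", err.message)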

@bartgrantham I'm not sure what the best option is. I don't have vast experience with schema definition languages. I've seen some work by the FAIR DATA project involving JSON Schema and JSON-LD, which I think is VERY interesting.

What would be nice is if we could look at other projects that have had a similar technical goal, and see what happened and why. I'm not aware of such a project, however, at least not in the current age.

diekhans commented 9 years ago

The goal of Avro is implementation of data serialization.

It would be really helpful if we could understand why it is believed that the intention of Avro in GA4GH is not for implementation.

This is very much not the case. Avro was chosen to develop schemas for data exchange. Due to its goal of being a language-neutral API, its functionality is limited and it's not ideal as a data modeling tool.

The goal of GA4GH is not data-in-JSON. Avro has a more efficient binary encoding, and this will be an alternative for exchange on some servers. I believe Google has provided, or will provide, ProtoBuf support.

dglazer commented 9 years ago

@diekhans , while I agree that Avro itself includes data serialization, my recollection is that, when we (the DWG) chose to use Avro for schema definition, we explicitly said that was not a decision about implementation or wire protocol; we were only choosing a format for "the syntax the task team will use for our discussions" of schema.

We can certainly decide to make a new choice going forward, and we've been running some very promising experiments with alternate wire formats recently, but afaik, every existing running implementation and client of the GA4GH APIs, including the compliance test, uses JSON-over-HTTP on the wire.

P.S. If anyone is interested in the history, see the "machine-readable schema: Avro or protobuf?" thread from March 2014.

ekg commented 9 years ago

vg isn't (yet) compliant, but uses a compressed streaming protobuf protocol for serialization and a schema to define its graph, alignments, and annotations. There is a JSON equivalent. As far as I know protobuf doesn't have a way to define methods. I prefer it that way. At this stage it is much easier to think about data. The things we want to do with it still aren't entirely clear to me.

diekhans commented 9 years ago

Hi @dglazer ... interesting. What is important here is not what has been implemented, but what the design and goals are.

I now see exactly what you (and this bug report) are talking about.

Many of us (who don't get to write code) think we are using Avro as intended. Most of us don't like the Avro restrictions, but felt they were worth living with because of all the advantages of the Avro protocol and programming-language-neutral libraries. We took "we are using Avro" seriously.

Technically, this is all very addressable through content-type negotiation and client-side libraries that talk raw JSON, the Avro protocol, ProtoBuf, etc.
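As a purely hypothetical sketch of what that negotiation could look like on a server (the endpoint, media type names and encode_avro helper are made up for illustration, not proposed here):

from flask import Flask, Response, jsonify, request

app = Flask(__name__)

def encode_avro(results):
    # Stand-in for a real Avro binary encoder (e.g. via the avro or fastavro libraries).
    raise NotImplementedError

@app.route("/variants/search", methods=["POST"])
def search_variants():
    results = {"variants": []}  # a real server would search using request.get_json()
    # Pick the representation the client asked for via its Accept header.
    best = request.accept_mimetypes.best_match(["application/json", "avro/binary"])
    if best == "avro/binary":
        return Response(encode_avro(results), mimetype="avro/binary")
    return jsonify(results)  # default: plain JSON on the wire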

However, it's a much bigger problem that the design is not clear within the group.

Mark

ekg commented 9 years ago

Reading back through that thread on Avro versus protobuf is interesting and very frustrating. I wasn't in the project then, and it seems like the decision has come and gone.

In summary, there were a few technical concerns with protobuf, such as the lack of maps and unions (both are supported in protobuf 3), and then the rest of the discussion focused on the political message that would be sent by choosing protobuf over a "true" open source project.

I get the point, and respect the sentiment. But I don't think it has had the desired effect. The problem is that choosing a very particular and little known schema language causes more friction than is relieved by the fact that it is supported by an open source foundation. The message that is sent is that the group is not paying attention to the rest of the tech community. I don't think that's what we want.

We can still clarify this. We don't need to have a single schema definition. What is important is a common set of nouns and verbs. These could be written in a text file. Where concepts are truly compatible, it is not difficult to convert the actual data formats.

pgrosu commented 9 years ago

I agree with @ekg. Maybe besides the list of verbs/nouns, we can brainstorm the different types of analysis and associated data searches people would wish for now and in the future, without worrying about the implemented data structures yet. It might be that people feel the data is not dynamic enough, but it can be. Would people be interested in associating/connecting sequences and processes on the fly, including annotation? The capabilities today in computer science are much more flexible than before. So I feel we should perform this kind of exploration of the needs/wants for data and analysis, and then adjust the schemas afterwards to accommodate these preferred capabilities.

nlwashington commented 9 years ago

watch this @cmungall @jnguyenx

massie commented 9 years ago

The Avro schema definitions can be used to generate code for your implementation, if you choose; however, it's not required. The schema is just a well-defined blueprint, so to speak, that allows for interoperability and common entities across implementations.

Everyone has their favorite schema language or serialization system, and each system has strengths and weaknesses. Avro was chosen because of its openness and interoperability. If you like protobuf better, there is protobuf compatibility. If you like Thrift better, there is Thrift compatibility. Since Avro is at the Apache Software Foundation, it is open to extension by any member of the GA4GH community. If you like, you can easily store Avro data inside Apache Parquet, which is the current de facto standard columnar storage format in the big data space. You can check out the GA4GH store for an example of how easy it is to store data in Parquet. In short, Avro gives the GA4GH community options and flexibility as we evolve. Having a well-defined data exchange format makes it easier to share data between GA4GH data repositories.

I don't think that we are "not paying attention to the rest of the tech community" at all. Quite the opposite. We are choosing technologies with strong, open communities that integrate well with open-source big data technologies.

@ekg, I don't understand what you mean by "We don't need a single schema definition. What is important is a common set of nouns and verbs. These could be written in a text file." What is a schema but a definition of nouns and verbs saved in a text file? What would be the advantage of defining our own GA4GH interface description language? We'd then need to create parsers, converters and adaptors for all the systems that are out there.

Avro, Thrift and Protobuf are tools for data management and data exchange, and in particular, management and exchange of petabytes of data. For example, the Avro format is "splittable" so that it can be analyzed on distributed execution systems, like Spark, and distributed file systems, like the Hadoop Distributed File System (HDFS). Avro also handles schema evolution. If you have petabytes of genomic data, you don't want to reformat your data every time the schema changes. As long as schema updates are kept compatible, you can automatically read "old" and "new" data formats together. There's no need for explicit versioning.
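A small sketch of that schema-evolution behaviour, using the fastavro package and made-up record/field names rather than the real GA4GH schemas:

import io
from fastavro import parse_schema, schemaless_reader, schemaless_writer

old_schema = parse_schema({
    "type": "record", "name": "CallSet",
    "fields": [{"name": "id", "type": "string"}],
})
# The "new" schema adds a field with a default, a backward-compatible change.
new_schema = parse_schema({
    "type": "record", "name": "CallSet",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "sampleId", "type": ["null", "string"], "default": None},
    ],
})

buf = io.BytesIO()
schemaless_writer(buf, old_schema, {"id": "cs-1"})  # data written under the old schema
buf.seek(0)
# Read it back under the new schema; the missing field picks up its default.
print(schemaless_reader(buf, old_schema, new_schema))  # {'id': 'cs-1', 'sampleId': None}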

Avro data is self-descriptive. The schema is always stored with the data in a format that is human-readable and machine parseable (JSON). This is why Avro data can so easily be integrated into a number of different systems -- data readers can count on having access to the schema. The records, however, are stored in a compact binary format.

While Avro provides a syntax in the IDL for remote procedure calls (RPC), you could also easily convert these "verbs" into your favorite RESTful framework if you wanted. REST is absolutely horrible for data exchange, being request/response over a verbose text protocol, but REST can be useful for launching specific jobs on remote sites to analyze data in a repository.

We can easily get bogged down into a protracted debate about the best serialization and RPC system out there (punchline: there isn't one). A better approach is to choose one, however imperfect, and start making progress.

pgrosu commented 9 years ago

@massie, I agree with you on schema evolution, but what you are describing is a data exchange format that could become quite heavy if the schema is dragged along with multiple layers of nested data, for which protobuf/Thrift could be more efficient. The data model definition and API deliverables for the DWG suggest the following:

Data model. A data model is an abstract, mathematically complete and precise model of the data that is manipulated by the API.

...DWG will work to [...] develop formal data models, APIs and reference implementations of those APIs for representing, submitting, exchanging, querying, and analyzing genomic data in a scalable and potentially distributed fashion...

I'm not sure Avro is mathematically complete, and methods (verbs) are difficult to describe precisely in terms of an algorithm. Columnar storage might be appropriate for some things, but would suffer under many inserts as more analysis utilizes existing data and adds new post-processed results. I think we should push the boundaries of how we store data, their relations, and the methods we attach to them, among other things. Take this example I put together of a tuple approach, which can take the form of key-value based storage. This could be implemented for part of the schemas, while others can use other approaches such as columnar storage/Parquet:

Let's define a relationship as a collection of tuples, which would have a name, schema and associated data in the following form:

( name, schema, data )

Let's say we have the following stored as one leaf of many VariantSets that are replicated and sharded:

list( VariantSet, 
      {  
        string id;  
        string datasetId;
        string referenceSetId;
        array<VariantSetMetadata> metadata = [];
      },
      list('595f44fec1e92a71', '5127a7b60fa4d', '91bf6d6c3f263a8', ... ),
      list('595f44fec1e92134', '5127a7b667576', '1234er2c3f263a8', ... ),
      ...)

Now let's say we want to propagate a schema change; all we need to do is run a function over the list to change the entries appropriately.

> add.attribute( "string ChangedByUser [default=null], VariantSet )
> add.data( list('595f44fec1e92a71', '5127a7b60fa4d', '91bf6d6c3f263a8', [], 'Google'), VariantSet )
> propagate.function( updateUserForDatasets( 'Broad', list(datasetId1, ... ) ) , VariantSet)
> ReadAlignment.by.nextMatePosition <- transpose.and.fold( ReadAlignment, by=nextMatePosition)

The last one would generate a new stored relation that would have all unique nextMatePosition pointing to a list of ReadAlignments, which could also be implemented via a Map/Reduce approach.

Since this borrows heavily from the area of functional programming, streaming would be fairly straightforward to implement in order to seek to specific areas, which might be difficult to model in Avro, as @dglazer previously commented. There is a more complex version of the above that can be implemented with the schemas stored separately from the data (i.e. in a Schema MetaStore) and then called by reference.

I definitely agree with being proactive, but it would be good to periodically take a step back and look at the big picture and what impact it would have downstream, especially if limitations might arise.

Paul

delagoya commented 9 years ago

There is a principle at Amazon of "Disagree and commit," which I think we should follow here. It is a recognition that sometimes all of the options on the table are suboptimal for one reason or another, and a robust debate will elucidate those issues. Once the decision is made by the group, all parties must commit to it in full, and the issue is no longer up for debate.

@ekg and I will be holding an open discussion about this topic during the plenary in Leiden. The slotted time is Tuesday June 9, 9:30AM-10:30AM local Leiden time (which I believe is CEST, or UTC+2).

If you are not able to attend the plenary, please email me at delagoya at gmail. We will collate the arguments above as well.

diekhans commented 9 years ago

As far as I can tell, technical requirements were never gathered for an IDL/messaging/schema system. If that was done, would someone please point us at it. Many of us have no idea what was committed to or what the goals were.

Avro is currently failing us because we are misusing it. We don't use the Avro protocol, and it is not designed to parse JSON that it didn't generate. The Java and Python JSON parsers behave differently, with the Java one not looking at the field names but assuming they are in the same order as the schema.

The reference server group has spent a lot of time working around Avro problems. They have had to implement functionality you would normally expect from an IDL.

Even worse, we have an API that fails to specify what the protocol is. If the Avro schemas are only documentation, as implied by the subject, then not having the protocol defined in detail makes interchangeability nearly impossible.

Relequestual commented 9 years ago

+1 what @diekhans said!

@delagoya Can we make sure that the floor is given first to those who have explicit experience using these technologies in a current production environment? I feel there has been lots of speculation without much practical insight into what works "on the ground". (I'm not including myself in that group, but I would hope someone who is will be there.)

If we WERE to move away from using Avro for the API documentation, then it's a BIG change. Is an open discussion the best forum for this? Is 1 hour enough? (I don't have the experience to judge that; maybe it is and will be.)

awz commented 9 years ago

I can think of many people that I'd like to be part of the "Avro Objects" discussion, but those people are some of the busiest in our group. I know @tetron will not be in Leiden; what about @massie @fnothaft ?

fnothaft commented 9 years ago

I'll be at the plenary in Leiden.

skeenan commented 9 years ago

@Relequestual / all, there is scope to extend the meeting to 12:30 if required. @awz, it might be helpful to canvass @tetron in person and represent his views at the meeting.

lh3 commented 9 years ago

There have been quite a few tools for querying variants and genotype data, such as Gemini, GQT, bcftools, SnpSift, my unfinished BGT, AAiM from Bina, and recently Google Genomics via BigQuery. I think something they have in common is SQL-based or SQL-inspired flexible query, supporting arbitrary expressions over a wide range of meta annotations or user-defined fields. For example, Gemini can do a query like "select * from variants where is_lof=1 and in_dbsnp=0". Google Genomics shows some quite complex SQL examples. My view is these systems converge because users need this level of flexibility. Querying variants/genotypes is much more complex than retrieving a file by its name or a sequence entry by its accession. It is very difficult for Avro to reach such flexibility. Most existing systems support much richer queries than our current API. Probably some may argue that we can decompose a complex query into a series of simple queries. This doesn't always work. Some SQL queries can't be decomposed without hurting efficiency and/or user experience.
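As a toy illustration of that kind of flexibility (table and column names invented for the example, using sqlite3 rather than any of the tools above):

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE variants (chrom TEXT, pos INT, is_lof INT, in_dbsnp INT, aaf REAL)")
db.executemany(
    "INSERT INTO variants VALUES (?, ?, ?, ?, ?)",
    [("1", 12345, 1, 0, 0.002), ("2", 67890, 0, 1, 0.31)],
)
# An arbitrary boolean expression over annotation fields, Gemini-style.
rows = db.execute(
    "SELECT * FROM variants WHERE is_lof = 1 AND in_dbsnp = 0 AND aaf < 0.01"
).fetchall()
print(rows)  # only the rare loss-of-function variant is returned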

I guess what I want to say is that it might be good to jump out of our json circle for a moment and open our mind to other alternatives for variant/genotype query. For example, would a query language work (cc @akiani)? Is it possible at all to perform all queries in SQL (cc @mlin) even if the backend is not SQL-compatible? Perhaps we might be able to define a SQL schema requiring a core set of fields and allow the rest to be repo-dependent? Is it better or worse than our current design? Or are there other close-to-SQL or NoSQL/non-DB interfaces/languages that best meet users' needs?

akiani commented 9 years ago

Thanks @lh3 for bringing this up. I can definitely present our query language to you all on a call. In the absence of a standard, we have gone with a query language we created for our annotation querying engine... it's mostly a way to (visually/in JSON) create a binary expression for filtering variants from different databases.

On a high level, it is something like:

AND:
     - criteria 1
     - criteria 2
     OR: 
           - criteria 3
           - criteria 4
                   - AND
                   ...
     - criteria 4

So it's basically a way to create a binary tree (inspired by the decision trees used for variant interpretation guidelines)

We do use a NoSQL backend because we mostly focus on large datasets such as WGS and many different databases so this works on top of our internal query engine.

I am not proposing we go with what I mentioned, but I would love to help move towards a standard that we can all adopt (especially because I see this becoming an important integration point for downstream tools).
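A minimal sketch of how such a nested AND/OR expression might be evaluated against a single variant record (the encoding of the criteria below is hypothetical, not our actual format):

def evaluate(node, variant):
    # Boolean nodes recurse over their children; leaf nodes compare one field.
    if "AND" in node:
        return all(evaluate(child, variant) for child in node["AND"])
    if "OR" in node:
        return any(evaluate(child, variant) for child in node["OR"])
    field, op, value = node["field"], node["op"], node["value"]
    if op == "eq":
        return variant.get(field) == value
    if op == "lt":
        return field in variant and variant[field] < value
    raise ValueError("unsupported operator: " + op)

query = {"AND": [
    {"field": "is_lof", "op": "eq", "value": 1},
    {"OR": [
        {"field": "in_dbsnp", "op": "eq", "value": 0},
        {"field": "allele_freq", "op": "lt", "value": 0.01},
    ]},
]}
print(evaluate(query, {"is_lof": 1, "in_dbsnp": 1, "allele_freq": 0.001}))  # True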

fnothaft commented 9 years ago

@lh3 Perhaps I'm misunderstanding what you're suggesting, but there are several query engines (Impala, Spark SQL, Hive, etc) that can natively execute queries on top of data stored in Avro. I think these engines may make some minor restrictions on the schemas (some of them are restricted to flat schemas), but otherwise can evaluate queries on top of binary Avro data.

IMO, there aren't any genomic queries that can't be expressed in plain SQL, but there are many that can be accelerated by extending SQL. E.g., my colleague @kozanitis has done work to extend the Spark SQL query optimizer with support for a primitive we refer to as an "overlap" or "region" join (join two tables where the "equality" operator is whether two objects overlap in a coordinate plane). If you don't make any optimizations, you need to fall back on doing a full cartesian product to evaluate this join. We have several variants of this (@kozanitis' one is the only one implemented in Spark SQL and uses interval trees; we have additional pure Spark implementations that are broadcast and sort-merge joins).
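For anyone unfamiliar with the primitive, here is a deliberately naive sketch of an overlap join (nothing to do with the actual Spark SQL implementation; the interval-tree and sort-merge variants mentioned above exist precisely to avoid this quadratic scan):

def overlap_join(left, right):
    # Join condition: same reference and the half-open [start, end) intervals overlap.
    for l in left:
        for r in right:
            if l["ref"] == r["ref"] and l["start"] < r["end"] and r["start"] < l["end"]:
                yield l, r

reads = [{"ref": "1", "start": 100, "end": 200, "name": "read1"}]
genes = [{"ref": "1", "start": 150, "end": 5000, "name": "geneA"}]
print(list(overlap_join(reads, genes)))  # read1 pairs with geneA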

lh3 commented 9 years ago

I am not talking about backends at all. I am only thinking about the interface to users - what is the best way for users to get their desired results. Retrospectively, we at GA4GH decided to adopt JSON from the very beginning with no one arguing for any alternatives. I had little experience in variant/genotype query at that time, so I followed others and took JSON for granted. Nonetheless, for my recent work, I started to study how others achieve such queries. Reviewing the existing work, I actually feel their interfaces are more convenient and flexible. Our current APIs can't do many queries implemented in the existing tools; even if we extend our API later, the interface will be clumsy in JSON as far as I can foresee. I think it would be good to take a step back and have a look at the progress outside GA4GH. Even if we think Avro/JSON is the right way, we can probably learn something from others. At least I have learned a lot.

ekg commented 9 years ago

I don't think it's the query language or API that matters for integration. There are just too many access patterns to try to standardize around. Nor is it the serialization format (file formats). There are different ways of expressing the same underlying information. So that is what is essential: what do we want to express fundamentally? In some sense we are still trying to figure this out.

If the group wants to make an impact, we should do as @lh3 suggests and learn from explorations in this space, both ours and others'. This will make it much clearer how we can reduce the friction of collaboration and data sharing. I worry that the schemas and their technicalities have been a distraction from this process.

pgrosu commented 9 years ago

@lh3 I agree, and I think I posted about this on multiple occasions last year, but it didn't seem to get much traction. If you remember, last year I started with a Google Maps approach to genomic browsing (https://github.com/ga4gh/schemas/issues/57) for querying the genome, and then I simplified the genomic browser approach to querying across regions (https://github.com/ga4gh/schemas/pull/253#issuecomment-97172451) for rare mutations in cancer subtypes, which I posted several times.

I then posted (#131), which I tried again under this post (https://github.com/ga4gh/schemas/issues/145#issuecomment-56269836) to try to gain traction. Then we started talking about possible infrastructure capabilities in (#150) to enable something like this. To help bring all the discussions together, I summarized all the ideas regarding our data models, goals and the current field of NoSQL in the following post (https://github.com/ga4gh/schemas/issues/264#issuecomment-90081657), with examples of how they would tie together. Then I talked about cloud-enabled queryable data structures through the use of inverted indices, which web search engines currently use, etc. etc.

Keep in mind SQL basically has its foundation in relational algebra, from which all SQL queries can be built up through the combination of a few operations: projection (pi), selection (sigma) and natural join. There are a few more operations, which you can read about in the link on relational algebra, for creating all the possible queries.

I agree that querying is important, but even the BigQuery approaches are still extremely limiting for what we really need, as they are not self-updating and take time to load the data, rather than building and querying the necessary structured data on the fly. We do not only want to query based on subsets of records; we want triggers attached to metadata and specific sequence variants as the data streams into the repositories, associating them with recommended probable studies on the fly while alerting us of them and their respective samples. Think of this: let's say a variant is automatically detected based on subsets of auto-triggered alignments of new data to subsets of reads in the system. We want that to alert us. Are there regions of stored graphs across the molecular evolution of cancer subtype samples similar to other ones, and how does their RNAseq activity compare? We want that to alert us. We should be able to ask the system to build us a probabilistic molecular evolution model from some samples. Then we can ask the system to let us know if there are other modeled collections of k samples which correlate with parts of our model - or other models - and whether their distribution became stationary as we add more samples - that is another alert. If there is a specific combination of words in the samples, we want that to alert us. If an improperly named sample contains genomic sequences that align to known samples of which we require more, we want that to alert us. Facebook does this all the time for online (live) querying of posts to generate topical trends placed on a timeline during presidential speeches. Remember Google Flu Trends? Thus what we want is much bigger than any of the currently implemented query systems out there. This requires a new design, which I am still waiting for us to get to discuss. I think Facebook, Google, Yahoo, etc. have accomplished amazing things with querying live (online) data for advertising and other domains (maps, video, images, repository synchronization, cross-domain integration, etc). We want to borrow those ideas and apply them here, since much of that has already been well researched and can have great benefits here, not only for research but for personalized medicine as well.

I'm really happy that we are starting this discussion.

Paul

fnothaft commented 9 years ago

@lh3

I am not talking about backends at all. I am only thinking about the interface to users - what is the best way for users to get their desired results.

I'm not talking about backends either. Rather, my point ("IMO, there aren't any genomic queries that can't be expressed in plain SQL, but there are many that can be accelerated by extending SQL.") was that many common genomic queries can be represented via SQL + UDF. The API we've been defining is oriented strictly towards data access, which makes the API very restrictive. Even computation over graphs can be represented via relational means, provided that you're willing to trade off the beauty of the API or the efficiency of the system.

FWIW, there are reasonably efficient graph systems built on top of dataflow systems. While dataflow systems are not strictly relational, they're reasonably similar.

Retrospectively, we GA4GH decided to adopt JSON from the very beginning with no one arguing any alternatives.

IIRC, there was a long discussion about REST + JSON vs. other approaches in the early phases of the GA4GH. I can't find an "optimal" link for this, but this link seems reasonably representative. Perhaps @dglazer or @massie could suggest a thread to review? I'm not sure how much of the discussion occurred on/off email.

@ekg

I don't think its the query language or api that matters for integration. There are just too many access patterns to try to standardize around. Nor is it the serialization format (file formats). There are different ways of expressing the same underlying information.

This is a core argument that we make in the ADAM paper. Simply put, a schema describes a logical view of data; it is disconnected from the physical storage of that data. Any physical representation of the data (i.e., the serialization format; e.g., SAM/BAM, CRAM for reads; VCF/BCF, MAF for variants; BED, NarrowPeak, BigWig for features, etc.) can be translated into a logical view of the schema (see slides 16-18). Once you've defined a logical view of the data, it is straightforward to define and optimize higher-level query patterns.
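To make the "logical view" point concrete, here is a small sketch in which two different physical representations of a read (a SAM-style text line and a JSON-ish dict) are mapped onto one format-neutral record; the field names are illustrative, not the ADAM or GA4GH schema:

from dataclasses import dataclass

@dataclass
class ReadAlignment:          # the logical view: what the data means
    name: str
    reference_name: str
    start: int                # 0-based
    sequence: str

def from_sam_line(line: str) -> ReadAlignment:
    # SAM columns: QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ ...
    qname, _flag, rname, pos, _mapq, _cigar, _rnext, _pnext, _tlen, seq = line.split("\t")[:10]
    return ReadAlignment(qname, rname, int(pos) - 1, seq)  # SAM positions are 1-based

def from_json_dict(d: dict) -> ReadAlignment:
    return ReadAlignment(d["name"], d["ref"], d["start"], d["seq"])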

So that is what is essential: what do we want to express fundamentally? I worry that the schemas and their technicalities have been a distraction from this process.

I think that the REST-based approach we've chosen has struggled because I don't think there is a reasonable way to define a set of fundamental query patterns for genomics. This is a huge, broad field...

Additionally, while I don't think that worrying about the schemas themselves is misplaced (you can't define queries without a data model), I would agree that debating the technicalities of the schemas is generally going to be fruitless without people building on top of the schemas. While I enjoy the discussions around the nuances of the schemas (e.g., can we formally evaluate whether two X graphs are homomorphic, where X is one of "string", "adjacency", etc.?), I find the practical implementation and evaluation of systems to be more fruitful.

pgrosu commented 9 years ago

Heng (@lh3), one more very important thing: even though you are studying different approaches, please don't hesitate to ask for clarification, or whether something might be possible, regarding anything no matter how trivial or technical. Someone else probably has the same question. We would be more than glad to expand to any level of detail. It is most important that everyone is in sync with today's possibilities in computer science to draw upon, since then our models and APIs will be as complete, extensible and close to optimal as possible. We don't want to reinvent something that already exists - or worse, limit ourselves - because we were not aware of something, but rather push the boundaries for storing, retrieving, transmitting, querying and processing NGS (meta)data based on current trends in computer science. This is as good a time as any to review and to brainstorm, since we have one more year for our endeavor, which is plenty to make it as complete and good as possible.

So the key message I am trying to communicate here is please feel free to ask anything.

Have a great weekend! Paul

delagoya commented 9 years ago

I think that the discussion about possible query methods and query methodology is separate from the discussion of the schema definition format. As @fnothaft points out "a schema describes a logical view of data" and the system to query that logical view can be independent of the schema definition.

Can I propose to take the query/methods/methodology discussion to another thread?

For the discussion regarding the schema definition format, we want to be clear about which schema constructs are proving difficult in AVRO and discuss those. At the beginning of the meeting, we will present a straw-man set of such examples and have a productive discussion to either find a suitable construct in AVRO, or identify it as a blocker issue, in which case we will look at alternate formats. As @Relequestual stated above, changing schema formats is not a trivial undertaking, and we should weigh it appropriately.

During the meeting, @ekg and I will act as moderators of the discussion. Since we only have a short time (a couple of hours maximum), we will actively steer the conversation toward the above goals, and respectfully ask that participants save their words for arguments that they strongly want to make and that have not already been stated in the meeting. Any line of discussion that seems to be veering too far off target may be tabled for later discussion on a separate thread.

mlin commented 9 years ago

I appreciate @delagoya's approach to wrangling a spiraling discussion, and look forward to seeing folks in Leiden :)

The spiraling discussion is profoundly important, and I hope we find the right forum for it. Folks such as @lh3 and @ekg are core prospective "customers" of a genomics API, and in the above comments I read basic questions about the thing's utility. I've raised such questions too, but I shouldn't be taken as seriously.

diekhans commented 9 years ago

@ekg, you point out a huge problem we have: lack of common agreement on the goals.

Many of us think that the goal is to define a data exchange methodology. This requires an exact definition of the data serialization format. Schemas and software stacks that implement the serialization format will greatly facilitate development of this format.

It sounds like you are suggesting the goal is only a conceptual data model?

delagoya commented 9 years ago

@diekhans you make a good point here, in that at least some of the problems that we are facing are due to not having a ready platform to gain immediate feedback on the schema constructs being discussed.

Perhaps it will be a good use of our time to spend the first 10 minutes of the session going over AVRO tooling that we have available, so that going forward we can use that to inform further discussions.

RE: your point about the goal, it was my understanding that the majority of the task is defining a conceptual data model, with a nod to the practical aspects of using that model for real systems. Most standardization efforts do not pay attention to that last bit, and produce standards that are good for bulk transfer and archiving of data, but not very good for operational use. We want to have our cake and eat it too ;-)

pgrosu commented 9 years ago

Heng (@lh3), here's one more thing for your reading list that you might find useful:

Scuba: Diving into Data at Facebook

This is Facebook's Scuba for querying at webscale. Their API implementation is via Thrift though.

~p

diekhans commented 9 years ago

Here is an analysis by @wstidolph on some of the issues he has had attempting to use Avro in Java:

GA4GH JSON differs from Avro-JSON

pgrosu commented 9 years ago

@diekhans Is the document okay to edit? Things like "Conformance too Kit" probably should be "Conformance Tool Kit"

fnothaft commented 9 years ago

@diekhans @wstidolph The Avro spec documents that the output you observed is the "correct" Avro JSON. However, Avro's JSON representation != a general JSON schema; it is a mapping of Avro binary data, schema, and default fields into JSON. Although the GA4GH spec is defined in the Avro IDL, the spec is JSON over REST.
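A concrete illustration of the mismatch (field names invented for the example): for a field declared as the union ["null", "string"], Avro's JSON encoding wraps the value in a one-key object naming the branch, which is not what a typical REST client sends or expects:

# What a plain JSON-over-REST client sends for a nullable string field:
plain_json_record = {"id": "cs-1", "sampleId": "NA12878"}

# What Avro's own JSON encoding produces for the same record, because the
# union branch ("string") has to be spelled out explicitly:
avro_json_record = {"id": "cs-1", "sampleId": {"string": "NA12878"}}

# A parser that expects Avro-JSON (e.g. the stock Java tooling) will therefore
# reject the plain form, which is one source of the Java/Python differences above.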

pgrosu commented 9 years ago

@diekhans @wstidolph Even though you could use Gson, you can still take the route of JSON<->POJO using Jackson, which you can even stream.

~p

diekhans commented 9 years ago

Thanks @pgrosu, fixed the typo. However, this is offered up as evidence of the problems being encountered, not a long-lived document, so there is no need to edit it.

lh3 commented 9 years ago

As @fnothaft points out "a schema describes a logical view of data" and the system to query that logical view can be independent of the schema definition.

It is true that query methods can be independent of data models, but our schema goes beyond data models. It defines both data models and query methods. That is why we have many "*methods.avdl" files. It seems to me that our schema is actually closer to query methods than data models. We have added some redundant fields to the schema only to make querying more convenient. More importantly, no one is using our schema to organize data, but we are trying to create an interface exactly defined by our schema. Our schema is not only a "logical view of data", but also an approach to accessing the data. It plays a double role.

I have always been confused by the double role of our schema. Now I am inclined to clearly separate the two roles: data models and query methods. Mixing the two has obscured our data model and hampered the exploration of more expressive query methods.
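One way to picture the separation being argued for here - a sketch only, with illustrative names, not a proposed GA4GH design:

from dataclasses import dataclass
from typing import List, Protocol

@dataclass
class Variant:                      # the data model: a plain logical record
    id: str
    reference_name: str
    start: int
    end: int

class VariantQueries(Protocol):     # the query methods: a separate, swappable layer
    def search(self, reference_name: str, start: int, end: int) -> List[Variant]:
        ...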

diekhans commented 9 years ago

Yes, @fnothaft, therein lies the problem. We have a web API that doesn't define the wire protocol, and we have a tool chain to implement it that we are misusing to the point that it creates tons of work rather than saving work.

pgrosu commented 9 years ago

@diekhans Ah, no problem - thanks :)

diekhans commented 9 years ago

I just created https://github.com/ga4gh/schemas/issues/323 to retroactively collect requirements for schema/API definition technology. Please add thoughts.

diekhans commented 9 years ago

Good points @lh3. I think we are ending up with more complex schemas because we assume a weak set of queries. For example, many objects have pointers to their containers in case they are needed, contrary to modern software practices.

IMHO, we will end up with a cleaner schema if we build functionality through queries rather than more complex data.

bartgrantham commented 9 years ago

@lh3 wrote:

That is why we have many "*methods.avdl". It seems to me that our schema is actually closer to query methods than data models.

Completely agree. There's as much definition of "verbs" (queries) as there are "nouns" (the product of queries). I believe this is a consequence of Avro's RPC heritage.

Now I am inclined to clearly separate the two roles: data models and query methods.

Avro is at least three things at once: a schema definition language, a binary serialization format, and an RPC mechanism. It doesn't really define a query model, SQL or otherwise. I think it's very, very important that this discussion keep these things straight. How you define a schema, how you do serialization of data, how you express queries, and API mechanics are all non-orthogonal, but they are separate things.

I think a lot of the confusion is that people don't know which of these problems Avro is intended to solve. See you all tomorrow.

Relequestual commented 9 years ago

Post-GA4GH-plenary meeting update: much discussion. Someone (sorry, I forget who) was tasked with looking at how much work it would be to move to protocol buffers and reporting back to the group.

@delagoya Anything to add to this specifically, or should we wait till you've circulated the follow-up document (which I think you said you would send round - correct me if I'm wrong)?

pgrosu commented 9 years ago

Coooool!!! I'm really happy to hear this great news :)

delagoya commented 9 years ago

AVRO Session Summary

During the Leiden meeting, various members of the DWG had a lively discussion about our current methods of work. The goal of the session was to bring in as much community comment as possible, to flush out what has worked, what hasn't and why we think it did not, and what the group will do about the issues we identified. The point of the discussion was specifically not to commit the larger group to a specific solution without input from the broader community.

Identified Issues:

  1. A lack of clarity about what the specification is actually specifying, and what the "source of truth" is. For example, are we specifically requiring the AVRO JSON format (which has some differences from a standard JSON translation of the AVRO IDL schema) as the source of truth, and thus requiring that AVRO JSON be the on-the-wire format?
  2. A lack of beginner documentation/guidance/software library support to enable newcomers to move from beginner status to capable implementor
  3. There is a strong desire from the community to be able to have access to the data model, and hence the data, separate from the methods defined in the specification.
  4. There are serious issues being encountered with incompatibility between different AVRO language bindings
  5. There is a current lack of tooling around our schema and process to enable developers of the schema to test out proposed schema changes before putting to the larger group. In particular an implementation and validation framework is seen as crucial to further development of the schema and reference implementation
  6. We should have an index of implementations out there for reference
  7. Be clearer about what our goals are, with explicit identification of use cases and whether we have met them.
  8. There is a strong desire for a revamped release process
  9. There is a strong desire for the adoption of real development branching models. For example, the current Reads API is lacking for some RNA-Seq use cases, and it would be handy if that team could fork a specific release of the Reads API and develop what they need there, with an eye towards merging the necessary changes at a later date to bring it back into the mainline. A similar story can be told for the graph genomes team.
  10. We need to identify how much work would really be needed to shift away from AVRO to a proposed protobuf-based workflow

There were several people in the meeting who stepped up and volunteered to tackle some of the above issues. They will start to enter each one as a separate issue, if it is not already an active issue.

We will leave this issue open for now, but please refrain from responding to the above in this thread. Wait a few days to allow the folks who volunteered to enter the issues into GitHub. If after June 15, 2015 you don't see one of the above being addressed, feel free to make a new issue then.

david4096 commented 7 years ago

We are now firmly in the protobuf ecosystem, which has made serialization and deserialization across languages much clearer.