ga4gh / ga4gh-server

Reference implementation of the APIs defined in ga4gh-schemas. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

Dockerize ga4gh server + example data #305

Closed: macieksmuga closed this issue 9 years ago

macieksmuga commented 9 years ago

To be done once the first stable release is reached.

dcolligan commented 9 years ago

What are the use cases of this? Is this just to save users the trouble of having to download example data from a (jerome's) random server?

macieksmuga commented 9 years ago

For the v0.5.1 server, this will at first be a test of capability.

However, the intention is to deploy the server on realistically sized chunks of genomic data soon, and it will make sense to investigate various ways (Docker containers being a prime candidate) of deploying the server+data across clusters/on the cloud.

For the future v0.6 server, the datasets will likewise start growing fast, and they'll be spread across SQLite databases as well as large FASTA files, so dockerization may be the right approach for deployment there too.

@jeromekelleher does this sound about right to you?

pgrosu commented 9 years ago

@macieksmuga, has the use of SQLite DBs been set in stone? I only mention this because bottlenecks can appear; one in particular is the following:

http://www.sqlite.org/faq.html#q19

~p

macieksmuga commented 9 years ago

@pgrosu the graph genome project will be using SQLite for rapid development and to minimize compatibility issues in a reference implementation, not necessarily for a final full-scale server. So yes, it's not ideal for real-world deployment against a population-scale dataset.

That being said, it shouldn't be too bad for small-scale deployments. The specific bottleneck you mention shouldn't be a problem, as the server will be read-only for the time being. And my experience with SQL database servers is that they can be nicely optimized to fit particular patterns of use as those patterns become apparent.
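For illustration, here is a minimal sketch of what a strictly read-only connection looks like with Python's standard sqlite3 module (Python 3.4+; the database file name is hypothetical, not the server's actual layout):

```python
import sqlite3

# Opening via a URI with mode=ro makes the connection strictly read-only:
# SQLite takes no write lock, so concurrent readers don't contend.
# "registry.db" is a hypothetical file name used only for this sketch.
conn = sqlite3.connect("file:registry.db?mode=ro", uri=True)

# Ordinary queries work as usual; any write raises an OperationalError.
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
print(tables)
conn.close()
```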

pgrosu commented 9 years ago

@macieksmuga, sounds good, and agreed; though read-only concurrency performance can also be affected, depending on the configuration. Below is a link describing different configurations and how to maximize concurrency for read-only workloads:

http://manski.net/2012/10/sqlite-performance/
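As a rough sketch of the kind of setup that article discusses - one read-only connection per thread, so readers never serialize on a write lock (the file name and query are hypothetical):

```python
import sqlite3
import threading

DB_URI = "file:registry.db?mode=ro"  # hypothetical database file

def reader(n):
    # Each thread opens its own read-only connection; sqlite3 connections
    # should not be shared across threads without extra care.
    conn = sqlite3.connect(DB_URI, uri=True)
    conn.execute("PRAGMA query_only = ON")  # refuse writes outright
    (count,) = conn.execute("SELECT count(*) FROM sqlite_master").fetchone()
    print("reader %d sees %d schema entries" % (n, count))
    conn.close()

threads = [threading.Thread(target=reader, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```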

Hope it helps, ~p

diekhans commented 9 years ago

All databases can encounter bottlenecks. The fact that the database will be read-only is a huge advantage in any approach. It's a good choice for a reference implementation.


pgrosu commented 9 years ago

@diekhans, I agree for now, but definitely not for the near future. If we look just at the kilobases of sequencing per day per machine from a few years ago - which I'm sure will not plateau anytime soon as things get more "affordable" - I imagine this data will eventually feed into GA4GH repositories on a daily basis:

[image: sequencing throughput chart]

pgrosu commented 9 years ago

@diekhans, I'm thinking more specifically of alignment to custom transcriptomes, so I imagine that write performance would also need to scale for that. I have a vision of GA4GH as being quite dynamic and customizable: as new data feeds into the repositories, previously generated summary analyses and annotations spanning multiple datasets would be recomputed online and updated dynamically.

Paul

diekhans commented 9 years ago

This is for the reference server. The goals for the reference server are to validate the API, develop a conformance suite, and allow for the development of applications. It needs to be easy to understand and to enhance.

It's not intended to be a high-performance server, and it will fail in its goals if we try to make it one.

Other high-performance, parallel, distributed servers will be developed. Google is already doing it. Hopefully there will be multiple open source ones.


fnothaft commented 9 years ago

+1 @diekhans

pgrosu commented 9 years ago

I apologize ahead of time, but this will be a little long. Designing for scalability, throughput, and framework performance/functionality is something that needs to be clearly thought out before doing any coding, and I feel we are building it on the fly. I know you are worried we would fail if our scope is too large, but we should not fail if we clearly write out our implementation plans before we code anything. If we write out our function contract definitions, with UML diagrams, before we start coding, then we can evaluate the whole design approach. We keep updating the code as we go along, but stepping back and evaluating the whole design might be a fruitful exercise. I have always been surprised by, and benefited greatly from, that kind of whole-system evaluation.

Having been part of the discussions on GitHub for almost a year, it is very hard to follow how the GA4GH project is progressing. Cassie (@cassiedoll) made it easy for me to understand what was going on day-to-day based on the PRs and issues; for the past two months I have not seen one post from her. There used to be other folks who contributed in the past, and now I don't see much from them either. David (@dglazer) used to systematize things quite nicely last year, which helped me stay in sync, and for months I saw nothing until just a few days ago. Currently, PRs and issues on schemas seem to pop up either in relation to some PR or just out of the blue. Outside of the server implementation, the GA4GH project doesn't feel like it has the momentum it did a year ago, and that momentum is desperately needed. We have expertise here that was built over decades, and we can do so much more with it. Sometimes even schemas and server are out of sync with each other.

I've been part of large projects before, but it's very hard to tell where we are and how in sync we are with GA4GH's original goals. I'm fairly systematic and organized about problems/tasks, but in the year I've been in these discussions I have yet to find a central project document/web page with a plan and action items that clearly states what has been done and what needs to be accomplished next by the members of all the different teams. I only found the different teams because someone happened to post something related to one of them. For instance, if we look at Erik's (@ekg) action list for the variant graph tools project, there is a clear list in the Development area of what has been accomplished and what still needs to be accomplished. We desperately need a central document/web page, posted publicly, to sync everyone up on what is in the works, what has been done, and what needs to be done next by each team - including how to reach the repositories/discussion lists for all the different teams.

GA4GH has fairly ambitious goals, which I naturally connected with right away, but I see a deviation between parts of the project's scope and some of its implementations. The Priorities 2014 04 28 Final for Posting.pdf document clearly states the following in its first paragraph:

"Therefore over the next two years, the Data Working Group (DWG) will work with the international community to develop formal data models, APIs and reference implementations of those APIs for representing, submitting, exchanging, querying, and analyzing genomic data in a scalable and potentially distributed fashion, consistent with the security model developed by the SWG, the clinical data framework developed by the CWG, and the International Code of Conduct for Genomic and Health-Related Data Sharing developed by the REWG."

A proper high-performance, scalable, and distributed API enables opportunities that previously were not possible, and the GA4GH mission statement requires this. Though the server project is impressive in what it is accomplishing, GA4GH is geared toward high performance, not a local client/server, since the throughput involved is orders of magnitude larger. And as you saw recently (reads-client/server-discussion), we are starting to shrink the scope of some of the features, which runs counter to the goals of GA4GH. It is illogical to try to attack a Big Data problem with a small design, especially when the algorithms and resources are more available than ever before. This has been a disconnect for me for almost a year, when we have so much expertise in this project to bridge that gap.

I know generally the implementation that Google is going for, but we really need to put our minds together to ensure planning and coherency between the schemas, teams, and client implementations - or we end up with discussions like reads-client/server-discussion after the fact, when huge amounts of data and analysis have already been performed on earlier schema implementations, which could easily have been foreseen and prevented. Framework design and implementation in a software development life cycle requires deep integration and check-ins between the different teams. For this, discussions via issues and PRs are not enough. We need a clear central web page where one can come in and see where all the teams and action items currently stand. I was really hoping to see more from the research teams at Google about approaches we can take; having read a good majority of their papers, they have deep parallels to what we are trying to do. This could be accomplished with open-source software tweaked for this project. Yes, I know the API is just a definition, but we need to keep the definitions open to what is currently possible. Then we can do so much more with the server implementation.

If we just create definitions and test them locally, how can we show that we have something more interesting than what large centers already have in-house? We have the expertise here to completely change the field of NGS data integration and analysis. I think we need some coherency in the project, tasks, teams, and scalable implementations, or we run the risk of coming up short of what is possible and building something that has been built before.

This was a little lengthy, but we are halfway through the two-year GA4GH initiative. It might be helpful to evaluate what we have done so far, where we need to be, and all the avenues we can take. We might surprise ourselves with what is possible.

Let me know what you think.

Thank you, Paul

jeromekelleher commented 9 years ago

@macieksmuga - sounds about right to me.

@dcolligan, the main advantage I would see in making Docker images is avoiding the hassle of configuring Apache and so on, which is actually quite tedious.
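To make that concrete, here is a minimal Dockerfile sketch of what such an image might look like. The base image, package name, data path, and entry point are all assumptions for illustration, not the project's actual build:

```dockerfile
# Hypothetical sketch: bundle the server and example data in one image
# so users skip the Apache/mod_wsgi configuration entirely.
FROM ubuntu:14.04

# Package name assumed for illustration.
RUN apt-get update && \
    apt-get install -y python python-pip zlib1g-dev && \
    pip install ga4gh

# Bake a small example dataset into the image (path hypothetical).
COPY example-data /srv/ga4gh/data

EXPOSE 8000

# Run the development server against the bundled data; a production
# image would front this with Apache + mod_wsgi instead.
CMD ["ga4gh_server", "--host", "0.0.0.0", "--port", "8000"]
```

With something along those lines, `docker run -p 8000:8000 <image>` would be the entire setup story for a demo instance.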

@diekhans - 100% agreed on your summary.

fnothaft commented 9 years ago

@pgrosu

It is illogical to try to attack a Big Data problem with a small design, especially when the algorithms and resources are more available than before.

I think you're looking at the reference implementation of the server and trying to generalize across the project. The server reference implementation is intended to provide a GA4GH server that is:

  1. Fully compliant with the GA4GH spec
  2. Easy to set up
  3. Free

I.e., it is a tool that allows people to easily set up a GA4GH server sitting on top of their current infrastructure, for free. Additionally, it is a testbed for API development, e.g., for iterating on the graphical variants query model.

There are several distributed implementations of the GA4GH protocols that exist or are in development. The Google GA4GH API is one, and I know folks working on a Spark-based OSS GA4GH server implementation.

Additionally, keep in mind that outside of the major sequencing centers, most people have "small" amounts of genomic data. Making a server that is easy to run on moderately sized genomic datasets solves the 95% case by enabling genomic data sharing for exactly those users.

ekg commented 9 years ago

@pgrosu brought me here by mentioning my work with vg. I'd like to contextualize that within the overall GA4GH efforts around changes in reference technology and compressed evidence and population data.

Although I support the work of the GA4GH standards groups in principle, I do not think that specification-first development is likely to generate usable systems. Ideally, we should seek to produce standards when there are several existing systems that operate in the same space and need to communicate using a common pattern. This gives teeth to the standards process, which it needs in order to have impact.

I urge participants to avoid the bikeshedding trap of familiar client/server interface minutiae and familiar system architectures, and instead tackle the difficult issues and concepts here, which necessarily lie outside of our experience. The only way to do this is to try to make something that works and see what happens. A number of people need to do this for us to have a constructive or meaningful standardization process.

vg has taken almost exactly 6 months to develop, from a protobuf schema I sent @lh3 to a system capable of constructing variation graphs, indexing them, and aligning reads against them efficiently enough that we can consider putting it into production at the Sanger, provided minor improvements in performance and workflow. (@macieksmuga one lesson is that compressed graph indexes are 10s of GBs and compressed kmer indexes are 100s of GBs; SQLite may be inappropriate for DBs of this scale.)

6 months of a PhD student's time is a substantial commitment, but it should be encouraging that millions of $ or £ are not required to make an equivalent system. Others should not be afraid to take on the task as a clearly defined project with a reasonable time frame. This means there is still enough time for several other groups to develop functioning systems within the 2-year time frame. With luck, in another six months we can figure out how to get them to talk to each other. Before getting them to talk to each other, they should at least be able to communicate with themselves and their human operators. They should work, and produce useful results. The basic interfaces and functions they are built from will describe the likely aspects of any common, standard, public interface they might have.

This all said, I'd like to stress that I feel the GA4GH community is immensely valuable, and I have learned a lot from these conversations, even incorporating lessons from discussions among group members into my own work. I suspect others have had similar experiences, and in this sense I see the last year as a great success. My point is that now is the time to put our hands to work on practical systems, or the legacy of the work will be mostly conceptual and community-building.

And, finally, I invite everyone here to participate in the development of vg. It is a fully functional, open-source graph genome implementation. It's been developed in a test-driven pattern that should make it very easy for other contributors to extend without breaking things. It is designed to be integrated as a library into other projects. The vg executable is itself a loosely coupled integration of various functions provided by the library, and also the primary point of testing. This could be a viable option if you'd prefer to work on a particular aspect of the problem (such as the GA4GH APIs) rather than develop the entire stack from scratch. I will be more than happy to support anyone who pursues this.

jeromekelleher commented 9 years ago

It's great to see that vg is progressing so well @ekg, and I'm certain that it's going to be a key tool in the future. I agree with the majority of what you're saying here, but I'm not sure it's very relevant to the original point of this issue (which is a very specific, technical issue regarding the reference implementation). Perhaps the data working group or the reference variation mailing lists would be a better venue for the points that you are making here?

@pgrosu, it's great to have input from a broad range of perspectives, but do you think you could keep your comments on these threads on topic please? These issues exist to help us organise an active software development project, and discussions of high-level topics disrupt this process.

ekg commented 9 years ago

@jeromekelleher point taken. Should we be starting new issues somewhere to discuss higher level topics?

jeromekelleher commented 9 years ago

Maybe an email to the data working group mailing list? As I say, I agree with the majority of what you are saying and you are making good points, so it would be good to have the discussion.

fnothaft commented 9 years ago

Maybe an email to the data working group mailing list? As I say, I agree with the majority of what you are saying and you are making good points, so it would be good to have the discussion.

+1

ekg commented 9 years ago

@jeromekelleher That sounds reasonable! Sorry for the noise.

pgrosu commented 9 years ago

@fnothaft, I understand that the reference server's purpose is to test the API implementation, but if the API implementation is not scalable then the core framework will need to be redesigned. Are small labs supposed to use this server in-house with not a lot of data, or would they populate it with gigantic datasets with multi-user access? If the in-house datasets are small, what meaningful results can they get? Can these servers connect with other servers/repositories for more informative analysis? The reason GA4GH was created is to share data effectively with and across many locations, and for this there is a whole domain of distributed approaches that scale appropriately for proper integration (thank you for mentioning the Spark implementation, which I was not aware of). Think of Beacon or Matchmaker connecting to hundreds of labs that run these servers and updating its results accordingly. It took me years to understand distributed algorithms/architectures and the work that goes into the proper planning, testing, and deployment of scalable solutions, where using standards was key. @ekg clarifies exactly the points I was trying to make - thank you Erik!

@jeromekelleher, sorry to be disruptive - that was not my intent - but several times (first time, second time) I have asked about getting signed up for the mailing list/conference calls, and nothing happened, so unfortunately GitHub discussions are the only way I can do this. This is one of the first times I have seen the discussion mention possibly distributing the Docker image (server+data) across many clusters, while we have yet to start discussing, planning, implementing, and testing scalability and interconnectivity features in the API. I keep bringing these perspectives up because I see us carving out a narrower path that veers ever farther from GA4GH's scalability and integration goals.

So I would be happy to post on the mailing list, but how can I get on it?

Thanks, Paul

diekhans commented 9 years ago

A lot of the issues discussed here relate to the discussion in: Inheritance in GA4GH schemas (#264)

These are issues related to design approaches and goals for GA4GH with Avro. There is a lot of good discussion here. Perhaps some of the comments people want carried forward can be added there.


pgrosu commented 9 years ago

@diekhans, I think you might have meant this link, for which I'm still drafting up an analysis:

https://github.com/ga4gh/schemas/issues/264

~p

diekhans commented 9 years ago

Yes, that was it... too many tickets!


pgrosu commented 9 years ago

I know the feeling - I've been through that experience many times in the past :)

benedictpaten commented 9 years ago

These are good points, Erik; worth starting an alternative thread.

I'd point out that there are several graph alignment schemes now in existence or aborning - none are likely to be using the GA4GH API directly for their data structures/algorithms.

I view the API we've defined as a not-crazy way of communicating our results in a common format and comparing them. It may well grow and change over time, but that is the initial aim. On the issue of SQLite: we certainly won't keep it around long term, but it is a pragmatic starting point for this project.


afirth commented 9 years ago

#350 should complete this. It includes #344 if anyone wants to build via that route. This dramatically reduces setup time, now under 5 minutes on OS X with decent network connectivity.