Execution environment - GitHub issues

skeenan commented 10 years ago

Proposed topic for ASHG: Execution environment (code touching the data). Where do we stand on actually processing the data? Currently we have a REST API, which is good for fetching and storing data. But when it comes to actually running analyses, there are a lot of approaches. It would be good to tease out the overall vision: how do we envision that working?

e.g. here is a proposed new API that, when given a readSet identifier, will call variants and do something with the output.
Perhaps the GA4GH GitHub should have more example code and client libraries, one of which would be a whole processing stack.
We need to figure out a mechanism for delegating to distributed execution engines so that there is some consistency between them.
Reproducibility, versioning, consistent namespacing, and how we execute against data.
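
As a rough illustration of the delegation idea above, here is a minimal Python sketch. Every name in it (`JobRequest`, `ExecutionEngine`, `LocalEngine`, `call_variants`, the `org.example.variant-caller` tool name and its version) is hypothetical and not part of any GA4GH API. It only shows how one call signature could stay consistent across different execution engines while pinning a tool version for reproducibility.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass(frozen=True)
class JobRequest:
    """What to run and on which data (all fields are illustrative)."""
    tool: str          # namespaced tool name
    tool_version: str  # pinned so a run can be reproduced later
    read_set_id: str   # identifier of the input readSet


class ExecutionEngine(Protocol):
    """Anything that can accept a job: a local runner, a cluster, Hadoop."""

    def submit(self, request: JobRequest) -> str:
        ...


class LocalEngine:
    """Toy engine that records jobs instead of running them."""

    def __init__(self) -> None:
        self.jobs: List[JobRequest] = []

    def submit(self, request: JobRequest) -> str:
        self.jobs.append(request)
        return f"local-job-{len(self.jobs)}"


def call_variants(read_set_id: str, engine: ExecutionEngine) -> str:
    """Hypothetical API entry point: delegate variant calling to an engine."""
    request = JobRequest("org.example.variant-caller", "1.0.0", read_set_id)
    return engine.submit(request)


engine = LocalEngine()
job_id = call_variants("readset-123", engine)
```

Swapping `LocalEngine` for another class with the same `submit` method would leave `call_variants` unchanged, which is the kind of consistency the point above asks for.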

max-biodatomics commented 10 years ago

BioDatomics is working on it now. There are several issues/questions:

How to distribute binaries?
How to provide compatibility across different operating systems?
How to handle tools that depend on mutually incompatible libraries?
How to provide security? If any user can upload and execute their binaries on other clusters, how do we prevent malicious code?
How to run both Hadoop-optimized tools and tools that are not Hadoop-optimized?
How to run distributed tools (those that require several machines for execution)?
How to specify the resources that are available to a tool?
Versioning?
Who will pay for resource usage, and how? Not many organizations will allow computationally intensive tasks to run on their systems without payment.
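
Several of the questions above (operating systems, resources, distributed tools, versioning) come down to what a tool description has to declare. A minimal Python sketch of that idea follows. The field names, the `can_run` helper, and the example tool are assumptions made for illustration only, not an existing format.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Resources:
    cpus: int
    memory_gb: int
    nodes: int = 1  # more than 1 for distributed tools


@dataclass(frozen=True)
class ToolDescriptor:
    name: str
    version: str                        # exact version, for reproducibility
    operating_systems: Tuple[str, ...]  # platforms the binaries were built for
    requires: Resources


def can_run(tool: ToolDescriptor, host_os: str, available: Resources) -> bool:
    """Return True if the host OS is supported and the resources suffice."""
    need = tool.requires
    return (
        host_os in tool.operating_systems
        and available.cpus >= need.cpus
        and available.memory_gb >= need.memory_gb
        and available.nodes >= need.nodes
    )


# Hypothetical distributed tool that needs two machines.
aligner = ToolDescriptor(
    name="example-aligner",
    version="2.1.0",
    operating_systems=("linux",),
    requires=Resources(cpus=8, memory_gb=32, nodes=2),
)
small_host = Resources(cpus=4, memory_gb=16)
cluster = Resources(cpus=64, memory_gb=256, nodes=4)
```

Here `can_run(aligner, "linux", cluster)` is true while `can_run(aligner, "linux", small_host)` is false. Security and payment are not addressed by a descriptor like this.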

So, there are a lot of questions. We have resolved only some of them ourselves and are looking for solutions to the others.

max-biodatomics commented 10 years ago

Will we discuss this topic at ASHG? I think it is important. It could simplify transferring workflows/tools from one platform to another. Developers would be able to create one package that runs on all platforms.

pgrosu commented 10 years ago

I think the following approaches should also be explored, as they can provide significant speedups and can work with Hadoop:

MPI approaches (MPICH, MVAPICH2, OpenMPI, etc.)
GPU processing via CUDA
Resource management via SLURM and others, which might provide additional improvements in load balancing and dynamic resource management.

Maybe later we can cover other topics as well, such as InfiniBand/RDMA/RoCE/etc.

tetron commented 10 years ago

There is a working group, formed out of the 2014 Bioinformatics Open Source Conference Codefest, that is developing a standard for describing bioinformatics computational workflows in a portable way. It includes representatives of a number of bioinformatics compute platforms, including Galaxy, Seven Bridges, Mobyle, and Curoverse Arvados. This seems like an excellent opportunity for collaboration, or possibly for bringing the common-workflow-language working group under the umbrella of GA4GH:

https://groups.google.com/forum/#!forum/common-workflow-language

max-biodatomics commented 10 years ago

It is a good start, but they are at an early stage, so it would probably be better to bring it to GA4GH. This needs to be discussed. The specifications should be developed by a broader group, as developers usually put into a specification the requirements they already support and deliberately ignore those they don't.

Several examples: The current initiative chose to support DAG (directed acyclic graph) workflows only. That is not flexible at all. We, for example, support loops, and our workflows cannot be described as a DAG with full functionality.
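
To make the DAG limitation concrete, here is a small Python sketch using the standard-library `graphlib` module (Python 3.9+). The step names and the `execution_order` helper are made up for illustration. A DAG-only scheduler can order a linear pipeline, but a workflow in which two steps feed each other has no valid ordering.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(workflow):
    """Return a valid step order, or None if the workflow contains a loop.

    `workflow` maps each step to the set of steps it depends on.
    """
    try:
        return list(TopologicalSorter(workflow).static_order())
    except CycleError:
        return None


# Linear pipeline: align -> call -> annotate (a DAG).
dag = {"align": set(), "call": {"align"}, "annotate": {"call"}}

# Iterative refinement: "refine" and "evaluate" depend on each other (a loop).
loop = {"refine": {"evaluate"}, "evaluate": {"refine"}}

dag_order = execution_order(dag)    # ['align', 'call', 'annotate']
loop_order = execution_order(loop)  # None: not expressible as a DAG
```

A bounded loop can sometimes be unrolled into a DAG, but a loop that runs until some condition is met cannot be, which is the case described above.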

They haven't specified a tool description yet. But Galaxy, for example, has used Cheetah, which works well only under Python. It doesn't even work on Jython, so it cannot be used efficiently in solutions based on Java, Ruby, and other languages.

None of the members of the BOSC community work with Hadoop. Is there any chance that the specifications will include integration of Hadoop-optimized tools?

The parameter specifications contain information on a file's checksum. Hadoop provides checksum control for all files automatically. If this parameter becomes obligatory, it will make Hadoop-based systems less efficient, because the checksum will be computed several times.
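
One way to read the concern above is that the checksum should be an optional field rather than an obligatory one. A minimal Python sketch of that, assuming a hypothetical `accept_input` helper and SHA-1 as the digest (neither is taken from the CWL draft):

```python
import hashlib
from typing import Optional


def sha1_hex(data: bytes) -> str:
    """Hex-encoded SHA-1 digest of the given bytes."""
    return hashlib.sha1(data).hexdigest()


def accept_input(data: bytes, checksum: Optional[str] = None) -> bool:
    """Verify the data only when the descriptor carries a checksum.

    A platform whose storage layer already verifies integrity can omit the
    field and skip the extra pass over the data.
    """
    if checksum is None:
        return True
    return sha1_hex(data) == checksum


reads = b"ACGTACGT"
ok_without_checksum = accept_input(reads)
ok_with_checksum = accept_input(reads, sha1_hex(reads))
rejected = accept_input(reads, sha1_hex(b"TTTT"))
```

With this shape, a system that already checks file integrity pays nothing, while other platforms can still request verification.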

I think standardizing the API for tools and workflows will benefit the whole community, and it should be done by the whole group instead of a few.

jmchilton commented 10 years ago

Just a quick correction on the tool description point - it is possible to get Jython to use Cheetah and evaluate at least a large subset of Galaxy tool files in a JVM environment - even within a much different execution framework and data storage and metadata model.

I did this for a service-oriented bioinformatics portal called TINT that I developed at my previous job. Notes I took (very dated) on building a Jython jar with Cheetah built in can be found here.

Not that anyone is going to run out and re-do this - but I felt the need to be "that guy" and set the record straight :). I would be really keen to see some pragmatic standards developed and hopefully Galaxy's successful tool format (probably minus the Cheetah piece) can serve as a useful data point.

max-biodatomics commented 10 years ago

Thanks jmchilton.

We originally tried to use the Galaxy format internally, but after some attempts we found that it did not fit our needs. We still have a structurally similar format, but it is not compatible with Galaxy. Galaxy is a great starting point but it has some issues:

It is missing elements that would make script creation almost automatic. As an example, with one such element we were able to regenerate the script automatically after parameters changed. It is also missing some useful parameter types.
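
As a sketch of what "almost automatic script creation" could look like, the Python below builds a command line from a declarative parameter list, so changing a parameter value regenerates the command without a hand-written wrapper. The `build_command` helper and the descriptor keys (`name`, `flag`, `type`) are assumptions for illustration. They are not the BioDatomics or Galaxy format.

```python
import shlex


def build_command(executable, parameters, values):
    """Build a shell command line from a declarative parameter list.

    `parameters` is a list of dicts with hypothetical keys:
    "name", "flag", and "type" ("value" or "switch").
    """
    args = [executable]
    for param in parameters:
        value = values.get(param["name"])
        if param["type"] == "switch":
            if value:  # switches are emitted only when turned on
                args.append(param["flag"])
        elif value is not None:  # unset value parameters are skipped
            args.extend([param["flag"], str(value)])
    return shlex.join(args)  # quotes arguments that need it


params = [
    {"name": "threads", "flag": "-t", "type": "value"},
    {"name": "input", "flag": "-i", "type": "value"},
    {"name": "verbose", "flag": "-v", "type": "switch"},
]
command = build_command(
    "example-tool",
    params,
    {"threads": 4, "input": "my reads.bam", "verbose": True},
)
# command == "example-tool -t 4 -i 'my reads.bam' -v"
```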

In Galaxy it is necessary to describe how to collect logs, and basically the developer needs to write a wrapper for log collection. This could be done in a universal way for almost all tools.

I could be wrong about Galaxy. I haven't followed its development for about one or two years.

In summary, we need to establish formats for distributing tools and workflows. The open questions are the ones I described in my earlier post.

The Common Workflow Language forum has covered some of these questions. But I think GA4GH should also have a standard for this.

max-biodatomics commented 10 years ago

ASHG is finished. As I understand it, the decision was made to create a new task team for Execution environment.

Let's move on to the next steps:

I think we need to create a new Google group for this team. Can someone from GA create it? I can do it, but I don't think it is a good idea for the group owner to be someone outside the GA management team.
After it is created, we need to set up a schedule for a weekly call.

awz commented 10 years ago

@max-biodatomics @massie I thought there was interest in building on the work of https://groups.google.com/forum/#!forum/common-workflow-language so we could just use that mailing list? I'm sure the owner of the group will happily transfer it to an "official" GA representative.

Matt - do you have any preferences? Are we going with the name: "Digests, Containers, Workflows"?

max-biodatomics commented 10 years ago

My suggestion is to move it to GA. The CWL group started recently and not much has been done there. There are some differences in the chosen implementation: they chose JSON as the descriptor, but the tool definition and workflow description could be done in Avro. (An Avro schema can be written in JSON or XML.) Using Avro for the whole execution part would make the GA standard more unified.
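
For readers unfamiliar with Avro, here is a sketch of what a tool description could look like as an Avro record schema in its JSON form, built as a Python dictionary and serialized with the standard `json` module. The record name, namespace, and fields are invented for illustration and are not taken from GA4GH or CWL.

```python
import json

# Hypothetical Avro record schema for a tool description.
tool_schema = {
    "type": "record",
    "name": "ToolDescription",
    "namespace": "org.example.execution",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "version", "type": "string"},
        # Command line as a list of arguments.
        {"name": "command", "type": {"type": "array", "items": "string"}},
        # Optional field: a union with "null" first and a null default.
        {"name": "checksum", "type": ["null", "string"], "default": None},
    ],
}

schema_json = json.dumps(tool_schema, indent=2)
field_names = [field["name"] for field in tool_schema["fields"]]
```

Because the schema itself is JSON, a JSON-based descriptor and an Avro-based one are closer than they may first appear. The difference is mainly that Avro adds typed records and generated bindings.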

In the CWL group everyone is pulling in their own direction, and everyone started from their own implementation. Selecting any specific existing solution would not be optimal, as each implementation has limitations.

I think the best way forward is to follow the example of the ReadTaskTeam, where the solution started from defining the format and the important elements. Each team can make suggestions, but adding them to the final version should be decided by voting. Only this approach provides a robust way to develop a format that benefits everyone instead of one particular team.

pgrosu commented 10 years ago

Hear, hear Max! :)

skeenan commented 10 years ago

Hi Max,

Preparations are ongoing behind the scenes to choreograph the instantiation of the 4 new task teams decided at ASHG; we’ll be getting to execution environment in turn. I will be working with the co-chairs to define the steps to bring this team online, and I happily take this post from you as a declaration of interest in participating in the team.

Regards,

Stephen

On 30 Oct 2014, at 17:08, MAX notifications@github.com wrote:

ASHG is finished. As I understand it, the decision was made to create a new task team for Execution environment.

Let's move on to the next steps:

I think we need to create a new Google group for this team. Can someone from GA create it? I can do it, but I don't think it is a good idea for the group owner to be someone outside the GA management team.

After it is created, we need to set up a schedule for a weekly call.

pgrosu commented 10 years ago

Max, exploring at the beginning helps, in my opinion, which might be why the CWL group seems to be looking in many directions now :)

delagoya commented 9 years ago

Closing for lack of comments in a few months. Open again if needed.

ga4gh / ga4gh-schemas

Execution environment #149