Discuss project governance, relationships upstream and down

heuermh commented 6 years ago

@tomwhite @fnothaft @ryan-williams @jacarey @tfenne @lbergelson @cmnbroad @droazen @magicDGS @vdauwera @cseed

Sorry to @-mention you all here on this issue, but unfortunately I only know some of you by your github handles.

I would like to invite you all to attend, in person or virtually, the OpenBio Winter Codefest 2018 on Thursday Jan 18th and Friday Jan 19th in Boston, MA.

https://www.open-bio.org/wiki/Codefest_Winter_2018

Hadoop-BAM is an upstream dependency shared by GATK4 and ADAM (and possibly also Hail?) and is in need of clarification around project governance, similar to the process recently undertaken by htsjdk (https://github.com/samtools/htsjdk/issues/871). I see the Codefest as a good opportunity to discuss this and then branch out into other areas of possible collaboration between the various projects.

One possible way forward would be to incubate a new project at the Apache Software Foundation, starting from Hadoop-BAM, extending out into bdg-formats, bdg-utils, and ADAM on our side and common/utility code extracted from GATK4 and Hail, and perhaps even upstream into htsjdk. We welcome friendly competition on algorithms and analyses, but there is no reason to duplicate effort on the underlying technology stack.

Please feel free to discuss here on this issue or in the Codefest Gitter chat room. We can refine an agenda on the Codefest shared project ideas doc. Hope to meet some of you in Boston!

ryan-williams commented 6 years ago

I will coincidentally be in Boston already that day, so I will definitely go to the workshop. Looking forward to chatting with folks there, thanks for mentioning it @heuermh!

kheljanko commented 6 years ago

Hi,

I am Keijo Heljanko, Associate Professor at Aalto University. I originally funded the Hadoop-BAM project (coded by Matti Niemenmaa and Andre Schumacher) at Aalto University, and we deliberately released it under the MIT licence to make it available to the maximum number of Hadoop- and Spark-based NGS processing pipelines. An Apache licence would probably have been even better, but I did not understand that at the time; I, and I think all the developers, wanted a licence that allows maximum flexibility in how different projects use the Hadoop-BAM library.

I am based in Helsinki, Finland, so I will not be able to join you in Boston, but I would love to be involved in developing Hadoop-BAM further. In fact, my PhD student Ilari Maarala created another NGS pipeline using Hadoop-BAM as base technology, which just got published:

Altti Ilari Maarala, Zurab Bzhalava, Joakim Dillner, Keijo Heljanko, Davit Bzhalava. Bioinformatics, btx702, https://doi.org/10.1093/bioinformatics/btx702

We are currently also working on a Spark-based pan-genomics pipeline, which will eventually require new file formats that it would make sense for Hadoop-BAM to support.

We would love to take part in discussions on how to improve Hadoop-BAM and related projects and techniques. Having a common codebase in an Apache incubation project sounds great! Please keep me in the loop, I can be reached by email or by Skype at "keijo.heljanko".

Keijo Heljanko

kheljanko commented 6 years ago

Oh, and Ilari's paper is called:

"ViraPipe: scalable parallel pipeline for viral metagenome analysis from next generation sequencing reads"

Here is the GitHub page for the ViraPipe project:

https://github.com/NGSeq/ViraPipe

magicDGS commented 6 years ago

First of all, thanks for including me in the discussion of the Hadoop-BAM governance. I won't be able to join the Codefest, but I would love to contribute to the project in whatever way possible as a downstream API user.

I like the idea of extracting common utilities from downstream projects into Hadoop-BAM, to have a common framework for working with HDFS and other org.apache.hadoop.fs.FileSystem implementations. In addition, I have other ideas that might improve usability and the relationship with other projects:

Make releases that always track up-to-date upstream projects (mostly HTSJDK, which is in active development, but also the latest Hadoop).
Relationship with HTSJDK: if an interface-based HTSJDK, as proposed in https://github.com/samtools/htsjdk/issues/520 (and other issues), is done at some point, make Hadoop-BAM as agnostic as possible to minor releases of HTSJDK. This would make it possible to use Hadoop-BAM with htsjdk versions > 3, > 4, etc.

I am probably the least familiar with the codebase, but I am interested in learning more about it and contributing as much as possible.

kheljanko commented 6 years ago

Hi,

Also, having a list of downstream projects using Hadoop-BAM, with contacts from each, would be quite useful for discussing the future of libraries handling genomics file formats, and for contacting people potentially working on the same issues.

heuermh commented 6 years ago

As far as I can tell, the list of Winter Codefest attendees has strong representation from the Broad for Cromwell but less so for GATK, and none for htsjdk or Hail. I am available to visit the Broad while in Boston, if that would help. The first part of collaborating is showing up!

@ryan-williams Looking forward to finally meeting you in person!

@kheljanko When I refer to project governance, I'm specifically thinking about the software license, copyright assignment, code of conduct, a Contributor License Agreement (CLA) process, GitHub project administration, documentation hosting, the release process, code signing keys, evolutionary vs. revolutionary changes, etc. We have a lot of this in place with the Big Data Genomics organization and can share our experience.

@magicDGS Thanks for joining the conversation here! I've spent many hours working around problems with htsjdk and have been frustrated by that project's inability to consider revolutionary changes. I hope that whatever process and governance we can put into place here, expanding the scope as necessary (whether through the Apache incubation process or otherwise), will provide a way forward for those changes to happen.

lbergelson commented 6 years ago

@heuermh I may be able to attend the codefest, but I'm not certain yet. We've been pretty busy preparing the gatk4 release and subsequently following up on issues exposed through the launch.

We'd definitely like to avoid duplicating effort, so anything that helps avoid fragmentation and duplication would be great. I'm a little worried that creating an Apache project to combine the "best of" java genomics might end up increasing fragmentation and duplication rather than reducing it. What would the benefits be of an Apache project? Naively, I'm afraid that it might introduce either a lot more process overhead or a mandatory requirement to use JIRA, both of which I would like to avoid. We'd also prefer to keep an MIT or BSD license rather than switch to Apache, for reasons that are out of my control.

I'm sorry you've had so much trouble with htsjdk. Htsjdk has been very neglected recently, but we're planning on investing a lot more effort into it in the near future. We're in the planning stage of a major revamp which should begin to fix some of the underlying issues, but we can only fix the problems we know about. It would be great if you could file issues describing the problems you've run into.

heuermh commented 6 years ago

@lbergelson Thank you for the reply, this is exactly the conversation I'd like to have. If you aren't able to attend the Codefest, perhaps we might be able to find some other time while I'm out in Boston.

tomwhite commented 6 years ago

Just getting back after being on leave for a month... Glad to see this being discussed. Are there any outcomes or discussions from the Codefest that you can share here @heuermh?

@kheljanko thanks for sharing the information about the virus paper - sounds interesting!

tomwhite commented 6 years ago

I've written a bit about a new Spark API on this ticket: #196. I've now added a page describing the scope and features of the new API. There's also a bit about a home and governance, which it would be good to discuss more.

@cmnbroad, @droazen, @fnothaft, @heuermh, @kheljanko, @lbergelson, @magicDGS, @ryan-williams (and others who are interested) - I'd like to propose an online meeting. If you are interested in participating, please fill in this Doodle poll to select a day: https://doodle.com/poll/fikirnp8wwsh6bfa

Thanks!

magicDGS commented 6 years ago

I'd love to join the discussion, but unfortunately I will be on vacation next week. I'm looking forward to seeing the result of the meeting here, so please keep us updated on this thread. Thanks!

tomwhite commented 6 years ago

Thanks to everyone who responded to the Doodle poll! The result is here: https://doodle.com/poll/fikirnp8wwsh6bfa, Apr 10, 2018 at 5pm UK time (GMT+1). Unfortunately, there wasn't a slot that suited everyone, so @magicDGS and @kheljanko won't be able to make it - please let me know if you have anything you'd like to relay to the meeting.

Here's the hangout link: https://meet.google.com/amb-zxti-qwe

heuermh commented 6 years ago

I would like to:

Extract BGZF and related non-biology functionality from htsjdk and donate it to Apache Commons-IO or another general purpose library.
Rewrite core htsjdk domain classes for API correctness, style, and immutability. I have an approach here. There are others.
Evolve htsjdk into a reference implementation of the various specifications, or develop new reference implementation(s). https://github.com/samtools/htsjdk/commit/d181e9ea3415c956a0aeb100d640bc63731b138a
Combine the best of the various approaches to distributed reading/writing native bioinformatics formats (Hadoop-BAM, spark-bam, this branch, ADAM + bdg-utils, GATK4, Hail, Google Nucleus, etc.) into a single library or set of libraries.
Convert between domain models using a common library. I have an approach here, with biojava & ga4gh and other conversions. ADAM converts between htsjdk ↔ bdg-formats.
Collaborate successfully on these shared upstream dependencies, and compete on downstream applications/analyses.

tomwhite commented 6 years ago

V short summary of the meeting. There were three technical areas (raised by Frank):

Better split picking (from spark-bam)
htsjdk compatibility and generalization (e.g. more efficient representations for variants)
Spark RDDs vs datasets/dataframes (datasets are more efficient)

There was general agreement that all of these are in scope for a new project. There may need to be some phasing - e.g. have RDD implementations with existing htsjdk classes, and add others (e.g. dataset) in the future.

After discussing governance and hosting, the next steps are (summarized by Ryan):

New code goes in a new GitHub repo under the samtools org
MIT-licensed
Credit to the lineage of the ideas and code (Hadoop-BAM, spark-bam)
Find a new name

Regarding naming, in the meeting a couple of names were suggested:

dist-bio
hts-sharded

I'd also like to put forward the following (in the Spark sequencing vein):

squark
speeq

I said I would send out a poll for the name. If you have any ideas or suggestions, please post them here so I can include them in the vote.

heuermh commented 6 years ago

Also

Implement directly on Spark Datasource v2 APIs

I wouldn't say that any decisions have been made with regards to hosting under the samtools organization or that the software license should be MIT. Part of the trouble with not having project governance is that no process is in place to make decisions.

How about this for a proposal:

1. Create a new repository, unaffiliated with any GitHub organization, initialized with the MIT license.
2. Give write permissions to anyone on this thread, on the call, or otherwise interested.
3. Establish a process to make decisions. I suggest using the Apache Voting Process, where everyone from step 2 is initially given a binding vote.
4. Create issues/pull requests for various discussion topics: GitHub organization, repository name, domain name, Maven coordinates (groupId, artifactId), software license, copyright assignment, code of conduct, Contributor License Agreement (CLA), GitHub project administration, documentation hosting, the release process, code signing keys, etc.
5. As discussion settles on any given topic, call for a vote. Vote passes with at least three +1s and no -1s.
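
To make the proposed vote rule concrete (at least three +1s and no -1s), here is a minimal Python sketch. The function name `vote_passes`, the placeholder voter names, and the representation of votes as a mapping from voter to +1, 0, or -1 are assumptions made for illustration only. The sketch also assumes every vote in the mapping is binding, and it is not part of the Apache Voting Process or of any existing tool.

```python
from typing import Mapping


def vote_passes(votes: Mapping[str, int], min_plus_ones: int = 3) -> bool:
    """Return True if a vote passes under the rule proposed above.

    Hypothetical helper: `votes` maps a voter's name to +1 (approve),
    0 (abstain), or -1 (veto). All votes are assumed to be binding.
    """
    values = list(votes.values())
    if any(v not in (-1, 0, 1) for v in values):
        raise ValueError("each vote must be -1, 0, or +1")
    plus_ones = sum(1 for v in values if v == 1)
    minus_ones = sum(1 for v in values if v == -1)
    # Passes only with enough +1s and not a single -1.
    return plus_ones >= min_plus_ones and minus_ones == 0


# Placeholder voters, not real participants in this thread.
print(vote_passes({"alice": 1, "bob": 1, "carol": 1, "dave": 0}))   # True
print(vote_passes({"alice": 1, "bob": 1, "carol": 1, "dave": -1}))  # False
```

Under this reading, abstentions (0) neither help nor block a vote, and a single -1 blocks it regardless of how many +1s there are.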

I will help with any or all of these.

magicDGS commented 6 years ago

My opinion about some points raised by @heuermh and the summary by @tomwhite:

HTSJDK-related changes: I agree with you about moving non-biological code out of HTSJDK and moving to a proper API-based library (and thus to v3). But I believe this effort should be handled at the htsjdk level (and under its governance). There are several discussions about it, and I also have several proposals (check https://github.com/samtools/htsjdk/pull/928, https://github.com/samtools/htsjdk/pull/985, https://github.com/samtools/htsjdk/issues/896, https://github.com/samtools/htsjdk/issues/520, https://github.com/broadinstitute/gatk/issues/4340)
I agree with a new project combining the best from other projects, but I think it is quite important to set a time frame for having a functional library that provides the same full support as the current Hadoop-BAM. Otherwise, the effort of maintaining both this repository and the new one would delay getting new code and fixes into downstream software.
I agree with a proper voting system to decide most of the governance stuff. Nevertheless, I disagree with starting a new repository before deciding some basics that can be discussed on this thread (or a different one). From my point of view, the basics to settle before starting the new repository are the project name and GitHub affiliation. Otherwise, the project might change ownership and link before it has even started, making it difficult to track for people not involved in the governance discussion. I do agree that once the repo exists with the new name (and under either samtools, HadoopGenomics or a new organization), discussion can start in different issues/PRs for license, etc.
Moving to a different project sounds like a nice way to fix things from scratch, but what does that mean for the fate of this project? If this is going to become an archived project, maybe others not included in this issue should be included in the voting process. The impact on downstream projects should also be evaluated...

Thus, my proposal is a bit different than the one from @heuermh:

Create a poll for name and affiliation
Create the new repository as decided
Start discussion in issues/PRs on that repository about the remaining topics: the process for making decisions is the most important, but the others as well.
Finally, decide the new design and what is needed in upstream projects (e.g., HTSJDK changes)

tomwhite commented 6 years ago

It seems like I may have misunderstood what was agreed (if only tentatively) in the meeting regarding github org (samtools) and license (MIT). In the meantime, here’s the code I’ve been working on (temporary location): https://github.com/tomwhite/squark.

@magicDGS regarding the fate of this project - it still needs to be maintained at least until any replacement has a superset of functionality. I plan to do another Hadoop-BAM release next week with a few changes.

heuermh commented 6 years ago

I've created a nameless Genomics on Apache Spark organization and repository https://github.com/nameless-gos/nameless

I'll flesh out the issues there with details from this issue and the email thread from the meeting.

tomwhite commented 6 years ago

I'd like to organise another meeting to see if we can make some progress on the new project. I've created a poll at https://doodle.com/poll/k9hc9sgbf9uhue7i to find a time. Please select which times you can make if you are interested in attending. Thanks.

fnothaft commented 6 years ago

Hi @tomwhite! I would love to join, but can't make any of the dates due to travel and other commitments. Can we do a time in June?

tomwhite commented 6 years ago

@fnothaft Sure - I was thinking of running a monthly meeting at least while we get things set up so I'll add a new poll for June - and go ahead with a meeting this month too (probably next week now).

tomwhite commented 6 years ago

Thanks everyone who responded to the poll! The result is here: https://doodle.com/poll/k9hc9sgbf9uhue7i. The time of the meeting is 5pm (GMT+1) on Tue May 22. Here's the hangout link: https://hangouts.google.com/hangouts/_/calendar/dG9tLmUud2hpdGVAZ21haWwuY29t.7grvb1g73s1svncmcg840jt6ee?authuser=0.

tomwhite commented 6 years ago

Thanks to everyone who attended the meeting. There was a desire expressed to work together on foundational projects to avoid duplication, starting with a more tightly-scoped focus than perhaps before - i.e. a Spark-native Hadoop-BAM project (Squark). We have actually been working on Hadoop-BAM together over the last few years, so it would be worth seeing how we can continue that with a better defined governance model.

Actions: please comment on the governance issues here: https://github.com/nameless-gos/nameless/issues.

heuermh commented 6 years ago

Note that the GitHub organization and repository have changed; the new link is: https://github.com/disq-bio/disq/

heuermh commented 5 years ago

@tomwhite and I would like to submit an abstract on Disq to the BOSC 2019 conference.

Please consider adding your contact information and author affiliation to the following shared doc

https://docs.google.com/document/d/1by-YA5FQra8CyqMHOwa6278fNM0fZpD8PDD5XxLOIPM/edit?usp=sharing

cmnbroad commented 5 years ago

I took the liberty of adding @lbergelson in his absence, since he probably won't see this in time.

heuermh commented 3 years ago

Sorry for the ping, closing an old issue.

HadoopGenomics / Hadoop-BAM

Discuss project governance, relationships upstream and down #180