ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

clean up top-level site navigation (for docs and for code) #40

Closed dglazer closed 10 years ago

dglazer commented 10 years ago

It's too confusing to find, and to navigate between:

(Thanks @haussler for reporting.)

dglazer commented 10 years ago

A few more comments from the discussion group:

  • the top-level doc URL (ga4gh.github.io) isn't very memorable
  • it's not obvious that both code and doc exist (which leads people looking for doc to get stuck in and overwhelmed by code)
  • the APIs link at the top of the doc page does the right thing, but is easy to miss (given how quiet it is compared to the louder text in the body, which takes you to individual repos)

richarddurbin commented 10 years ago

I don't understand why we don't have just three directories, a top one with documents relevant to the whole Data working group, and a subdirectory for each task group with documents relevant to that task group: the Avro and the documentation. I would only make a subdirectory beyond those once there were more than 10-20 documents in a natural subclass.

That much flatter structure would look natural to most people I think.

Apologies for my luddite comments. Personally I managed to find the APIs but have never found the documentation, though I haven't looked hard.

Richard

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

dglazer commented 10 years ago

@richarddurbin , I don't love the extra clicks to get to the Avro, but I do like that we have a structure that is ready for growth -- we should be able to add in sample apps and reference implementations, with all the associated auto-build machinery, without moving stuff around. Open to ideas on how to get both, but I don't have any.

(And btw -- thanks to @massie and @cassiedoll for setting up what we have -- I think it's already working well for contributors, and only a few small changes away from working well for casual visitors also.)

lh3 commented 10 years ago

There are a few problems with the current structures:

  1. It is hard to find the relevant information unless you are very familiar with it.
  2. Task groups are disconnected. For example, Benedict wanted to include ga4gh.avdl, but it is in a separate git repository and cannot be referenced. For another example, there is a beacon.avdl in ReadTaskGroup, but there is a separate Beacon repository at the top level. After all, various task groups will be connected (VarRef task group is connecting to ReadTaskGroup). It makes more sense to put them in one repository by merging the others into ReadTaskGroup and then renaming the repository name.
  3. File naming is not appropriate because task groups are separated. For example, ga4gh.avdl in ReadTaskGroup is not quite right. It primarily describes the read store only.

richarddurbin commented 10 years ago

Hello David and everyone,

I can see David's points are good. Thanks also to @massie and @cassiedoll. Sorry, my previous email was unnecessarily intemperate. Maybe as an alternative we can just put direct links in the main text of the front page to the currently active Avro IDL specification pages and the top-level documents, including "how to"? As things develop with implementations etc., the links on the top page can change. This would make the top page more of a current menu to go to the current activities of the working group.

Richard


fnothaft commented 10 years ago

RE:

Task groups are disconnected. For example, Benedict wanted to include ga4gh.avdl, but it is in a separate git repository and cannot be referenced. For another example, there is a beacon.avdl in ReadTaskGroup, but there is a separate Beacon repository at the top level. After all, various task groups will be connected (VarRef task group is connecting to ReadTaskGroup). It makes more sense to put them in one repository by merging the others into ReadTaskGroup and then renaming the repository name.

I don't think the correct solution is to have one large repository. I think the preferable approach is to publish our artifacts (e.g., to maven), so that other teams can pull the artifacts from wherever they are published. It will be necessary to publish artifacts later anyways, so that external projects (Company X's implementation of the GA4GH reads API, etc) have a clear way to pull the schemas.

adamnovak commented 10 years ago

Unfortunately, for the use case of referencing .avdl schemas in one project from another, the standard Maven way of pulling dependency jars won't cut it. The dependent project needs the dependency's source for the .avdl file in order to build, and it needs to be available at a predictable relative filesystem path for the compiler to find it.

The easiest way to set that up would be via Git submodules, but that just copies the one project inside the other. If we went this route, I would suggest splitting the .avdl files into their own separate Git repository.

A more engineering-intensive way would be to implement the method described here: http://mail-archives.apache.org/mod_mbox/avro-user/201212.mbox/%3C97CDF5378BDCEA4FB609A50F0FD3599A833A9AF8@TUK2-EMSMBX3.intelius1.intelius.com%3E

Basically, the dependency (this project) needs to be modified as described at http://mail-archives.apache.org/mod_mbox/avro-user/201212.mbox/%3C97CDF5378BDCEA4FB609A50F0FD3599A833A8A29@TUK2-EMSMBX3.intelius1.intelius.com%3E in order to build and publish a source .jar containing just the Avro schemas. Then the dependent project gets some more Maven wizardry to pull down and extract that source .jar into a predictable place so that schemas can reference it.

I think the second approach is probably better, even though it requires a bunch of engineering in the dependent project. I can make an issue (or maybe a pull request?) for that in a bit.
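
To illustrate the "predictable relative filesystem path" constraint discussed above, here is a minimal Python sketch. It assumes that an Avro IDL import target is resolved relative to the directory of the importing file. The function name `resolve_import`, the file name `variants.avdl`, and the `ga4gh/` submodule directory are hypothetical and chosen only for illustration.

```python
import posixpath


def resolve_import(importing_file, import_target):
    """Resolve an import target relative to the importing file's directory.

    Assumption: the schema compiler looks up imports relative to the
    file that contains the import statement.
    """
    base_dir = posixpath.dirname(importing_file)
    return posixpath.normpath(posixpath.join(base_dir, import_target))


# Hypothetical dependent project that checks this repository out as a
# Git submodule named ga4gh/ next to its own schemas.
deep = resolve_import(
    "src/main/resources/avro/variants.avdl",
    "ga4gh/src/main/resources/avro/ga4gh.avdl",
)
flat = resolve_import("avro/variants.avdl", "ga4gh/ga4gh.avdl")

print(deep)  # path the compiler must find with the Maven-style layout
print(flat)  # path the compiler must find with a flattened layout
```

Whichever mechanism is used (submodule or extracted source .jar), the dependency's .avdl file has to end up at exactly the resolved path, which is why the layout of this repository matters to dependent projects.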


lh3 commented 10 years ago

Hmm.. this Maven thing looks overcomplicated to me, and it makes it harder to access and integrate schemas. What is the benefit of separating ReadTaskGroup, VarRef task group, beacon etc. into separate repositories? We can still achieve auto-build after merging them together. I am not sure what "ready for growth" means. Does it mean we will put sample apps and reference implementations inside ReadTaskGroup? I would rather put all schemas in one git repository, and put each app/implementation in a separate repository having the schema repository as a git submodule.

fnothaft commented 10 years ago

@lh3 Maven is generally not complicated, but as @adamnovak points out, it may not be a good decision for our specific case. I don't think anyone is suggesting that we put sample apps/reference implementations into the ReadTaskGroup repository/etc.; my implication is that we'll want to publish artifacts for public consumption at a later date, which is a somewhat orthogonal concern.

I'll let @massie chime in, as he may know a workaround for the issues @adamnovak points out. If a good workaround doesn't exist, then I agree that having a unified repository for all schemas is a good approach.

lh3 commented 10 years ago

@fnothaft Thanks for the reply. I agree we can wait to see if @massie has a cleaner solution.

Meanwhile, I am still curious about the benefit of separating task teams into separate repositories, and what @dglazer means by "ready for growth".

adamnovak commented 10 years ago

I thought the reason for the separate repositories was that we were going to produce two different APIs (or collections of APIs): one for people working with read data, and one for people working with variant data. You might want to pull in the variant API but not the read API if you're doing a genome-wide association study, for example. Or in some applications you would use both.


lh3 commented 10 years ago

Thanks for the clarification, @adamnovak. When our programs call a library, we are open to all its components even if we don't use them. I think the same can be applied in our case. If someone wants to develop APIs, they can pull all the schemas but only use the part of interest. Our schemas are small and unlikely to cause bloated code. With schemas put together, it is easier for users, and for us as well, to find them, to understand their relationships, to improve information exchange and to avoid inconsistencies (e.g. naming conflicts).

haussler commented 10 years ago

While we are at it, can somebody please put under

https://github.com/ga4gh/FileFormatsTaskTeam

a link to the file formats task team GitHub site

https://github.com/samtools/hts-specs

Thanks!!! -D


jmarshall commented 10 years ago

While we are at it, can somebody please put under [FileFormatsTaskTeam] a link to [hts-specs]

Good point -- I've put some basic placeholder text there now.

dglazer commented 10 years ago

Meanwhile, I am still curious about ... what @dglazer means by "ready for growth".

I've been assuming that we'll eventually want to include actual executable code (e.g. a reference implementation of a server, client libraries to make it easy to call the API in different environments, sample apps such as a genome browser). But that's an assumption; curious if others agree.

dglazer commented 10 years ago

I submitted a baby step at nav cleanup in the ga4gh.github.io repo -- ptal.

lh3 commented 10 years ago

I've been assuming that we'll eventually want to include actual executable code (e.g. a reference implementation of a server, client libraries to make it easy to call the API in different environments, sample apps such as a genome browser). But that's an assumption; curious if others agree.

I would prefer not to put schemas and implementations in the same repository. Implementations may frequently use schemas from more than one team. ReadTaskTeam is likely to use schemas from metaTeam. VarRefTeam is calling the ReadTaskTeam schema right now but doesn't know how to achieve that (#41). Having schemas together easily solves the problem with no negative effects so far as I am aware. The only issue is that we need to move files around, but we'd better do this sooner rather than later. What else prevents us from centralizing schemas?

I also reiterate the issue of schema inconsistency. In ReadTaskTeam schema, we use contigName and position, but in Beacon, we use chromosome and coordinate and have a new concept of referenceVersion. In ReadTaskTeam, we are thinking to adopt accession numbers in some way, but we are not talking about this in Beacon. Putting schemas together will make us more aware of such inconsistencies.
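
One rough way to surface naming differences like these once the schemas sit side by side is to list the field names that only one schema uses. The Python sketch below is only an illustration. The field names come from the comment above, while the grouping keys `reads` and `beacon` and the function name `fields_missing_elsewhere` are assumptions, and a unique name is a prompt for review rather than proof of an inconsistency.

```python
def fields_missing_elsewhere(schemas):
    """Map each schema name to the field names that no other schema uses."""
    report = {}
    for name, fields in schemas.items():
        others = set()
        for other_name, other_fields in schemas.items():
            if other_name != name:
                others |= set(other_fields)
        report[name] = sorted(set(fields) - others)
    return report


# Field names are taken from the discussion; which record each one
# belongs to is an assumption made for this example.
schemas = {
    "reads": ["contigName", "position"],
    "beacon": ["chromosome", "coordinate", "referenceVersion"],
}

for schema_name, unique_fields in fields_missing_elsewhere(schemas).items():
    print(schema_name, unique_fields)
```

Here every field is unique to its schema, which matches the observation that the two schemas describe a genomic location with entirely different vocabulary.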

fnothaft commented 10 years ago

I am +1 to @lh3's comments about separating schemas and implementations. FWIW, in our work at Berkeley, we've found it preferable to both separate the schema and implementation, and to have the schemas for different data types (e.g., reads, variants, references) in the same repository, as there is significant overlap between data types.

dglazer commented 10 years ago

@lh3 and @fnothaft -- the idea of merging schema files into one directory / repo is growing on me, for all the reasons you cite. It will make it a little trickier to coordinate edits from many different task teams who aren't talking regularly, but as @lh3 points out we're going to have to pay that coordination price at some point anyway. I'd still really like to hear from @massie on the thinking behind his initial setup, but unless he has new arguments, I'd be +1 for a change that put all the schemas in one place.

@lh3, re schema inconsistency -- agreed. I submitted #43 as a baby step (on file naming); please +1 it if you agree.

lh3 commented 10 years ago

It will make it a little trickier to coordinate edits from many different task teams who aren't talking regularly

I think this should be fine as long as each task team mostly modifies the few schema files relevant to its own team. This way commits from different task teams won't cause conflicts. We might also encourage tagging issues by task team to alleviate the problem of a long and mixed issue list. EDIT: for example, put a ReadTaskTeam tag on an issue to indicate that the issue is primarily related to ReadTaskTeam.

fnothaft commented 10 years ago

@dglazer realistically, if we put all the schemas in a single repository, it forces coordination. It's better to pay the price up front, instead of diverging and needing to force convergence/coordination later.

haussler commented 10 years ago

+1 for putting all schemas in a single repository.


haussler commented 10 years ago

I'm ccing people from other task teams and driver projects. I hope everybody is on board with putting all our schemas in a single repository.


richarddurbin commented 10 years ago

+1 for putting all schemas in a single repository.


cassiedoll commented 10 years ago

It looks like everyone is in agreement here on ending up with:

  • one "schemas" repo in ga4gh which will contain all the avdl files.
  • we'll use labels to keep the issues in this new repo separated by the various task teams.

So unless there are any objections I can take the following steps to resolve this:

  1. delete all the empty repos (the ones which were never used like MetaDataTaskTeam and PCAP)
  2. rename this repo to "schemas" to preserve all of its history
  3. make 2 new labels "ReadsTaskTeam" and "RefVariationTaskTeam" (the active teams)
  4. tag all of our issues as "ReadsTaskTeam"
  5. the variants team can then move their code and issues into the schemas repo at their leisure. (there are only 5 issues and they are all pull requests so this should be straightforward)

I'll plan to take the first 4 steps this afternoon (everything listed is still reversible at this point)

Please shout if there are objections!

adamnovak commented 10 years ago

You might want to flatten the repository hierarchy a bit while you're at it. If other people want to write schemas based on our schemas, they'd want to pull this in as a Git submodule inside their own Avro schemas directory, and it would help for the path names they need to import to be as short as possible.

-Adam

cassiedoll commented 10 years ago

Steps 1-4 don't involve moving code. Let's get this first part done and then we can have a pull request for collapsing the hierarchy (which I'm also in favor of)

lh3 commented 10 years ago

+1 to steps 1-4. I agree with @adamnovak, but let's resolve 1-4 first as @cassiedoll suggested.

cassiedoll commented 10 years ago

2-4 are done. For 1 - I've submitted #45 so that the FileFormatsTaskTeam repo can be removed (the only commit is the readme)

The Beacon repo (3 markdown files) and the RefVariationTaskTeam repo (many files) will both need merging by someone on those teams.

Separately, our readme should be improved to include all task team descriptions. I'll leave that to someone else who wants to play with words :)

massie commented 10 years ago

I'm +1 for having all the schemas in a single repository, keeping the API separate from implementations and publishing artifacts to Maven.

Another advantage of a single repo is that we'll be able to test all the schemas together using Travis in order to make sure that schema dependencies are never broken.

One thing to keep in mind -- for now we should focus on publishing Java artifacts via Maven, but we'll likely want to publish a C/C++ library as well.
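
As a rough illustration of the kind of check a CI job could run over a single schemas repo, the Python sketch below walks a directory and reports Avro IDL imports whose target file is missing. It is an assumption-laden sketch, not the project's actual Travis setup: the function name `find_broken_imports` is hypothetical, imports are assumed to resolve relative to the importing file, and the regular expression does not skip imports that appear inside comments.

```python
import os
import re
import tempfile

# Matches statements such as: import idl "common.avdl";
IMPORT_RE = re.compile(r'import\s+(?:idl|protocol|schema)\s+"([^"]+)"\s*;')


def find_broken_imports(root):
    """Return sorted (file, import target) pairs whose target does not exist."""
    broken = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not filename.endswith(".avdl"):
                continue
            path = os.path.join(dirpath, filename)
            with open(path, encoding="utf-8") as handle:
                text = handle.read()
            for target in IMPORT_RE.findall(text):
                # Assumption: targets resolve relative to the importing file.
                if not os.path.isfile(os.path.join(dirpath, target)):
                    broken.append((os.path.relpath(path, root), target))
    return sorted(broken)


# Throwaway tree with one satisfied import and one dangling import.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "common.avdl"), "w", encoding="utf-8") as f:
        f.write("protocol Common {}\n")
    with open(os.path.join(root, "reads.avdl"), "w", encoding="utf-8") as f:
        f.write('protocol Reads { import idl "common.avdl"; import idl "beacon.avdl"; }\n')
    print(find_broken_imports(root))
```

A build could fail whenever the returned list is non-empty. Compiling every schema with the Avro tools would be a stronger check, since it also catches type errors that a path lookup cannot.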

cassiedoll commented 10 years ago

Excellent. @benedictpaten is going to propose a pull request for merging the variation work over (part of #41)

After that I think there are 2 cleanup tasks left:

And someone should probably file a pull request for:

lh3 commented 10 years ago

+1 to "moving the avdl files up a couple of directories".

benedictpaten commented 10 years ago

+1. What about making the avro/ directory somewhat hierarchical, to make it easier to distinguish the refVar (for example) graph schemas from the read group schemas?


cassiedoll commented 10 years ago

FYI: This issue now tracks the deletion of the RefVariationTaskTeam repo: https://github.com/ga4gh/RefVariationTaskTeam/issues/8

lh3 commented 10 years ago

What about making the avro/ directory somewhat hierarchical, to make it easier to distinguish the refVar (for example) graph schemas from the read group schemas?

I mildly prefer to put all schema files in the same directory. This seems cleaner and easier for outsiders who don't know the differences between task teams. Nonetheless, I don't have a strong opinion. I am happy to go with whatever the consensus is.

adamnovak commented 10 years ago

+1 on separating the different APIs to some extent. We probably want a couple of different .avdl files for the reference variation API (we have a whole hierarchy system to implement that's not in there yet), and it would be good to be able to see at a glance what .avdls belong to what API.

I would want /schemas/reads, /schemas/refvar, and so on, if possible.

dglazer commented 10 years ago

Re directory depth -- for those who weren't on today's call -- @massie made a case for keeping the current structure (/schemas/src/main/resources/avro) -- I'll let him repeat it here.

Re sub-directories by sub-APIs -- I'm +0; it feels slightly tidier, but I'd also be okay waiting until we have a dozen or so files before worrying about it. (Note that we'll also want to keep some files in the root, and/or add a /common folder.)

benedictpaten commented 10 years ago

Re directory depth -- for those who weren't on today's call -- @massie made a case for keeping the current structure (/schemas/src/main/resources/avro) -- I'll let him repeat it here.

I don't think it's a big deal, provided the API website is easy to find.

Re sub-directories by sub-APIs -- I'm +0; it feels slightly tidier, but I'd also be okay waiting until we have a dozen or so files before worrying about it. (Note that we'll also want to keep some files in the root, and/or add a /common folder.)

Yeah, I agree. I think we'll combine the referenceVariation schema files into one file for now, like the reads group. We need to start simple and make it easy for people to get the info; later we can split things up hierarchically.

delagoya commented 10 years ago

two comments:

  1. re source avro files: keeping them in /schemas/src/main/resources/avro is a Maven convention, but it is easily changed in the pom.xml file if there is an overriding need to include this repository as a git submodule in other projects. My vote would be to keep the current structure until we have to cross the repo-integration bridge.
  2. -1 on project subdirs. I think a single dir is good enough for our purposes

dglazer commented 10 years ago

Closing -- per Wednesday's call people are comfortable with the current state of site nav, so the original goal of this issue has been accomplished. If anyone has specific further changes they'd like, let's open them up as new issues (or better yet new pull requests).