we need a few special case collection edit methods

mtholder commented 8 years ago

https://github.com/OpenTreeOfLife/germinator/issues/82 discusses some errors that cropped up in the context of multiple curators trying to add a study to the default synth collection. I originally thought that these were indicative a race condition, but that was an incorrect diagnosis.

The phylesystem-api for studies assumed that it would be pretty rare for multiple curators to be working on the same study at the same time, so we could tolerate the second curator's work being stranded on a work-in-progress branch until some manual fixes were performed on the server.

But the default synth collection (https://github.com/OpenTreeOfLife/collections-1/blob/master/collections-by-owner/opentreeoflife/default.json) is a single resource that multipe curators may be interested in. Particularly at workshops, when we are demonstrating to a group how to add a tree (and several people are playing along on their own laptops).

We also observed that users frequently want to just add a tree to the queue, and don't want to worry about the ranking.

That argues for an "add this tree" button. And the cleanest way to support that (and minimize chance of a change being stuck on a work in progress branch) seems like it would be a service in the API that:

takes a treeID, studyID pair
locks the collection repo
looks in the default synth collection for the study+tree pair, adding it to the end if it is not there.
unlocks the collections repo
returns a success or failure message to the client.

There would be no chance for the addition to end up on a work-in-progress branch in this workflow.

mtholder commented 8 years ago

Oh and the "few" in the title refers to the fact that we might want a "remove this study+tree" from the default list. and perhaps a "move this study+tree pair to a different collection".

I think that the add operation is the only one that is likely to be needed to assure smooth demos/workshops. removing or moving are probably rare enough, that I doubt we'll see lots of simultaneous usage.

jimallman commented 8 years ago

[Paraphrased from the original issue:] There might be simpler solutions using git's own configuration options.

I'm looking into clean/smudge filters, and whether we can configure them on GitHub. (We might be able to minimize these bugs by setting this up on the api server.)

I'm also taking a quick look at word-diff and other alternatives to the default line-wise behavior of git-diff.

jimallman commented 8 years ago

OK, after some exploration with git configuration, I'm almost ready to give up. @mtholder, I know you're familiar with git internals, so maybe you'll know whether this idea is a dead end.

I've worked out how to configure a custom diff-tool and merge-tool for a local repo, which in principle allows us to work in a JSON-aware way (parsing the JSON documents to resolve differences, instead of using git's dumb line-wise algorithm). But these only work "downstream" of the initial commit logic, once a merge conflict has been detected. I can't seem to change the linewise logic used during a commit, which is important in our most common use case: two different curators each add a tree, which git interprets as a trivial series of changes, instead of two distinct trees. Thoughts?

jimallman commented 8 years ago

Here's another thought on how to handle this requirement (easily adding a tree to synthesis): We keep exactly what we have, including the ^ot:candidateTreeForSynthesis list in Nexson and its checkboxes in the curation UI. (Note that my notes here assume the synthesis inputs are drawn from a nested (n-level) set of collections.)

When it's time to run synthesis, we simply create a temporary collection (by scanning the phylesystem data for preferred trees) and append it to the other synthesis inputs. To me, this makes sense for a few reasons:

The trees so marked are not in any particular order, so this makes as much sense (or more) as a normal collection.
We already capture the latest phylesystem SHA, so it's repeatable.
Synthesis will need to handle duplicate trees in any case, i.e. those appearing in more than one input collection, and possibly also marked as "preferred" in the curation app. Otherwise, we'll need to do some serious fiddling to prevent inclusion of a tree in more than one input collection. (I'm guessing it makes the most sense to use the tree in its earliest appearance and ignore later instances.)
This greatly simplifies the curation UI, since otherwise we'll need to:
- gather all trees currently queued for synthesis (probably a new API method);
- test an individual tree for inclusion in synthesis, probably returning its input collection(s)
- update these services whenever the main synth-collection (or any of its member collections) changes; in effect, probably whenever any collection is added, removed, or changed.
If and when someone adds a tree to a synthesis-input collection (the main collection, or one of its members, presumably n levels deep), this simply obviates the ^ot:candidateTreeForSynthesis flag. We might still add the services described above, so that we can indicate a tree that's already queued for synthesis by virtue of its inclusion in one of these collections.

kcranston commented 8 years ago

I like this last idea from @jimallman : it gets us what we want (an easy way for people to add trees to synthesis) with very little overhead. As part of synthesis, we can scan phylesystem and check curation status for proposed trees and add those to a collection in one operation.

mtholder commented 8 years ago

I don't have a problem with this (@jimallman's suggestion to use ^ot:candidateTreeForSynthesis) is an architectural solution.

We may face a bit of a social hurdle because (up until now) using the ^ot:candidateTreeForSynthesis was just a flag that a curator thought the tree was ready, and another step had to be taken to actually push the tree into the synth. So right now, I think that we have 1053 studies that have this property set to at least one tree. I'm not sure how many of those trees are actually good to go. Doubling our number of input trees in the next synth run may reveal scaling problems or a confusing mess because of issues with these trees. These issues could be hard to track down with a big jump in the number of trees.

But that is a short term issue.

Presumably we can fix any scaling issues, and untoggle the ^ot:candidateTreeForSynthesis flag for any tree that seems to be a problem.

kcranston commented 8 years ago

Longer term, I think we may want to add a validation step before someone can propose a tree for synthesis. Short term, we will almost certainly need a validation step before propinquity puts the proposed trees into synthesis.

jimallman commented 8 years ago

(up until now) using the ^ot:candidateTreeForSynthesis was just a flag that a curator thought the tree was ready, and another step had to be taken to actually push the tree into the synth.

Don't we have the same problem if the UI adds each tree to a collection for synthesis?

The current behavior is easy to maintain, just by treating these trees as candidates and expecting a curator to assign them to a proper synth collection.

mtholder commented 8 years ago

We do have the same problem if the UI just adds trees to a grab bag. The only difference is that the grab-bag (opentreeoflife/default collection) has just 4 trees rather than hundreds of trees.

As I say, I don't object to using ^ot:candidateTreeForSynthesis, but I do think we might run into some headaches initially.

I agree that the synth should just filter trees that appear multiple times (retaining only the first). Currently it would not matter (for the output tree) if the same tree shows up more than one time.

jimallman commented 8 years ago

As I say, I don't object to using ^ot:candidateTreeForSynthesis, but I do think we might run into some headaches initially.

Ah, because of the legacy data. Gotcha.

OpenTreeOfLife / phylesystem-api

we need a few special case collection edit methods #175