arborworkflows / ArborWebApps

A bundle of Tangelo applications used by NSF Arbor (Phylogenetic Comparative Methods system)
Apache License 2.0

Do not suppress flagged or synonym taxa #73

Open jar398 opened 10 years ago

jar398 commented 10 years ago

I found the following comment in the web application at https://arbor.kitware.com/:

    # if this match candidate is a non-taxon, then the flags attribute will have an entry like 'SIBLING_LOWER',
    # so we will filter out any of these and only match on empty flags.  Also if there are synonym returns, filter them out
    # and pick the main match instead. We add the returned information into a new row and accumulate the rows for an output table.

(I wish I could tell you where, but your web pages do not have their own URLs, so I can't link to it! You might want to have a look at http://www.w3.org/TR/webarch/ for received wisdom on when and how to use URLs.)

Anyhow there are two errors here. One is that the flags are almost all benign and should not be used for filtering. All the taxa that taxomachine keeps are perfectly fine taxa. The flags include 'edited' which just means that an ad hoc patch has been applied at that point, 'extinct' which just means extinct, and so on. There are flags for 'incertae sedis' but just because something's incertae sedis doesn't diminish its legitimacy; currently these are discarded for synthesis, but that's more a limitation of treemachine than any criticism of the taxon. Better to let treemachine do its own filtering rather than try to take care of it yourself. The example you give (sibling_lower) just means that the taxon has a sibling of a different rank, and if used as a filter will remove huge numbers of important taxa, such as subfamily Homininae.

I recommend not filtering by flag.

The other problem is the treatment of synonyms. It may be true that if a name matches both a synonym and a non-synonym, then you might want to prefer the non-synonym, but it would be better if the ambiguity were resolved in the same way as any other homonym situation (maybe manually) - it may be that the synonym is after all the right choice. And certainly if the only match is to a synonym, it's very likely that the unique recovered OTT id is the correct match for that name. There are hundreds of thousands of synonyms reflecting revisions of classifications, and they are perfectly legitimate and useful.
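As a rough illustration of the two points above, here is a minimal sketch that keeps flagged taxa and synonyms as candidates and treats multiple candidates like any other homonym situation. The record layout (`matched_name`, `ott_id`, `flags`, `is_synonym`), the function names, and the OTT ids are assumptions made up for the example; they are not taken from the Arbor script or from the real TNRS response format.

```python
def filter_like_quoted_comment(matches):
    """Behaviour described in the quoted code comment, shown only for contrast:
    drop every match that carries a flag or is a synonym."""
    return [m for m in matches if not m["flags"] and not m["is_synonym"]]


def resolve_name(matches):
    """Return (ott_id, needs_review) without filtering on flags or synonymy.

    A single candidate id is accepted even if it is flagged or a synonym.
    Zero or several candidate ids are left for review, as with any homonym.
    """
    ott_ids = {m["ott_id"] for m in matches}
    if len(ott_ids) == 1:
        return ott_ids.pop(), False
    return None, True


# Hypothetical records; the ids are invented for the example.
homininae = [
    {"matched_name": "Homininae", "ott_id": 101,
     "flags": ["SIBLING_LOWER"], "is_synonym": False},
]
synonym_only = [
    {"matched_name": "Oldname exampli", "ott_id": 202,
     "flags": [], "is_synonym": True},
]
ambiguous = [
    {"matched_name": "Samename exampli", "ott_id": 303,
     "flags": [], "is_synonym": False},
    {"matched_name": "Samename exampli", "ott_id": 404,
     "flags": [], "is_synonym": True},
]

print(filter_like_quoted_comment(homininae))  # the flagged taxon is lost
print(resolve_name(homininae))
print(resolve_name(synonym_only))
print(resolve_name(ambiguous))
```

With the flag and synonym filter, both the flagged subfamily and the synonym-only name would come back unmatched; without it, each resolves to its single id, and only the genuinely ambiguous name is held back.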

Best Jonathan

curtislisle commented 10 years ago

Jonathan, Thanks for your explanation. I had expected these Arbor modules to be placeholders to start out the Hackathon week and that understanding of the TNRS would be refined during the week. This didn’t really happen, primarily because I didn’t force this to be investigated. One of the traitathon team was going to investigate and make recommendations for me.

I agree on both your points: (1) flags should not be used as a filter and (2) ignoring synonyms is not a correct thing to do. I realize there are situations where one of the synonyms would be the better choice.

The Arbor method we are discussing, “Lookup Names from OpenTree Taxonomy”, is being deprecated by me in favor of a “first return” simple solution, which doesn’t use flags and selects the first return. Then we plan to develop a more configurable method, which does allow the selection of synonyms.

I will likely get in touch again as we develop the way to better utilize the TNRS work done in OpenTree. I hope you had a good time at the Hackathon! Take care!

Curt

(In reply to Jonathan A Rees's comment of Sep 22, 2014, 9:25 AM, above.)

chinchliff commented 9 years ago

@curtislisle I just left a note in the script for the 'first return' solution you mention above. That solution is very likely to accept bad results. For instance, anytime the user puts in a species name that is not recognized in OTT, it will be replaced by some incorrect name that the TNRS thinks is similar. I don't think this is what users would want or expect. I think for this kind of simple solution, there needs to be some conservatism--if a name is not matched well enough, then assume it can't be matched and exclude it from further operations. There are some data in the TNRS results that indicate these cases, such as quality and whether or not the name was matched exactly, etc.

Ultimately, for full utility of the TNRS, there will need to be some kind of user-confirmation and/or selection of ambiguous results. In the meantime, it seems better to exclude ambiguous results than to accept erroneous ones.
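One way to picture the conservative behaviour described here is the sketch below: a name is accepted only when it has a single well-matched candidate, and is otherwise excluded from further operations. The field names `score` and `is_approximate_match` follow the ones mentioned elsewhere in this thread; the dictionary layout, the function names, and the ids are assumptions for the example, not the actual script or the actual TNRS response.

```python
def accept_or_exclude(matches):
    """Return the OTT id for a confidently matched name, else None.

    Only non-approximate matches with a score of 1 count. If they do not
    point at exactly one id, the name is treated as unmatched.
    """
    exact_ids = {
        m["ott_id"]
        for m in matches
        if m["score"] == 1 and not m["is_approximate_match"]
    }
    if len(exact_ids) == 1:
        return exact_ids.pop()
    return None


def resolve_all(tnrs_results):
    """Split queried names into accepted ({name: ott_id}) and excluded ([name])."""
    accepted, excluded = {}, []
    for name, matches in tnrs_results.items():
        ott_id = accept_or_exclude(matches)
        if ott_id is None:
            excluded.append(name)
        else:
            accepted[name] = ott_id
    return accepted, excluded


# Hypothetical results keyed by queried name; ids and scores are invented.
tnrs_results = {
    "Anolis marius": [
        {"ott_id": 1, "score": 1.0, "is_approximate_match": False},
    ],
    "Anolis mispellis": [
        {"ott_id": 2, "score": 0.8, "is_approximate_match": True},
    ],
    "Anolis nomatchius": [],
}

accepted, excluded = resolve_all(tnrs_results)
print(accepted)
print(excluded)
```

The excluded list can be reported in the workflow's status output, so that a name which was dropped is visible to the user and is not silently replaced by a similar-looking one.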

curtislisle commented 9 years ago

Thanks, Cody. Agreed. This “first return” was just put in place to test the initial API interface. If memory serves, there is a Verbose Open Tree TNRS method that returns the full record for review, right? I agree about needing to provide the TNRS better guidance so good matches are returned. I figured that you would be able to suggest better use of the OTL API.

One thing that Arbor doesn’t currently have is a way to interrupt flow and pop up a dialog box for users to choose an option, then continue processing. You’ve seen how the major algorithms produce status outputs to show how they ran, for review afterwards. I’d like to improve the feedback possibilities, but we should work within the “forward processing only” model in the meantime.

In particular for TNRS, I am open to suggestions of how we could apply a smart algorithm to decide when to choose a good enough match and when to return a verbose error answer, providing several suggested TNRS resolutions. Then the user could modify the data and re-run the pipeline to get better results.

(In reply to Cody Hinchliff's comment of Jul 29, 2015, 2:21 PM, above.)

chinchliff commented 9 years ago

Hey Curt, that makes sense. I have noticed that there is no feature to allow the user to interrupt a workflow. I think your idea of using a 'smart algorithm' and more verbose reporting for unmatched names makes sense. It's straightforward to identify high quality matches from the TNRS--they have a score of 1 and the 'is_approximate_match' field is false. Those are exact string matches, which are as reliable as the input names.

If those properties are otherwise, then the match is a fuzzy match. Fuzzy matches are highly suspect. First, because if a fuzzy match is returned, then it means there was no perfect match to the queried name, and second, because the 'score' value for a fuzzy match doesn't necessarily reflect how good it is, but rather simply the basic string similarity to the matched name, so it's difficult to know algorithmically which fuzzy match is correct (if any).

I would recommend auto-accepting all exact matches, and returning lists of the fuzzy matches for user review. I'm not sure of the best way to list the fuzzy matches, as there may be zero to many fuzzy matches for each input name. I suppose they could be listed in a standard table with redundant entries for the queried name, e.g.

    queried_name      possible_match
    Anolis mispellis  Anolis fuzzimatchius
    Anolis mispellis  Anolis questionabilis
    Anolis mispellis  Anolis incorrectis
    Anolis narius     Anolis marius
    Anolis narius     Anolis varius

...Or a list:

    queried_name      possible_matches
    Anolis mispellis  Anolis fuzzimatchius, Anolis questionabilis, Anolis incorrectis
    Anolis narius     Anolis marius, Anolis varius

...But maybe there is something better?
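Both layouts can be produced from the same data. The sketch below separates exact matches (score of 1 and `is_approximate_match` false) from fuzzy ones, then builds either one row per fuzzy match or one row per queried name. The `matched_name` key, the function names, and the scores are assumptions for the example and may not match the real TNRS output.

```python
def split_matches(matches):
    """Separate exact matches from fuzzy matches."""
    exact, fuzzy = [], []
    for m in matches:
        if m["score"] == 1 and not m["is_approximate_match"]:
            exact.append(m)
        else:
            fuzzy.append(m)
    return exact, fuzzy


def names_needing_review(results):
    """Map each queried name that has no exact match to its fuzzy matches."""
    review = {}
    for name, matches in results.items():
        exact, fuzzy = split_matches(matches)
        if not exact:
            review[name] = fuzzy  # may be an empty list
    return review


def long_rows(review):
    """One (queried_name, possible_match) row per fuzzy match."""
    return [(name, m["matched_name"]) for name, ms in review.items() for m in ms]


def grouped_rows(review):
    """One (queried_name, possible_matches) row per queried name."""
    return [(name, ", ".join(m["matched_name"] for m in ms))
            for name, ms in review.items()]


def match(matched_name, score, approximate):
    """Build a hypothetical match record; the scores below are invented."""
    return {"matched_name": matched_name, "score": score,
            "is_approximate_match": approximate}


results = {
    "Anolis mispellis": [match("Anolis fuzzimatchius", 0.8, True),
                         match("Anolis questionabilis", 0.7, True),
                         match("Anolis incorrectis", 0.6, True)],
    "Anolis narius": [match("Anolis marius", 0.9, True),
                      match("Anolis varius", 0.9, True)],
    "Anolis carolinensis": [match("Anolis carolinensis", 1, False)],
}

review = names_needing_review(results)
for row in long_rows(review):
    print("\t".join(row))
for row in grouped_rows(review):
    print("\t".join(row))
```

One difference between the layouts: a queried name with zero fuzzy matches produces no rows in the first table, but still appears in the second with an empty list. That may make the grouped form the safer choice if every unmatched name should be visible to the user.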

chinchliff commented 9 years ago

For now, I am just adding the verbose script as a component to the workflows that use the TNRS first return script, so that the detailed results are provided in the workflow output.
