OpenTreeOfLife / opentree

Opentree browsing and curation web site. For overarching or cross-repo concerns, please see the 'germinator' repo.
http://tree.opentreeoflife.org/
BSD 2-Clause "Simplified" License
111 stars 26 forks source link

Loading duplicate taxa #771

Closed bomeara closed 9 years ago

bomeara commented 9 years ago

I tried loading the TimeTree 50K taxon newick tree from http://www.timetree.org/book into study ot_409 (which currently has just the bird part). I got this error:

Sorry, there was an error importing this tree. Please double-check its format and data. Hide details Conversion to NeXML failed. Error message: [Reading in.nex ] Executing Error: Taxon number 34091 (coded by the token 2011) has already been encountered in this tree. Duplication of taxa in a tree is prohibited. at line 1, column (approximately) 3163806 (and file position 1278161)

There probably are valid homonyms (between plants and animals) in this tree, though.

jar398 commented 9 years ago

None of the links at http://www.biodiversitycenter.org/ttol work for me; they all 404.

kcranston commented 9 years ago

There are links to download two different formats of the 50K tree on the page that @bomeara posted. I think that's what he is trying to upload through OpenTree, not the links on the biodiversitycenter.org page.

jar398 commented 9 years ago

Right. I went straight to the 'data deposit url' for the study​

jar398 commented 9 years ago

So, you are asking for an extension to the Newick format accepted by the library that we use (peyotl / NCL) that would permit duplicate tip labels? I'm not sure how we would implement that, even if we wanted to... maybe when a duplicate label is encountered, append a counter to it, to force that tip to have a distinct label? But then I don't know how the OTU mapping interface would distinguish the two tips in a way that's meaningful to the user.

What behavior did you expect?

kcranston commented 9 years ago

We do have trees in the datastore that have duplicate tip labels - that's why we have the study quality tests that ask people to pick an exemplar when there are duplicates.

mtholder commented 9 years ago

I have not looked at the NCL error for this study (it could be a bug). duplicate labels are illegal in NEXUS. newick is silent on them. Obviously it makes it tedious/error prone to refer to a leaf. So, I'm tempted to side w/ @jar398 and say that we don't accept that type of input.

josephwb commented 9 years ago

Hey @bomeara. I was holding off loading that tree into ot_409 because mapping is such a pain, I wanted to make sure that the Aves tree was complete before adding the whole thing.

kcranston commented 9 years ago

Ah, so we allow curators to map tips with different original labels to the same OTT IDs, and then we ask for disambiguation, but we follow the nexus requirement and don't allow duplicates in the original labels. Probably could use a better error message, though.

mtholder commented 9 years ago

@jar398 just looked at the input. It appears that quoting of labels would fix this. It is not really clear whether - in a name requires quotes in newick. Gary Olson's description does not list it, but the subset of newick authors who are also nexus authors wrote that nexus uses the newick format for trees, and in NEXUS - is a token breaker. That is how NCL treats it (and open tree uses NCL for this ingest).

bomeara commented 9 years ago

Ok, thanks, all. @josephwb, for downstream dating having the giant timetree would be handy, thus my interest. I can modify the input tree so it has unique labels.

@jar398, OpenTree's TNRS does use context specific resolution, I believe: if I upload a tree of birds, and one name there could be a bird or a fern species, it figures out it's likelier to be a bird. For a tree like this, it could look at taxa "near" it to help, but barring that, my hope for the behavior was loading the tree and leaving to the user to deal with the resolution. It's understandable why this isn't possible, though.

jar398 commented 9 years ago

What you say about context is I think true, but only when it comes to choosing between alternatives that are in OTT. The situation Joseph described is that of a bird in TimeTree that is unknown to Open Tree, but has the same name as that of a fern known to Open Tree. In this situation, if you're ingesting the entire tree, Open Tree's TNRS will mistakenly auto-match the TT bird to the OT fern.

bomeara commented 9 years ago

Just to point out a bit of irony, ropensci/rotl had an issue with duplicate taxon names coming FROM OToL trees: https://github.com/ropensci/rotl/issues/46#issuecomment-136536980 . So one might argue that you shouldn't dish out what you can't take (duplicate names on trees) [joking].

jar398 commented 9 years ago

Agreed, probably we should retract the tip_label='ot:ott_taxon_name' feature, or at least document that you shouldn't use it if you want to be sure to get well-formed Newick.

On Tue, Sep 29, 2015 at 3:56 PM, bomeara notifications@github.com wrote:

Just to point out a bit of irony, ropensci/rotl had an issue with duplicate taxon names coming FROM OToL trees: ropensci/rotl#46 (comment) https://github.com/ropensci/rotl/issues/46#issuecomment-136536980 . So one might argue that you shouldn't dish out what you can't take (duplicate names on trees) [joking].

— Reply to this email directly or view it on GitHub https://github.com/OpenTreeOfLife/opentree/issues/771#issuecomment-144170736 .

jar398 commented 9 years ago

oops should be no quotes on that, sorry

On Tue, Sep 29, 2015 at 4:01 PM, Jonathan A Rees rees@mumble.net wrote:

Agreed, probably we should retract the tip_label='ot:ott_taxon_name' feature, or at least document that you shouldn't use it if you want to be sure to get well-formed Newick.

On Tue, Sep 29, 2015 at 3:56 PM, bomeara notifications@github.com wrote:

Just to point out a bit of irony, ropensci/rotl had an issue with duplicate taxon names coming FROM OToL trees: ropensci/rotl#46 (comment) https://github.com/ropensci/rotl/issues/46#issuecomment-136536980 . So one might argue that you shouldn't dish out what you can't take (duplicate names on trees) [joking].

— Reply to this email directly or view it on GitHub https://github.com/OpenTreeOfLife/opentree/issues/771#issuecomment-144170736 .

bomeara commented 9 years ago

No, no, I was joking --- don't remove features, please. rotl can now deal with it, though I suppose documenting a potential issue could be helpful. It is well-formed Newick to have duplicate names (I believe), even though Nexus can't.

mtholder commented 9 years ago

Hi @bomeara, I agree with your take that it is well formed. I'll create a new issue for this in the 'curator' app issue, as we probably should move to a newick parser that is more forgiving. NCL was handy because it supported the newick->NeXML part (allowing us to just convert any input format to NeXML and then to NexSON), but it was really geared toward NEXUS (and NEXUS is stricter than newick).

With respect to this issue, I doubt that we'll accept duplicate labels.

However, I think that most of this issue was caused by NCL treating the label in question as 3 separate labels (pre dash content, the dash, and post-dash content) and then using the last label for the node. In the context of NEXUS those errors get caught in preceding TAXA, CHARACTERS, or DATA blocks. So, NCL does not try to give more diagnostic messages.