OpenTreeOfLife / opentree

Opentree browsing and curation web site. For overarching or cross-repo concerns, please see the 'germinator' repo.
http://tree.opentreeoflife.org/
BSD 2-Clause "Simplified" License

taxonomy subtree web service newick result not parseable by Mesquite #636

Open balhoff opened 9 years ago

balhoff commented 9 years ago

I am trying to retrieve the taxonomy for Chordata using the web services. I used this command line:

curl -X POST http://api.opentreeoflife.org/v2/taxonomy/subtree -H 'Content-type:application/json' -d '{"ott_id":125642, "label_format": "name"}' >chordata.json

The newick format result uses single quotes around names that contain some special characters such as parentheses or quotes. However some of these names contain single quotes as characters, which are not escaped in any way (and I am not sure how they should be escaped). After removing the surrounding JSON content, I tried to parse the newick text using Mesquite and a few online tree viewers. None could successfully parse the tree.

So, it would be good if all name characters were escaped however it is supposed to be done in newick. But for my own uses I would actually prefer if the tree itself were a JSON structure rather than a newick string. It would be a lot easier to parse for many consumers and there are libraries that take care of all the necessary escaping.

mtholder commented 9 years ago

Hi Jim, I just wanted to crosslink this to:
https://github.com/OpenTreeOfLife/taxomachine/issues/74
https://github.com/OpenTreeOfLife/treemachine/issues/147
https://github.com/OpenTreeOfLife/ot-base/issues/10

We should definitely get the quoting of the newick fixed. I'm not sure what data structure you'd like to see for the tree-fully-in-JSON part of your request. We have a few in-house data models that we use just for transport in JSON. Did you have something in mind?

mtholder commented 9 years ago

@balhoff Can you point me to one of the names that has a ' which is not correctly quoted?

The newick convention is: if the label has a newick token-breaker, put the label in single quotes. If there is a single-quote in the label, change that to 2 consecutive single quotes. It's a bit wacky.
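A minimal Python sketch of that convention, for illustration only. The function name and the exact set of token-breakers (parentheses, square brackets, single quote, colon, semicolon, comma, whitespace) are assumptions based on the usual description of newick, not taken from the OpenTree code:

```python
# Characters assumed to break a newick token (an assumption, see above).
NEWICK_BREAKERS = set("()[]':;,")


def quote_newick_label(label: str) -> str:
    """Quote a label only if it contains a token-breaker or whitespace."""
    needs_quotes = any(ch in NEWICK_BREAKERS or ch.isspace() for ch in label)
    if not needs_quotes:
        return label
    # A single quote inside the label becomes two consecutive single quotes.
    return "'" + label.replace("'", "''") + "'"


print(quote_newick_label("Homo"))
print(quote_newick_label("O'Brien (sp.)"))
```

With this rule, a doubled single quote inside a quoted name is the escape itself and not a sign of a damaged name.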

balhoff commented 9 years ago

@mtholder in that case it sounds like the single quotes are being correctly quoted; I found double single quotes in some names and assumed they originated from OCR'd double quotes at some point. Sorry for misleading on that point!

Now I'm not sure what is causing the failure in Mesquite. It loads up a little over 600 taxa out of what is supposed to be a much greater number (there are 97,137 commas in the output), but it doesn't report an error. I will try a smaller tree and see if Mesquite handles the double single quote correctly; I'll report back whether that is the explanation.

josephwb commented 9 years ago

That tree opens fine with Dendroscope, for what it's worth.

Agreed on the wacky call.

balhoff commented 9 years ago

I did a little test with 3 taxa in Mesquite and it handled the quoted quotes. So I will need to put some more time into figuring out where it is going wrong. I will comment here if I figure it out. I changed the issue title since I'm not sure that there is actually any problem with your output.

balhoff commented 9 years ago

I would be open to various data models for the tree-in-JSON output. Mainly I think it would facilitate using the output of the service in a lot more contexts, since JSON parsers are available everywhere, and are very consistent, and newick parsers are less common and can be really inconsistent.

In my case I actually just want to grab all the taxon names and I don't care about the tree at the moment. That would be easy to do with any JSON format. I was trying to avoid writing a one-off newick parser for Scala. This would be sufficient (assuming the result is always a tree?):

{"name": "Chordata",
 "ott_id": 125642,
 "children": [
     {"name": "blah1",
      "ott_id": 1111111,
      "children": []
     },
     {"name": "blah2",
      "ott_id": 222222,
      "children": []
     }
 ]
}
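
To illustrate why a nested structure like this would be easy to consume, here is a rough Python sketch that collects every taxon name. It assumes the hypothetical "name" and "children" keys proposed above, which are not the service's actual output, and it walks the tree with an explicit stack so that a deep taxonomy does not hit the recursion limit:

```python
import json


def collect_names(tree: dict) -> list:
    """Return all names in the tree, parent first, children in listed order."""
    names = []
    stack = [tree]
    while stack:
        node = stack.pop()
        names.append(node["name"])
        # Push children reversed so they are popped in their original order.
        stack.extend(reversed(node.get("children", [])))
    return names


# Hypothetical payload in the proposed shape.
doc = json.loads("""
{"name": "Chordata", "ott_id": 125642, "children": [
    {"name": "blah1", "ott_id": 1111111, "children": []},
    {"name": "blah2", "ott_id": 222222, "children": []}
]}
""")
print(collect_names(doc))
```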

balhoff commented 9 years ago

I've found that Mesquite is tripping up on unquoted hyphens (-) and asterisks (*). I replaced all of these with underscores and it seems to have loaded the full tree (93,593 OTUs). I suppose this is Mesquite's fault since these don't seem to be special characters in newick format.

I'll leave it up to you whether to close this issue. Thanks!

mtholder commented 9 years ago

I suppose that Mesquite is using NEXUS rather than newick token-breaking rules. Still, there is no real reason not to quote these if it makes it easier for people to work with Mesquite, so I'd vote that we quote them.
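
Until the service does that, a client-side workaround could look roughly like the Python sketch below. It is an assumption-laden illustration, not the planned fix: the function name and regular expression are hypothetical, it assumes the newick string contains no bracketed comments, and it treats only hyphens and asterisks as the extra characters to protect. Labels that are already quoted, and anything following a colon (a branch length such as 1e-5), are left alone.

```python
import re

# Quoted label | colon plus branch length | unquoted label.
TOKEN = re.compile(r"'(?:[^']|'')*'|:[^(),:;\[\]'\s]*|[^(),:;\[\]'\s]+")


def quote_hyphens_and_asterisks(newick: str) -> str:
    """Wrap unquoted labels containing '-' or '*' in single quotes."""
    def fix(match):
        token = match.group(0)
        if token[0] in "':":
            return token  # already quoted, or a branch length
        if "-" in token or "*" in token:
            # Underscores stand for spaces in unquoted newick labels,
            # so they are converted when the label becomes quoted.
            return "'" + token.replace("_", " ") + "'"
        return token
    return TOKEN.sub(fix, newick)


print(quote_hyphens_and_asterisks("(A-b,'C''d',E*f:1e-5,G_h-i)root;"))
```

Unlike replacing the characters with underscores, this keeps the original names intact for any parser that honors newick quoting.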