Planteome / plant-ontology

Repository for the Plant Ontology
Creative Commons Attribution 4.0 International

problem with Spanish synonyms #599

Open planteome-user opened 9 years ago

planteome-user commented 9 years ago

Some Spanish synonyms for PO terms that include non-ASCII characters seem to be specified incorrectly in the PO ontology file.

For example, check the first synonym in the stanza shown below. The character 'ó' looks fine on the PO Web site (http://plantontology.org/amigo/go.cgi?view=details&search_constraint=terms&depth=0&query=PO:0006502) but in the published OBO file it is encoded as '&#243'.

[Term]
id: PO:0006502
name: flower abscission zone
namespace: plant_anatomy
def: "Zone at base of the flower that contains an abscission (or separation) layer and a protective layer, both involved in the abscission of the flower and its parts." [GR:Pankaj_Jaiswal]
synonym: "zona de absici&#243n de la flor (Spanish)" EXACT Spanish [POC:Maria_Alejandra_Gandolfo]
synonym: "落花帯(層) (Japanese)" EXACT Japanese [NIG:Yukiko_Yamazaki]
is_a: PO:0000146 ! abscission zone
relationship: part_of PO:0009046 ! flower

Reported by: tberardini

Original Ticket: obo/plant-ontology-po-term-requests/599

planteome-user commented 9 years ago

Please check into this issue. Thanks very much.

Original comment by: tberardini

planteome-user commented 9 years ago

Hi Tanya, the "&#243" is an extended ASCII code for the character 'ó'. That is how it is displayed in the browser. You can find more info about these codes here: http://www.ascii-code.com/ Is it causing an issue for you?

Original comment by: cooperl09

planteome-user commented 9 years ago

Our developer's comment:

"Yes, it does cause an issue for us: instead of just extracting text from the published PO data file and putting it into our system, our program needs to look for those "special cases" and decode them before using them.

Interestingly enough, we don't need to decode Japanese glyphs in the very same PO data files. PO data files appear to be encoded in UTF-8 and Japanese glyphs are being put there "as-is", without using any extended ASCII codes.

So the question is: why not put Spanish characters there the same way as [you do] with Japanese, "as is", without encoding?"

Thanks very much for your help.

Original comment by: tberardini
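
To make the developer's point concrete, here is a minimal sketch of how a consumer might locate the "special cases" in PO data lines. It assumes the references are numeric (decimal or hexadecimal) and that the trailing semicolon may be missing, as in the stanza above. The names `NUMERIC_REF` and `find_numeric_refs` are hypothetical and not part of any PO or AmiGO tooling.

```python
import re

# Numeric character references such as "&#243", "&#243;" or "&#xF3;".
# The semicolon is optional because the stanza quoted above omits it.
NUMERIC_REF = re.compile(r"&#(?:[xX][0-9A-Fa-f]+|[0-9]+);?")


def find_numeric_refs(lines):
    """Yield (line_number, reference) for every numeric reference found."""
    for number, line in enumerate(lines, start=1):
        for match in NUMERIC_REF.finditer(line):
            yield number, match.group(0)


# The two synonym lines from the stanza in this ticket.
sample = [
    'synonym: "zona de absici&#243n de la flor (Spanish)" EXACT Spanish [POC:Maria_Alejandra_Gandolfo]',
    'synonym: "落花帯(層) (Japanese)" EXACT Japanese [NIG:Yukiko_Yamazaki]',
]

hits = list(find_numeric_refs(sample))
print(hits)
```

Only the Spanish line is reported; the Japanese line already holds its characters directly, so there is nothing to decode there.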

planteome-user commented 9 years ago

My understanding is that it is a requirement of the AmiGO browser. I will ask Justin to check into it and get back to you.

Original comment by: cooperl09

elserj commented 9 years ago

I took a crack at fixing this. First off, I found we were using the wrong terminology; I had better luck working on it once I called them "HTML entities" rather than ASCII codes. I wrote a script that successfully converts the Spanish ones; unfortunately, it also broke the Japanese ones. I am still working on a more permanent solution.

@tberardini
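
For illustration only, one way such a conversion could be sketched in Python is shown below. This is not the script mentioned in the comment above, and the thread does not say why the Japanese synonyms broke. The sketch assumes the text has already been read as UTF-8 and replaces only numeric character references, so characters that are already stored directly pass through unchanged. The function name `decode_numeric_refs` is hypothetical.

```python
import re

# Decimal ("&#243", "&#243;") or hexadecimal ("&#xF3;") references.
# The semicolon is optional because the published stanza omits it.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));?")


def decode_numeric_refs(text: str) -> str:
    """Replace numeric character references with the characters they name."""

    def _replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code_point)
        except (ValueError, OverflowError):
            # Not a valid Unicode code point: leave the text untouched.
            return match.group(0)

    return _NUMERIC_REF.sub(_replace, text)


spanish = 'synonym: "zona de absici&#243n de la flor (Spanish)" EXACT Spanish [POC:Maria_Alejandra_Gandolfo]'
japanese = 'synonym: "落花帯(層) (Japanese)" EXACT Japanese [NIG:Yukiko_Yamazaki]'

print(decode_numeric_refs(spanish))
print(decode_numeric_refs(japanese))
```

When applying something like this to a whole file, opening it with `encoding="utf-8"` for both reading and writing keeps the existing non-ASCII text intact. Named entities such as `&amp;` are deliberately left alone here, since decoding them would be a separate decision.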

tberardini commented 9 years ago

@YarikM FYI