GlobalNamesArchitecture / gni

Global Names Index
http://wiki.github.com/GlobalNamesArchitecture/gni

add common names to resolver result #40

Closed jhpoelen closed 8 years ago

jhpoelen commented 9 years ago

As discussed with @dimus:

In addition to taxon hierarchies, I suggest including the available common names for resolved taxa. This would help me immensely in making the search features in http://globalbioticinteractions.org friendlier for humans.

Currently the resolver returns something like:

...
data_source_id: 4,
data_source_title: "NCBI",
gni_uuid: "16f235a0-e4a3-529c-9b83-bd15fe722110",
name_string: "Homo sapiens",
canonical_form: "Homo sapiens",
classification_path: "|Eukaryota|Opisthokonta|Metazoa|Eumetazoa|Bilateria|Coelomata|Deuterostomia|Chordata|Craniata|Vertebrata|Gnathostomata|Teleostomi|Euteleostomi|Sarcopterygii|Tetrapoda|Amniota|Mammalia|Theria|Eutheria|Euarchontoglires|Primates|Haplorrhini|Simiiformes|Catarrhini|Hominoidea|Hominidae|Homininae|Homo|Homo sapiens",
classification_path_ranks: "|superkingdom||kingdom|||||phylum|subphylum||superclass||||||class|||superorder|order|suborder|infraorder|parvorder|superfamily|family|subfamily|genus|species",
classification_path_ids: "131567|2759|33154|33208|6072|33213|33316|33511|7711|89593|7742|7776|117570|117571|8287|32523|32524|40674|32525|9347|314146|9443|376913|314293|9526|314295|9604|207598|9605|9606",
...

The suggested result (including common names) would look something like:

...
data_source_id: 4,
data_source_title: "NCBI",
gni_uuid: "16f235a0-e4a3-529c-9b83-bd15fe722110",
name_string: "Homo sapiens",
canonical_form: "Homo sapiens",
common_names: "human @en|Mensch @de|mens @nl",
classification_path: "|Eukaryota|Opisthokonta|Metazoa|Eumetazoa|Bilateria|Coelomata|Deuterostomia|Chordata|Craniata|Vertebrata|Gnathostomata|Teleostomi|Euteleostomi|Sarcopterygii|Tetrapoda|Amniota|Mammalia|Theria|Eutheria|Euarchontoglires|Primates|Haplorrhini|Simiiformes|Catarrhini|Hominoidea|Hominidae|Homininae|Homo|Homo sapiens",
classification_path_ranks: "|superkingdom||kingdom|||||phylum|subphylum||superclass||||||class|||superorder|order|suborder|infraorder|parvorder|superfamily|family|subfamily|genus|species",
classification_path_ids: "131567|2759|33154|33208|6072|33213|33316|33511|7711|89593|7742|7776|117570|117571|8287|32523|32524|40674|32525|9347|314146|9443|376913|314293|9526|314295|9604|207598|9605|9606",
taxon_id: "9606",
...

dimus commented 9 years ago

Adding the parameter with_vernaculars=true adds common-name information to the output.
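For anyone wiring this into a client, here is a minimal sketch of building such a request URL. Only `with_vernaculars=true` comes from this thread; the endpoint URL and the `names` parameter are assumptions used for illustration and should be checked against the resolver's API documentation. The helper `build_resolver_query` is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: endpoint and the `names` parameter are illustrative only;
# the thread itself only confirms `with_vernaculars=true`.
RESOLVER_URL = "http://resolver.globalnames.org/name_resolvers.json"


def build_resolver_query(names, with_vernaculars=True):
    """Build a GET URL for a list of names (hypothetical helper)."""
    # Assumed convention: names are joined with a pipe in a single parameter.
    params = {"names": "|".join(names)}
    if with_vernaculars:
        # The flag described in this thread.
        params["with_vernaculars"] = "true"
    # urlencode percent-encodes the pipe and encodes spaces as '+'.
    return RESOLVER_URL + "?" + urlencode(params)


url = build_resolver_query(["Homo sapiens"])
print(url)
```

Leaving the flag off by default on the client side keeps responses small for callers that only need the classification.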

jhpoelen commented 9 years ago

Thanks for adding the vernacular names, @dimus. Some observations:

  1. For frogs (Anura), GBIF seems to include many languages but not English.
  2. WoRMS doesn't seem to have any vernaculars.
  3. NCBI has vernaculars, but doesn't set the language (it seems to be English by default).
  4. ITIS has some vernaculars, and they seem to be Spanish only. The language code used doesn't seem to be the two-letter code that I am used to (e.g. "es"); instead it looks like "bagre boca chica @Spanish".
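A client could smooth over the language inconsistencies listed above. The sketch below assumes vernaculars arrive in the pipe-delimited `name @language` form shown in the suggestion at the top of this issue; the actual response layout may differ. The `LANGUAGE_CODES` table and the `parse_common_names` helper are hypothetical, and falling back to a default language for untagged names (as NCBI seems to need) is a guess, not confirmed behavior.

```python
# Hypothetical mapping from spelled-out language names to two-letter codes.
LANGUAGE_CODES = {"english": "en", "spanish": "es", "german": "de", "dutch": "nl"}


def parse_common_names(value, default_language=None):
    """Split 'name @lang|name @lang' into (name, language) pairs."""
    pairs = []
    for entry in value.split("|"):
        entry = entry.strip()
        if not entry:
            continue
        # Split on the last ' @' so names containing '@' elsewhere survive.
        name, sep, lang = entry.rpartition(" @")
        if not sep:
            # No language tag at all: use the caller-supplied default.
            name, lang = entry, default_language
        else:
            lang = lang.strip()
            # Map spelled-out names like 'Spanish' to 'es'; keep unknown tags as-is.
            lang = LANGUAGE_CODES.get(lang.lower(), lang)
        pairs.append((name.strip(), lang))
    return pairs


print(parse_common_names("human @en|Mensch @de|mens @nl"))
print(parse_common_names("bagre boca chica @Spanish"))
```

Unknown language labels are passed through unchanged rather than dropped, so nothing is lost if a source uses a label the table does not cover.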

You can find some specific examples in https://github.com/jhpoelen/eol-globi-data/commit/5d6ab975bbc95ec79b47d287abc6e5230a098302 .

Are these results expected?

jhpoelen commented 9 years ago

After running a couple of batches in production with globalbioticinteractions, I noticed that name resolving against globalnames has been causing internal server errors and gateway timeouts since the introduction of the vernacular names. I've disabled the feature for now and hope to re-enable it once we understand how to fix it.

here's an example from the logs:

2015-08-20 12:31:51,812 [main] ERROR org.eol.globi.tool.LinkerGlobalNames - batch #1117 problem matching terms: [4475460
|Zilora ferruginea|4474176|Xerocomus communis|4474947|Xylota segnis|4475203|Zaraea fasciata|4474447|Xylaria filiformis|4
475477|Zodion|4474960|Xylota sylvarum|4475216|Zaraea lonicerae|4474973|Xylota tarda|4475480|Iberis|4475225|Zelleromyces 
stephensii|4475238|Zenillia libatrix|4474978|Xylota xanthocnema|4475490|Zodion cinereum|4474479|Xylaria guepinii|4475503
|Zoellneria eucalypti|4474473|Xylaria friesii|4475241|Archiearis notha|4474987|Xylotachina diluta|4474484|Xylaria hypoxy
lon|4474992|Xyphosia miliaria|4475512|Zoellneria rosarum|4475142|Carabus (Megodontus) violaceus|4475654|Zygorhizidium me
losirae|4474369|Xyela julii|4475139|Zaira cinerea|4475651|Aulacoseira italica subsp. subarctica|4475662|Kirchneriella ob
esa|4475147|Carabus (Morphocarabus) monilis|4475659|Zygorhizidium parvum|4475157|Pterostichus (Platysma) niger|4474641|X
yleborus dryographus|4474386|Xyela longula|4475667|Kirchneriella|4475164|Zalerion arboricola|4474911|Xylohypha ortmansia
e|4474395|Xylaplothrips fuliginosus|4475430|Galerucella|4474658|Xylechinus pilosus|4475181|Zalerion maritima|4474671|Xyl
etinus longitarsis|4474920|Xylohypha pinicola|4474420|Xylaria carpophila|4475445|Zignoëlla morthieri|4474934|Xylophaga p
raestans|4475703|Zygospermella striata|4474929|Xylophaga dorsalis|4475698|Zygospermella insignis|4475455|Zignoëlla slapt
onensis|4474681|Xylobolus frustulatus|4474426|Xylaria|4475194|Zaraea aenea|4475450|Zignoëlla rhytidodes|4475078|Zabrus t
enebrioides|4475585|Zwackhiomyces dispersus|4474307|Xerula radicata|4474829|Xylohypha ferruginosa|4474824|Xylocoris (Xyl
ocoris) formicetorum|4475350|Zeugophora turneri|4475095|Zacladus exiguus|4475607|Zwackhiomyces sphinctrinoides|4474320|X
estobium rufovillosum|4475602|Zwackhiomyces lacustris|4475100|Zacladus geranii|4474334|Xestophanes potentillae|4474591|X
yleborinus saxesenii|4475610|Leptogium turgidum|1632955|Xylota|4475111|Phillyrea latifolia|4475619|Clauzadea metzleri|44
74604|Xyleborus dispar|4474351|Xiphydria prolongata|4474858|Xylohypha nigrescens|4475114|Zaghouania phillyreae|4475371|Z
euzera pyrina|4475639|Zygogloea gemellipara|9397|Halictus|4475634|Zygiobia carpini|4474364|Xyela curva|4475644|Zygophial
a jamaicensis|4475269|Zeugophora flavicollis|4475264|Zenobiana prismatica|4475521|Zoopage thamnospira|4475535|Zoophthora
 anglica|4475530|Zoophagus insidians|816095|Phillyrea latifolia|4475544|Zoophthora radicans|4474790|Xylocleptes bispinus
|4475558|Zoothamnion arbuscula|4474273|Xerula caussei|4475043|Pisum sativum var. sativum|4474798|Cryptolestes ferrugineu
s|4474795|Xylocoris (Proxylocoris) galactinus|4475563|Zopfia rhizophila|4475572|Zopfiella erostrata|4474806|Xylocoris (X
ylocoris) cursitans|4474545|Xylaria oxyacanthae|4474803|Bitoma crenata|4474809|Rhizophagus|4474298|Xerula pudens]
org.eol.globi.service.PropertyEnricherException: Failed to query
        at org.eol.globi.service.GlobalNamesService.findTermsForNames(GlobalNamesService.java:74)
        at org.eol.globi.tool.LinkerGlobalNames.handleBatch(LinkerGlobalNames.java:66)
        at org.eol.globi.tool.LinkerGlobalNames.link(LinkerGlobalNames.java:45)
        at org.eol.globi.tool.Normalizer.linkTaxa(Normalizer.java:127)
        at org.eol.globi.tool.Normalizer.run(Normalizer.java:98)
        at org.eol.globi.tool.Normalizer.main(Normalizer.java:57)
Caused by: org.apache.http.client.HttpResponseException: Internal Server Error
        at org.apache.http.impl.client.BasicResponseHandler.handleResponse(BasicResponseHandler.java:67)
        at org.apache.http.impl.client.BasicResponseHandler.handleResponse(BasicResponseHandler.java:52)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:218)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:160)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:136)
        at org.eol.globi.service.GlobalNamesService.queryForNames(GlobalNamesService.java:99)
        at org.eol.globi.service.GlobalNamesService.findTermsForNames(GlobalNamesService.java:71)
        ... 5 more

another one:

2015-08-20 12:30:51,543 [main] ERROR org.eol.globi.tool.LinkerGlobalNames - batch #1115 problem matching terms: [4471364
|Valsella amphoraria|4471621|Velutina plicatilis|4472135|Verticillium|4472128|Lecanora albescens|4471618|Hydrozoa|447238
6|Vibrissea guernisacii|4472140|Verticillium albo-atrum|4471373|Valsella clypeata|4471624|Styela coriacea|4471627|Veluti
na velutina|4471382|Valsella polyspora|4471632|Venturia carpophila|4471888|Venturia maculiformis|347923|Scilla|4472154|V
erticillium catenulatum|4471387|Valsella salicis|4471396|Vankya ornithogali|4470375|Bellevalia|4472167|Verticillium dahl
iae|4471905|Venturia minuta|4470127|Ustilago maydis|4471919|Venturia populina|4471914|Venturia palustris|4471147|Valsa i
ntermedia|4471156|Valsa laurocerasi|1052452|Salix matsudana|4471422|Vararia gallica|4472185|Verticillium insectorum|4470
534|Valsa ambiens|4472070|Veronaea botryosa|4470529|Valsa abrupta|4471555|Velutarina rufo-olivacea|4472079|Veronaea cari
cis|4472084|Veronaea carlinae|4472089|Veronaea parvispora|4471322|Valsaria insitiva|4470305|Ustilago tritici|4471086|Val
sa cypri|4471855|Venturia macularis|4472111|Verpa conica|4471081|Valsa ceuthospora|4472116|Verrucaria conturmatula|44703
20|Scilla sardensis|4470323|Ustilago vaillantii|4472371|Vespula (Vespula) austriaca|4472381|Vibrissea flavovirens|447212
1|Verrucaria latericola|4470330|Muscari botryoides|4272191|Valsa|4471355|Valsella adhaerens|4470470|Valdensia heterodoxa
|4472007|Venturia saliciperda|4472263|Physarum compressum|4472268|Physarum leucopus|4470217|Elytrigia juncea|4472277|Ste
monitis axifera|4471260|Valsaria anserina|4471516|Vasates pedicularis|4470493|Gaultheria|4471519|Acer saccharinum|447229
3|Vesiculomyces citrinus|4471527|Vasates retiolatus|4472288|Verticillium|4471265|Valsaria cincta|4471522|Vasates quadrip
edes|4471779|Venturia crataegi|4470504|Valsa abietis|4471784|Venturia ditricha|4471540|Vasates rigidus|4471793|Venturia 
fraxini|4472061|Venturiocistella ulicicola|4471550|Velutarina juniperi|4471806|Venturia geranii|4472056|Venturiocistella
 heterotricha|4471545|Vascellum pratense|4471941|Venturia pyrina|4471936|Venturia potentillae|4470414|Ustilentyloma bref
eldii|4470927|Valsa auerswaldii|4472200|Verticillium nubilum|4472213|Verticillium psalliotae|4471190|Valsa sordida|44711
85|Valsa pini|4471441|Climbing plants|4470431|Animalia|4470424|Ustilentyloma fluitans|4470936|Valsa ceratosperma|4471448
|Vararia ochroleuca|4471705|Venturia cerasi|4472218|Verticillium rexianum|4470438|Utricularia australis|4471974|Venturia
 rumicis|4472230|Ceratiomyxa fruticulosa|4472225|Arcyria nutans|4470441|Crustacea|4471209|Populus balsamifera|4470448|Ut
ricularia minor|4470194|Ustilago serpens|4471738|Venturia chlorospora|4470459|Diaptomus]
org.eol.globi.service.PropertyEnricherException: Failed to query
        at org.eol.globi.service.GlobalNamesService.findTermsForNames(GlobalNamesService.java:74)
        at org.eol.globi.tool.LinkerGlobalNames.handleBatch(LinkerGlobalNames.java:66)
        at org.eol.globi.tool.LinkerGlobalNames.link(LinkerGlobalNames.java:45)
        at org.eol.globi.tool.Normalizer.linkTaxa(Normalizer.java:127)
        at org.eol.globi.tool.Normalizer.run(Normalizer.java:98)
        at org.eol.globi.tool.Normalizer.main(Normalizer.java:57)
Caused by: org.apache.http.client.HttpResponseException: Gateway Time-out
        at org.apache.http.impl.client.BasicResponseHandler.handleResponse(BasicResponseHandler.java:67)
        at org.apache.http.impl.client.BasicResponseHandler.handleResponse(BasicResponseHandler.java:52)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:218)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:160)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:136)
        at org.eol.globi.service.GlobalNamesService.queryForNames(GlobalNamesService.java:99)
        at org.eol.globi.service.GlobalNamesService.findTermsForNames(GlobalNamesService.java:71)
        ... 5 more

jhpoelen commented 8 years ago

@dimus I suggest reopening this issue given the behavior reported above.

dimus commented 8 years ago

@jhpoelen -- do these examples consistently break the resolver?

jhpoelen commented 8 years ago

yep.

jhpoelen commented 8 years ago

@dimus I was able to reproduce and fix the issue on my end. It turned out to be a character encoding issue in the HTTP POST request that GloBI sends to the resolver. Thanks again for adding the vernacular names to the resolver results.
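For future readers: this thread does not record exactly what was wrong with the encoding, so the following is only an illustrative sketch of how such a mismatch can arise. It encodes a name from the failing batch ("Zignoëlla morthieri") as a form-encoded POST body under two charsets. The form field name `data` and the helper `encode_post_body` are hypothetical.

```python
from urllib.parse import urlencode


def encode_post_body(names, charset="utf-8"):
    """Form-encode names into POST body bytes (hypothetical helper)."""
    # `data` is an assumed field name, used only for illustration.
    encoded = urlencode({"data": "\n".join(names)}, encoding=charset)
    # After percent-encoding, the body is pure ASCII.
    return encoded.encode("ascii")


name = "Zignoëlla morthieri"
utf8_body = encode_post_body([name])
latin1_body = encode_post_body([name], charset="latin-1")

print(utf8_body)    # 'ë' becomes %C3%AB
print(latin1_body)  # 'ë' becomes %EB

# Declaring the charset explicitly avoids relying on client or server defaults.
headers = {"Content-Type": "application/x-www-form-urlencoded; charset=UTF-8"}
```

If the client sends the Latin-1 form while the server decodes the body as UTF-8, the `%EB` byte is not valid UTF-8 on its own, which is one plausible way a batch containing non-ASCII names could fail while all-ASCII batches succeed.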