alpheios-project / arethusa

Arethusa: Annotation Environment
http://sosol.perseids.org/tools/arethusa
MIT License

Widget API: findWord needs to handle normalized text #817

Closed by balmas 4 years ago

balmas commented 4 years ago

As reported by @nevenjovanovic at perseids-publications/treebank-template#42

Using the treebanks https://nevenjovanovic.github.io/treebank-template/fragment-SBJ-10-1-10/11 and https://nevenjovanovic.github.io/treebank-template/fragment-SBJ-10-1-10/12 (the same sentence, without the treebank diagram and with a partial one) on our site, which integrates Alpheios with treebanks (http://croala.ffzg.unizg.hr/eklogai/theca/10-sent-sbj/), words 3, 5, 9-13, 20, 24, 26-28, and 30 do not show disambiguation data, even those words that have multiple possible morphological identifications. I don't understand why this is happening, because the morphological data and lemmata are present in the source, for example:

<word id="24" form="ἀήθεις" lemma="ἀήθης" postag="a-p---ma-" relation="" head=""/>
<word id="25" form="," lemma="punc1" postag="u--------" relation="AuxX" head="0"/>
<word id="26" form="τροφῆς" lemma="τροφή" postag="n-s---fg-" relation="" head=""/>
<word id="27" form="δ’" lemma="δέ" postag="d-------p" relation="" head=""/>
<word id="28" form="ἡμέρου" lemma="ἥμερος" postag="a-s---fgp" relation="" head=""/>
<word id="29" form="παντελῶς" lemma="παντελῶς" postag="d-------p" relation="" head=""/>
<word id="30" form="ἀνεννοήτους" lemma="ἀνεννόητος" postag="a-p---ma-" relation="" head=""/>
<word id="31" form="." lemma="punc1" postag="u--------" relation="AuxK" head="0"/>

On the other hand, on another page of the same site (http://croala.ffzg.unizg.hr/eklogai/theca/ps-xen-athen-2-14-2/), all words are disambiguated according to the source (https://nevenjovanovic.github.io/treebank-template/ps-xen-2-14-2-16/1).

I use mkdocs to produce the pages, but the sentences aligned with treebanks are HTML fragments. Tested in Firefox and Chrome.

From @balmas :

This is caused by an inconsistency in how we handle encoding in Alpheios and Arethusa. Alpheios sends an NFC-normalized version of the word text to Arethusa to find in the sentence, but Arethusa does not apply the same normalization to the text in the treebank on its end, so it does not find a match. I had taken this into account on the Alpheios end when data is received from Arethusa, but not in Arethusa when data is received from Alpheios.
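To illustrate the kind of mismatch described here, the following is a minimal Python sketch, not the Arethusa code (which is JavaScript). The `find_word` helper is hypothetical, and the choice of source encoding is an assumption: it supposes the treebank spells ἀήθεις with ETA WITH OXIA (U+1F75), which NFC maps to ETA WITH TONOS (U+03AE). The actual encoding of the affected treebank files is not stated in this issue.

```python
import unicodedata


def nfc(text: str) -> str:
    """Return the NFC-normalized form of text."""
    return unicodedata.normalize("NFC", text)


def find_word(forms: list, target: str, normalize: bool = True) -> int:
    """Return the index of target in forms, or -1 if it is absent.

    Hypothetical stand-in for the widget lookup, not Arethusa's real code.
    """
    if normalize:
        # Normalize both sides so canonically equivalent spellings compare equal.
        target = nfc(target)
        forms = [nfc(form) for form in forms]
    return forms.index(target) if target in forms else -1


# Assumed treebank spelling: ETA WITH OXIA (U+1F75) in the second position.
treebank_forms = ["\u1f00\u1f75\u03b8\u03b5\u03b9\u03c2", ","]
# The same word as an NFC-normalizing client would send it (U+1F75 becomes U+03AE).
requested = nfc(treebank_forms[0])

print(find_word(treebank_forms, requested, normalize=False))  # -1: raw comparison misses
print(find_word(treebank_forms, requested))                   # 0: normalized comparison matches
```

The two strings render identically, which is why the missing match is hard to spot by reading the treebank source.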

So, this is a fix I need to make in Arethusa and I'll transfer the issue there.

balmas commented 4 years ago

This fix is deployed in https://github.com/perseids-tools/arethusa-widget/releases/tag/v2.1.0. The treebank-template master branch has been updated to include the new version of Arethusa, and this is deployed at trees.alpheios.net.

It can be tested on the Alpheios treebank test page at https://alpheios-misc-dev.s3.us-east-2.amazonaws.com/treebank-test-page/test.html (see the test under "User Reports"). This test uses a fork of Neven's repo, which has been updated with the newer treebank-template and Arethusa source.

monzug commented 4 years ago

In FF, I do not get the treebank icon in the pop-up for ἀήθεις. In Chrome, I get the treebank icon, but the diagram is empty (see screenshot). Is there anything else that I should check here?

balmas commented 4 years ago

In Firefox on Mac or Windows? It's working for me on FF on Linux, although the first time I loaded the page I was getting errors retrieving resources, which worries me a little.

As for the tree, yes that's expected, because no relationships are coded in that treebank file. That is the source of the request for alpheios-project/alpheios-core#484 which will be available with the next build.

monzug commented 4 years ago

Yes, I finally got the treebank icon in FF on Windows too. It took a while to load the diagram icon in the pop-up, but I was finally able to open the diagram too. Initially I kept getting the resource error.
