Closed by fcbond 12 months ago
Perhaps I could tackle this; where should I start looking in the source code?
Hi @fcbond, do you have LTDB running on a public endpoint, so we can see what your comment means?
@arademaker Francis's comment is actually my email to him :).
LTDB currently displays lemmas instead of surface forms in corpus examples; that's rather inconvenient for non-English languages...
I got as far as setting up PyCharm as the debugger.
But I can't figure out where the relevant code is, for the examples that get displayed. Any help? Where should I try to set the breakpoint?
The lex-rule is shown by the route
https://github.com/fcbond/ltdb/blob/dbbc9f9406b6e8a864e212d695d97fcacd6a16d8/web/routes.py#L102-L104
Somewhat confusingly, it is rendered by the lextype template, which displays the sentences like this: https://github.com/fcbond/ltdb/blob/dbbc9f9406b6e8a864e212d695d97fcacd6a16d8/web/templates/lextype.html#L80-L89
The problem is that the words in the sents dictionary are the terminals of the trees:
https://github.com/fcbond/ltdb/blob/dbbc9f9406b6e8a864e212d695d97fcacd6a16d8/scripts/gold2db.py#L111C13-L120C67
In grammars that use an external morphological analyzer, like the SRG, these are probably the lemmas. I don't know where the surface form is stored.
Ideally we should be able to link back to cfrom-cto, and then use the original sentence, ...
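To make the cfrom-cto idea concrete, here is a minimal sketch of recovering surface forms by slicing the original sentence. It is not LTDB code: the function name `surface_forms` and the list-of-tuples input are hypothetical, and it assumes that cfrom/cto are character offsets into the original sentence with cto exclusive.

```python
from typing import List, Tuple


def surface_forms(sentence: str, spans: List[Tuple[int, int]]) -> List[str]:
    """Return the substring of `sentence` covered by each (cfrom, cto) span.

    Assumption: offsets are character positions and `cto` is exclusive.
    """
    forms = []
    for cfrom, cto in spans:
        # Reject spans that fall outside the sentence instead of silently
        # returning a truncated or empty string.
        if not (0 <= cfrom <= cto <= len(sentence)):
            raise ValueError(f"span ({cfrom}, {cto}) is outside the sentence")
        forms.append(sentence[cfrom:cto])
    return forms


# Hypothetical example: one span per terminal of the tree.
sentence = "Los perros ladran"
spans = [(0, 3), (4, 10), (11, 17)]
print(surface_forms(sentence, spans))  # ['Los', 'perros', 'ladran']
```

If the terminals stored in the database carried such spans, the examples could show these slices in place of the lemmas.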
Maybe it can be fixed like this? https://github.com/fcbond/ltdb/pull/39
However, that only fixes the issue in the examples (not the trees), which is maybe OK. I suspect the trees come from ACE directly? ACE+LUI displays only lemmas, too, and I am not willing to try to fix that for the moment. In any case, having the surface forms in the examples is already much better, and reading the tree is easy with the example displayed right above it.
That being said, if you think surface forms can be passed to the code displaying the trees, let me know! Perhaps also the DMRS?
LTDB currently displays lemmas in examples; does that depend on LTDB code or on something else (ACE output)?
Would be much more convenient to see the examples with surface forms... Especially when they get longer, having just lemmas in Spanish can be confusing :).