deep-philology / DeepVocabulary

vocabulary server (mostly for Perseus but also standalone)
https://gu658.us1.eldarioncloud.com
MIT License

aggregation bug? #69

Closed gregorycrane closed 6 years ago

gregorycrane commented 6 years ago

I don't think there are 1,496 tokens in Soph. Ajax 1-10: https://gu658.us1.eldarioncloud.com/word-list/urn:cts:greekLit:tlg0011.tlg003.perseus-grc2:1-10/?page=all

jtauber commented 6 years ago

That is odd. I'll investigate what's happened here.

jtauber commented 6 years ago

Ahh, it looks like it's a problem with the passage references in Giuseppe's XML.

For example, line 171 is just given the reference 171 but line 172 is given the ref 1.172, 173 the ref 1.173 and so on for many more lines. This effectively places all those lines "under" passage 1.
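To make the effect concrete, here is a minimal sketch of how a prefix-based aggregation would behave with references like these. The `tokens_under` helper and the token counts are hypothetical, made up purely for illustration, and are not taken from the DeepVocabulary code or the actual data.

```python
# Hypothetical (reference, token count) rows; the counts are invented.
line_tokens = [
    ("1", 7),
    ("2", 6),
    ("171", 5),
    ("1.172", 6),  # mis-referenced: should be "172"
    ("1.173", 8),  # mis-referenced: should be "173"
]


def tokens_under(passage, rows):
    """Sum token counts for refs equal to `passage` or nested beneath it."""
    prefix = passage + "."
    return sum(n for ref, n in rows if ref == passage or ref.startswith(prefix))


# Line 1 really has 7 tokens here, but the nested refs inflate the total.
total_for_line_1 = tokens_under("1", line_tokens)
print(total_for_line_1)
```

With these made-up rows, the total for "1" includes the tokens of 1.172 and 1.173 as well as those of line 1 itself, which is the kind of inflation seen in the word list.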

jtauber commented 6 years ago

The fault doesn't ultimately lie in Giuseppe's XML, though. It likely goes back to the underlying TEI and the way Capitains is extracting hierarchy from it.

jtauber commented 6 years ago

I could correct this in our database, but the problem would still exist upstream in Giuseppe's repo. Giuseppe could fix it in his XML, but the problem would still exist further upstream in the sources he used. It seems we need to sort out the workflow for this, as this specific problem in Soph. Ajax will affect other things and there will be other cases just like it.

jtauber commented 6 years ago

Looks like Antigone suffers from a similar issue. It would be great to fix this because otherwise the usefulness of the word tool (and especially the newly integrated word list in Scaife) is diminished. For example:

https://lk353.eu1.eldarioncloud.com/reader/urn:cts:greekLit:tlg0011.tlg002.perseus-grc2:1/

is (correctly) just the first line of Antigone but the Word List on the right is (obviously) for considerably more. The latter is based on

https://gu658.us1.eldarioncloud.com/word-list/urn:cts:greekLit:tlg0011.tlg002.perseus-grc2:1/

which shows 675 tokens for "Antigone 1" because of some issue with Giuseppe's XML (which, again, I doubt is Giuseppe's fault but rather some issue in the citation scheme in the source XML).

jtauber commented 6 years ago

@gregorycrane Who should I bring this up with besides Giuseppe? Lisa? Matt?

jtauber commented 6 years ago

Just to update from the email discussion: I'm now more confident that the problem does NOT lie in the source text XML but just in Giuseppe's references. It seems Capitains is correctly honouring the citation mapping but Giuseppe's lemmatisation is not. As a workaround, I may just try to manually correct the database in GVT with some basic heuristics.
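As a rough illustration of what one such heuristic might look like (this is an assumption, not the correction that was actually applied): for a work cited by a single level of line numbers, a reference of the form `1.N` could be rewritten to `N`. The `normalise_ref` function and the sample references are hypothetical.

```python
import re

# Assumption: the work uses a flat, single-level line citation scheme,
# so a leading "1." in front of a line number is spurious.
SPURIOUS_PREFIX = re.compile(r"^1\.(\d+)$")


def normalise_ref(ref: str) -> str:
    """Strip a spurious leading '1.' from a line reference, if present."""
    match = SPURIOUS_PREFIX.match(ref)
    return match.group(1) if match else ref


refs = ["170", "171", "1.172", "1.173"]
fixed = [normalise_ref(r) for r in refs]
print(fixed)
```

A rule like this would be wrong for works that genuinely use a two-level scheme (for example book and line), so it would need to be restricted to works known to be flat.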

jtauber commented 6 years ago

This has been fixed with a bunch of manual corrections to Giuseppe's bad references. The corrections may not be comprehensive across the entire corpus, but they should cover the core.