Closed gregorycrane closed 6 years ago
That is odd. I'll investigate what's happened here.
Ahh, it looks like it's a problem with the passage references in Giuseppe's XML.
For example, line 171 is just given the reference 171 but line 172 is given the ref 1.172, 173 the ref 1.173 and so on for many more lines. This effectively places all those lines "under" passage 1.
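To illustrate why the bad refs inflate passage 1, here is a minimal sketch of a dot-separated hierarchy lookup. The function `refs_under` and the sample list are hypothetical and are not taken from Scaife, Capitains, or the word tool; the only assumption is that a ref such as 1.172 is treated as a child of passage 1.

```python
def refs_under(passage, refs):
    """Return the refs equal to `passage` or nested beneath it.

    Assumes a dot-separated hierarchy, so "1.172" is a child of "1".
    """
    prefix = passage + "."
    return [r for r in refs if r == passage or r.startswith(prefix)]


# Hypothetical sample: line 171 has a flat ref, lines 172-173 have bad refs.
refs = ["1", "2", "171", "1.172", "1.173"]

print(refs_under("1", refs))    # ['1', '1.172', '1.173']
print(refs_under("171", refs))  # ['171']
```

Under that assumption, a request for passage 1 picks up every line whose ref starts with "1.", which would match the oversized word lists.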
The fault doesn't ultimately lie in Giuseppe's XML, though. It likely goes back to the underlying TEI and the way Capitains is extracting hierarchy from it.
I could correct this in our database, but the problem would still exist upstream in Giuseppe's repo. Giuseppe could fix it in his XML, but the problem would still exist further upstream in the sources he used. It seems we need to sort out the workflow for this, as this specific problem in Soph. Ajax will affect other things and there will be other cases just like it.
Looks like Antigone suffers from a similar issue. It would be great to fix this because otherwise the usefulness of the word tool (and especially the newly integrated word list in Scaife) is diminished. For example:
https://lk353.eu1.eldarioncloud.com/reader/urn:cts:greekLit:tlg0011.tlg002.perseus-grc2:1/
is (correctly) just the first line of Antigone but the Word List on the right is (obviously) for considerably more. The latter is based on
https://gu658.us1.eldarioncloud.com/word-list/urn:cts:greekLit:tlg0011.tlg002.perseus-grc2:1/
which shows 675 tokens for "Antigone 1" because of some issue with Giuseppe's XML (which again, I doubt is Giuseppe's fault but some issue in the citation scheme in the source XML).
@gregorycrane Who should I bring this up with besides Giuseppe? Lisa? Matt?
Just to update from the email discussion: I'm now more confident that the problem does NOT lie in the source text XML but just in Giuseppe's references. It seems Capitains is correctly honouring the citation mapping but Giuseppe's lemmatisation is not. As a workaround, I may just try to manually correct the database in GVT with some basic heuristics.
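The thread does not spell out which heuristics were used, so the following is only a sketch of one possible rule, assuming the affected works are cited by a single line-number level and that the bad refs all take the form 1.N. The names `SPURIOUS_PREFIX`, `normalise_ref` and `normalise_refs` are hypothetical.

```python
import re

# Assumption: a bad ref is exactly "1." followed by a line number.
SPURIOUS_PREFIX = re.compile(r"^1\.(\d+)$")


def normalise_ref(ref):
    """Strip a spurious leading "1." from a ref; leave anything else untouched."""
    match = SPURIOUS_PREFIX.match(ref)
    return match.group(1) if match else ref


def normalise_refs(refs):
    """Apply normalise_ref to every ref, preserving order."""
    return [normalise_ref(r) for r in refs]


print(normalise_refs(["171", "1.172", "1.173", "174"]))
# ['171', '172', '173', '174']
```

A rule like this would be wrong for works with a genuine two-level citation scheme (book.line, for example), so it would have to be restricted to works known to be cited by line only.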
This has been fixed with a bunch of manual corrections to Giuseppe's bad references. The corrections may not be comprehensive across the entire corpus, but they should cover the core.
I don't think there are 1,496 tokens in Soph. Ajax 1-10: https://gu658.us1.eldarioncloud.com/word-list/urn:cts:greekLit:tlg0011.tlg003.perseus-grc2:1-10/?page=all