OpenGreekAndLatin / First1KGreek

XML files for the works in the First Thousand Years of Greek Project. Please see our Wiki on how to contribute.
https://opengreekandlatin.github.io/First1KGreek/
Creative Commons Attribution Share Alike 4.0 International

Inconsistent or problematic div tags #2805

Open fffoivos opened 1 month ago

fffoivos commented 1 month ago

I am looking to perform statistical analysis on the Greek corpus's annotation of paragraphs (the type and subtype attributes of divs). I have found quite a large variety of values that do not seem to be clearly defined, as well as some spelling mistakes.

Is there a glossary of terms?

I am attaching a JSON file that you might find useful in case you want to create one. It contains all type-subtype pairs, with one example path in data/ for each.

div_hierarchy.json
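
For anyone who wants to rebuild a similar inventory, here is a minimal sketch of how such a mapping could be collected. It is not the script that produced the attached file. The function names `local_name` and `collect_div_pairs` and the `type|subtype` key format are assumptions made for illustration, and the layout of div_hierarchy.json may differ.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://www.tei-c.org/ns/1.0}'."""
    return tag.rsplit("}", 1)[-1] if isinstance(tag, str) else ""


def collect_div_pairs(root_dir):
    """Map each 'type|subtype' pair to the first file in which it appears.

    The key format is a hypothetical choice; a missing attribute is
    recorded as an empty string.
    """
    pairs = {}
    for path in sorted(Path(root_dir).rglob("*.xml")):
        try:
            tree = ET.parse(path)
        except ET.ParseError:
            continue  # skip files that are not well-formed XML
        for elem in tree.iter():
            if local_name(elem.tag) != "div":
                continue
            key = f"{elem.get('type', '')}|{elem.get('subtype', '')}"
            pairs.setdefault(key, path.as_posix())  # keep first example only
    return pairs
```

Calling `collect_div_pairs("data")` from the repository root and passing the result to `json.dump` would give one example path per pair. Files that fail to parse are skipped silently here, so a real audit would probably want to log them.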

lcerrato commented 1 month ago

@fffoivos Thank you. I can take a look at some of these, particularly the misspellings, but there is no glossary available. I may not be able to account for some of the prior encoding choices. Many of these are not going to enable capture or display as written.

fffoivos commented 4 weeks ago

@lcerrato Thank you for looking into this. I can submit a PR focusing specifically on fixing clear typographical errors in the div attributes such as:

without changing any of the encoding choices.
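
One way to shortlist likely typos before opening such a PR is to compare rare attribute values against frequent ones. The sketch below is only an illustration of that idea, not the method used in this issue. The function name `suspect_typos` and the thresholds `min_common` and `cutoff` are assumptions, and every suggestion would still need manual review.

```python
from collections import Counter
from difflib import get_close_matches


def suspect_typos(values, min_common=5, cutoff=0.8):
    """Pair each rare value with a frequent value that it closely resembles.

    `values` is a flat list of attribute values, one entry per occurrence.
    `min_common` and `cutoff` are arbitrary illustrative thresholds.
    """
    counts = Counter(values)
    common = [v for v, n in counts.items() if n >= min_common]
    suspects = {}
    for value, n in counts.items():
        if n >= min_common:
            continue  # frequent values are treated as the reference set
        match = get_close_matches(value, common, n=1, cutoff=cutoff)
        if match:
            suspects[value] = match[0]
    return suspects
```

The result maps each suspicious value to its nearest frequent neighbour. Rare values that are legitimate will also show up when they happen to resemble a common one, so this only narrows down what to check by hand.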

lcerrato commented 3 weeks ago

@fffoivos Where the errors were obvious, the changes were made. There were also some structural changes where commentary was appended at the end of an edition. Please let me know what issues remain as I may have missed something.

fffoivos commented 3 weeks ago

@lcerrato I made some suggestions with spelling corrections in #2806.

lcerrato commented 3 weeks ago

@fffoivos Just reopening until I hear an update on the results you see now.