OpenGreekAndLatin / First1KGreek

XML files for the works in the First Thousand Years of Greek Project. Please see our Wiki on how to contribute.
https://opengreekandlatin.github.io/First1KGreek/
Creative Commons Attribution Share Alike 4.0 International

Same work twice #2740

Open gcelano opened 1 year ago

gcelano commented 1 year ago

There are a number of files deriving from the same printed edition, such as tlg0057.tlg034.1st1K-grc1 and tlg0057.tlg034.1st1K-grc2, which apparently have slightly different markup: is there a reason for that? More importantly, can we assume that every time the first part of a cts:urn ("tlg0057.tlg034") is the same, the corresponding files contain exactly the same work and not, for example, only part of it (e.g., one work that was too long and has been split into two parts)?

lcerrato commented 1 year ago

@gcelano Here there are two separate editions indicated in both the metadata and in Scaife.


I believe there can be partial works and other differences that would negate your statement (they would not be "exactly the same work"). grc1, grc2, etc. always indicate a different edition. There is no special indication in the URN itself that suggests a work is partial or incomplete. (I do not think there is anything split in OGL any longer.)

@AlisonBabeu thoughts?

A glance at the word counts can show differences across editions. https://opengreekandlatin.github.io/First1KGreek/

gcelano commented 1 year ago

I am trying to get only one edition per work, but at the same time I do not want to filter out works split into two files. Since the files are processed automatically, there would be no way for me to distinguish between duplicates and split works if this is not encoded somewhere (in the name of the file or maybe within __cts__.xml).
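A minimal sketch of the grouping step described above, assuming file names follow the `textgroup.work.version.xml` pattern seen in this thread (the function name `group_by_work` is hypothetical). It only shows which works have more than one version file; on its own it cannot tell a second edition from a split work, which is exactly the problem raised here.

```python
from collections import defaultdict


def group_by_work(filenames):
    """Map 'textgroup.work' to the sorted list of version identifiers found.

    Assumes names shaped like 'tlg0057.tlg034.1st1K-grc1.xml'; anything with
    fewer than three dot-separated parts (e.g. '__cts__.xml') is skipped.
    """
    groups = defaultdict(list)
    for name in filenames:
        stem = name[:-4] if name.endswith(".xml") else name
        parts = stem.split(".")
        if len(parts) < 3:
            continue  # not a version-level file name
        work = ".".join(parts[:2])
        version = ".".join(parts[2:])
        groups[work].append(version)
    return {work: sorted(versions) for work, versions in groups.items()}


files = [
    "tlg0057.tlg034.1st1K-grc1.xml",
    "tlg0057.tlg034.1st1K-grc2.xml",
    "__cts__.xml",
]
multi = {w: v for w, v in group_by_work(files).items() if len(v) > 1}
print(multi)
```

Works that appear in `multi` are the ones that need a further check against the metadata before one version is dropped.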

AlisonBabeu commented 1 year ago

Hi @lcerrato and @gcelano (nice to hear from you, hope all is well!). As I've been going through the list of works in the Scaife viewer over the last year or so, I've been combining files whenever I have found a work spread across more than one TEI XML file (that has only happened once or twice, such as with the letters of Augustine!). If a file has only part of a work rather than the whole work, we've typically been indicating that in the header metadata and in the __cts__.xml file, not in the URN.

For example, with Pappus of Alexandria's work Synagoge, we only have Book 1 (only the first volume was digitized). We've captured that fact in the CTS metadata, where the label reads "Synagoge, Book 1":

`Synagoge, Book 1 Pappus Alexandrinus. Pappi Alexandrini collectionis quae supersunt, Volume 1. Hultsch, Friedrich, editor. Leipzig: Weidmann, 1876.`

There are a few places where URNs have been used inconsistently and 1st1K-grc2 or 1st1K-lat1 have been used to represent supplementary parts of a printed edition (say a preface, intro, index, etc.), but I've been changing those as I find them so we are intellectually consistent across the whole collection. In general, if there is a 1st1K-grc1 and a 1st1K-grc2, it means there are two editions of a work.
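
For automated processing, the label check described above could be sketched roughly as follows. This is an illustration under assumptions, not the project's tooling: the sample XML, its placeholder URN (`tlgXXXX.tlgYYY`), the element layout, and the helper names are all hypothetical, and the "Book/Volume/Part" pattern is only a heuristic that will miss or misread some labels.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample; the real __cts__.xml layout and URN may differ.
SAMPLE = """<ti:work xmlns:ti="http://chs.harvard.edu/xmlns/cts"
         urn="urn:cts:greekLit:tlgXXXX.tlgYYY">
  <ti:edition urn="urn:cts:greekLit:tlgXXXX.tlgYYY.1st1K-grc1">
    <ti:label xml:lang="lat">Synagoge, Book 1</ti:label>
  </ti:edition>
</ti:work>"""

# Heuristic: a label mentioning "Book 1", "Volume 2", "Part 3" hints at a partial work.
PARTIAL_HINT = re.compile(r"\b(books?|volume|part)\s+\w+", re.IGNORECASE)


def local(tag):
    """Strip the '{namespace}' prefix ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def edition_labels(xml_text):
    """Return {version URN: first label text} for each edition/translation element."""
    root = ET.fromstring(xml_text)
    labels = {}
    for elem in root.iter():
        if local(elem.tag) in ("edition", "translation"):
            for child in elem:
                if local(child.tag) == "label":
                    labels[elem.get("urn")] = (child.text or "").strip()
                    break
    return labels


def looks_partial(label):
    """True when the label matches the 'Book/Volume/Part N' heuristic."""
    return bool(PARTIAL_HINT.search(label))


for urn, label in edition_labels(SAMPLE).items():
    print(urn, "|", label, "|", looks_partial(label))
```

Matching on local tag names keeps the sketch independent of the exact namespace, which is assumed here rather than confirmed.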

gcelano commented 1 year ago

Hi @AlisonBabeu, thank you for the explanation! In general, I think that the more explicit such information is, the better. Maybe isolating it in an attribute, or even better in the file name (for example, *.1st1K-grc1_part1.xml), would make it easier to distinguish split works from files that contain different editions of the same text.
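
To illustrate what such a file-name convention would allow, here is a small sketch. The `_partN` suffix is only the proposal made in this comment, not something the repository currently uses, and `parse_name` is a hypothetical helper.

```python
import re

# textgroup.work.version[_partN].xml ; the _partN suffix is the proposed, not existing, convention
NAME = re.compile(
    r"^(?P<textgroup>[^.]+)\.(?P<work>[^.]+)\.(?P<version>[^.]+?)"
    r"(?:_part(?P<part>\d+))?\.xml$"
)


def parse_name(filename):
    """Return (textgroup, work, version, part) or None if the name does not match.

    'part' is an int when the proposed suffix is present, otherwise None.
    """
    m = NAME.match(filename)
    if m is None:
        return None
    part = m.group("part")
    return (
        m.group("textgroup"),
        m.group("work"),
        m.group("version"),
        int(part) if part else None,
    )


print(parse_name("tlg0057.tlg034.1st1K-grc1_part1.xml"))
print(parse_name("tlg0057.tlg034.1st1K-grc2.xml"))
```

With this, two files sharing the same version but carrying different part numbers would be parts of one edition, while files with different versions would be separate editions.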

gregorycrane commented 1 year ago

Hi Giuseppe -- Are you doing a new lemmatization/treebanking of the Greek?

P.S. It is great to see you on this thread. I know you have been active in other areas but it is always wonderful to see you!

gcelano commented 1 year ago

Hi @gregorycrane! I have tokenized and morphosyntactically annotated all texts of canonical-Greek and first1KGreek here (about 34M tokens!). The plan is to regenerate these data/annotations every time a major release of the Perseus texts is available, since all annotations are generated by machine learning. However, as the corpus is huge (at the moment, about 10GB), I have yet to figure out where and how best to release them.