Open gcelano opened 1 year ago
@gcelano Here there are two separate editions indicated in both the metadata and in Scaife.
I believe there can be partial works and other differences that would negate your statement (they would not be "exactly the same work"). grc1, grc2, etc. always indicate a different edition. There is no special indication in the URN itself that suggests a work is partial or incomplete. (I do not think there is anything split in OGL any longer.)
@AlisonBabeu thoughts?
A glance at the word counts can show differences across editions. https://opengreekandlatin.github.io/First1KGreek/
I am trying to get only one edition per work, but at the same time I do not want to filter out works split into two files. Since the files are processed automatically, there would be no way for me to distinguish between duplicates and split works if this is not encoded somewhere (in the name of the file or maybe within `__cts__.xml`).
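To make the problem concrete, here is a minimal sketch of the automatic grouping step. It assumes file names follow the `textgroup.work.version.xml` pattern seen in this thread (e.g. `tlg0057.tlg034.1st1K-grc1.xml`). The function names and the sample file list are made up for illustration and are not part of any existing tool.

```python
from collections import defaultdict


def group_by_work(filenames):
    """Map 'textgroup.work' to a sorted list of version identifiers.

    Assumes names of the form textgroup.work.version.xml; anything
    else (such as __cts__.xml) is skipped.
    """
    groups = defaultdict(list)
    for name in filenames:
        stem = name[:-4] if name.endswith(".xml") else name
        parts = stem.split(".", 2)
        if len(parts) != 3:
            continue  # not a text file in the assumed naming scheme
        textgroup, work, version = parts
        groups[f"{textgroup}.{work}"].append(version)
    return {work: sorted(versions) for work, versions in groups.items()}


def works_with_multiple_versions(filenames):
    """Return only the works that have more than one version file."""
    grouped = group_by_work(filenames)
    return {work: v for work, v in grouped.items() if len(v) > 1}


# Hypothetical directory listing, for illustration only.
files = [
    "tlg0057.tlg034.1st1K-grc1.xml",
    "tlg0057.tlg034.1st1K-grc2.xml",
    "tlg0001.tlg001.1st1K-grc1.xml",
    "__cts__.xml",
]
candidates = works_with_multiple_versions(files)
```

The grouping only flags `tlg0057.tlg034` as having two version files. The names alone cannot tell whether those are two editions of the whole work or two parts of one work, which is exactly the information that would need to be encoded somewhere.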
hi @lcerrato and @gcelano (nice to hear from you, hope all is well!). As I've been going through the list of works in the Scaife viewer in the last year or so, I've been combining files when I have found a work spread across more than one TEIXML file (that has only happened once or twice, such as with the letters of Augustine!). If a file has only part of a work rather than say the whole work, we've been indicating that typically in the header metadata and in the cts.xml file, not in the URN.
For example, with Pappus of Alexandria's work Synagoge, we only have Book 1 (only the first volume was digitized). We've captured that fact in the cts_metadata, where the label reads "Synagoge, Book 1".
Hi @AlisonBabeu, thank you for the explanation! In general, I think that the more explicit such information is, the better. Maybe isolating it in an attribute, or even better in the file name (for example, `*.1st1K-grc1_part1.xml`), would make it easier to distinguish such files from files with different editions but the same text.
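If a suffix like that were adopted, an automatic pipeline could pick it up with a single pattern. The sketch below is purely illustrative: the `_partN` suffix is only the proposal made in this comment, not an existing convention in the repositories, and `split_part_suffix` is a hypothetical helper.

```python
import re

# Hypothetical convention: an optional "_part<N>" right before ".xml".
PART_RE = re.compile(r"^(?P<urn>.+?)(?:_part(?P<part>\d+))?\.xml$")


def split_part_suffix(filename):
    """Return (name_without_suffix, part_number_or_None) for a file name."""
    match = PART_RE.match(filename)
    if match is None:
        raise ValueError(f"not an .xml file name: {filename!r}")
    part = match.group("part")
    return match.group("urn"), (int(part) if part is not None else None)


with_part = split_part_suffix("tlg0057.tlg034.1st1K-grc1_part1.xml")
without_part = split_part_suffix("tlg0057.tlg034.1st1K-grc1.xml")
```

Files that share the same name before the suffix but carry different part numbers would then be parts of one edition, while files that differ in the version identifier (`grc1` vs. `grc2`) would remain distinct editions.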
Hi Giuseppe -- Are you doing a new lemmatization/treebanking of the Greek?
P.S. It is great to see you on this thread. I know you have been active in other areas but it is always wonderful to see you!
Hi @gregorycrane! I have tokenized and morphosyntactically annotated all texts of canonical-Greek and first1KGreek here (about 34M tokens!). The plan is to newly generate these data/annotations every time a major release of the Perseus texts is available, since all annotations are generated by machine learning. However, as the size of the corpus is huge (at the moment, about 10GB), I have yet to figure out where/how best to release them.
There are a number of files deriving from the same printed edition, such as `tlg0057.tlg034.1st1K-grc1` and `tlg0057.tlg034.1st1K-grc2`, which apparently have slightly different markup: is there a reason for that? More importantly, can we assume that every time the first part of a cts:urn (`tlg0057.tlg034`) is the same, the corresponding files contain exactly the same work and not, for example, part of it (e.g., one work has been split into two parts, being too long)?