OpenGreekAndLatin / First1KGreek

XML files for the works in the First Thousand Years of Greek Project. Please see our Wiki on how to contribute.
https://opengreekandlatin.github.io/First1KGreek/
Creative Commons Attribution Share Alike 4.0 International

How to handle commentaries? #201

Open annettegessner opened 8 years ago

annettegessner commented 8 years ago

At some point, commentaries should include a reference to the correct CTS URN of the work that is commented on. Until we figure out how to do this, sections containing the original text and the comment (normally each in its own paragraph tag) will be numbered consecutively. See #162

Another idea - depending on the structure of the file in question - would be to either number the original textpart e.g. 1a and the comment 1b or to number each paragraph in the file consecutively and figure out what belongs to what afterwards.

PonteIneptique commented 8 years ago

I'd be happy to discuss this one with you :) (later this afternoon if possible)

annettegessner commented 8 years ago

Thanks, Thibault! One possible solution we just discussed was to put the reference to the original text in an attribute

    <p corresp="p. 89b23">p. 891b23 Τὰ ζητούμενά ἐστιν ἴσα τὸν ἀριθμὸν ὅσαπερ ἐπιστάμεθα.</p>

but we'll wait for some input on this matter before we decide.

annettegessner commented 8 years ago

List of issues of works that are commentaries:

138 (maybe, since there's no apparent quoted text from the Greek original)

146

147

148

149

150

154

159

162 (closed)

286 (about different Origenes files)

...

annettegessner commented 8 years ago

2 suggestions on how to mark up the "original" text in a commentary:

From the Homer Multitext, thanks to Lenny:

<div ana="1" n="2000" type="scholion">
   <div type="symbol">
      <p><add place="supralinear"><num value="13">ΙΓ</num></add>Ἄλλοι</p>
   </div>
   <div type="ref">
      <p>urn:cts:greekLit:tlg0012.tlg001.msB:2.1</p>
   </div>
   <div type="comment">
      <p><persName n="pers15"> Ζηνόδοτος </persName><rs type="waw">ὦλλοι</rs> γράφει· κακῶς· ἐλλέιπει γὰρ ὁ ποιητὴς τοῖς
         ἄρθροις ἀεί‡</p>
   </div>
</div>

From the Hyperdonat project, thanks to Thibault:

<whatever container tag [probably div]>
<p or div n="1"><quote corresp="urn of line 1 of the Aeneid">Arma virumque cano</quote> I am the comment on this text</p or div>
</whatever>

I prefer the second solution since it is simpler and the URN of the quoted text does not appear in the visible text.
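To illustrate why the second pattern is easy to process, here is a minimal Python sketch that pulls the URN, the quoted lemma, and the comment out of such markup. The sample markup, the placeholder URN `urn:cts:example:aeneid:1.1`, and the function name `extract_lemmata` are assumptions made for illustration; real files in this repository use the TEI namespace, which this sketch leaves out.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample following the second pattern.
# The URN is a placeholder, not a verified CTS URN.
SAMPLE = """<div type="textpart" n="1">
  <p n="1"><quote corresp="urn:cts:example:aeneid:1.1">Arma virumque cano</quote> I am the comment on this text</p>
</div>"""


def extract_lemmata(xml_text):
    """Return (urn, quoted text, comment) for every <p> that starts with a <quote>."""
    root = ET.fromstring(xml_text)
    results = []
    for p in root.iter("p"):
        quote = p.find("quote")
        if quote is None:
            continue  # paragraph without a quoted lemma
        # The comment is the text that follows the closing </quote> tag.
        comment = (quote.tail or "").strip()
        results.append((quote.get("corresp"), (quote.text or "").strip(), comment))
    return results


for urn, lemma, comment in extract_lemmata(SAMPLE):
    print(urn, "|", lemma, "|", comment)
```

Because the URN sits in an attribute, the visible text of the paragraph stays exactly as printed in the edition.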

annettegessner commented 8 years ago

One example of how it could look: tlg9004.tlg001.opp-grc1, see #162:

<div type="textpart" subtype="chapter" n="3">
(...)
<p><quote corresp="URN of Aristotle, Posterior Analytics, 89b23">p. 891b23 Τὰ ζητούμενά ἐστιν ἴσα τὸν ἀριθμὸν ὅσαπερ ἐπιστάμεθα.</quote>
<lb n="5"/> Ὅτι φυσει τροτερον το εἰ ἔστι· του ὅτι ἔστιν· ὁ γὰρ τὸ ὅτι
χει τόδε τῷδε ζητῶν, ὡς ὁμολογούμενον ἤδη εἶναι τὸ ὑποκείμενον περὶ
οὖ ζητεῖ, &lt;ζητεῖ&gt; εἰ ὑπάρχει αὐτῷ τι ἢ μή. τὸ δὲ ὅσαπερ ἐπιστάμεθα <lb n="5"/>
ἀντὶ τοῦ ὅσων ἐπιστήμην λαμβάνομεν᾿· οὐ γὰρ ἃ ἐπιστάμεθα ζητοῦμεν,
ἀλλ᾿ ἅ γνῶναι θέλομεν.</p>
(...)
</div>

The problem is determining the correct URN. At this point it seems we only have an entry in the Perseus Catalog, but the text is neither on Perseus nor on GitHub, so we don't have a CTS-compliant version of it. (Speaking of which: this way of citing Aristotle may be canonical, but it's not logical, so we will think about how we want to cite it. Same problem as with Plato.) Furthermore, as this example shows, the OCR tends to be bad with strings like this citation, so I had to look at the scan to see that the commentator cited Aristotle as "89b23" and not "891b23". So just taking the citation from the XML can be risky.
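One way to catch OCR slips like "891b23" before they end up in an attribute would be a range check on the Bekker reference. The sketch below is only an illustration: the function names are hypothetical, and the range 71a1 to 100b17 for the Posterior Analytics is an assumption that should be verified against the edition before relying on it.

```python
import re

# Bekker reference: page number, column a or b, line number (e.g. "89b23").
BEKKER = re.compile(r"^(\d{1,4})([ab])(\d{1,2})$")


def parse_bekker(ref):
    """Split a Bekker reference into (page, column, line), or return None."""
    match = BEKKER.match(ref.strip())
    if match is None:
        return None
    page, column, line = match.groups()
    return int(page), column, int(line)


def in_range(ref, first, last):
    """True if ref parses and lies between first and last (inclusive)."""
    parsed = parse_bekker(ref)
    lower = parse_bekker(first)
    upper = parse_bekker(last)
    if parsed is None or lower is None or upper is None:
        return False
    # Tuples compare page first, then column ("a" < "b"), then line.
    return lower <= parsed <= upper


# Assumed extent of the Posterior Analytics; verify before use.
APO_FIRST, APO_LAST = "71a1", "100b17"

for candidate in ("89b23", "891b23"):
    print(candidate, in_range(candidate, APO_FIRST, APO_LAST))
```

A check like this only flags suspicious references. It cannot say what the correct reading is, so the scan still has to be consulted.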

sonofmun commented 8 years ago

I agree with @annettegessner that the second method is to be preferred. I don't like having the URN in the text and would much rather have it on an attribute. As for the URN, we can at least have the work URN in there and, for now, use whatever citation scheme is in the text itself, unless, of course, we have a better one. If there is no discernible citation scheme in the commentary, however, we will probably want to use some sort of text-reuse detection algorithm to automatically find citation candidates instead of trying to find the citations by hand.
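As a rough idea of what such candidate detection could look like, here is a naive Python sketch based on character n-gram overlap after stripping diacritics. It is a stand-in for a real text-reuse tool, not something the project uses; the function names and the n-gram size are assumptions.

```python
import unicodedata


def normalize(text):
    """Lowercase, strip diacritics, spaces and punctuation, unify final sigma."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Combining marks, spaces and punctuation are not alphabetic, so they drop out.
    letters = "".join(ch for ch in decomposed if ch.isalpha())
    return letters.replace("ς", "σ")


def ngrams(text, n=4):
    """Set of character n-grams of the normalized text."""
    s = normalize(text)
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def overlap(candidate, source, n=4):
    """Share of the candidate's n-grams that also occur in the source (0.0 to 1.0)."""
    cand, src = ngrams(candidate, n), ngrams(source, n)
    if not cand:
        return 0.0
    return len(cand & src) / len(cand)


lemma = "Τὰ ζητούμενά ἐστιν ἴσα τὸν ἀριθμὸν ὅσαπερ ἐπιστάμεθα."
ocr_variant = "τα ζητουμενα εστιν ισα τον αριθμον οσαπερ επισταμεθα"
print(overlap(ocr_variant, lemma))
```

Passages of the commentary that score high against a passage of the source text would then be candidates for review by hand, not confirmed citations.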

jduff-chs commented 8 years ago

Would we also like to add URN citations to commentaries which do not reprint the original text? In many of the fragments of Origenes's commentaries and homilies, editors have given the biblical citation, but do not reprint the actual biblical text. It would be possible to go through and encode these citations as URNs, but <quote> tags don't seem to apply in that case. Does anyone have any recommendations?

For example:

<div type="textpart" subtype="fragment" n="3">

<head>Zu Joh. 1, 5.</head>
<p>Τὸ θεολογούμενον φῶς λυτικόν ἐστι πάσης σκοτίας καὶ ἀγνοίας...

PonteIneptique commented 8 years ago

Could you give a slightly longer example? I guess <q> is the right way to go.

(quoted) contains material which is distinguished from the surrounding text using quotation marks or a similar method, for any one of a variety of reasons including, but not limited to: direct speech or thought, technical terms or jargon, authorial distance, quotations from elsewhere, and passages that are mentioned but not used. [3.3.3 Quotation] http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-q.html

jduff-chs commented 8 years ago

I wasn't very clear, I'm sorry! Each one of these commentary fragments has a biblical citation, so that fits "mentioned but not used." I gave an example of the beginning of one of those fragments, where the citation, Zu Joh. 1, 5. was given in a header above the text of the fragment, and the text is contained in a series of <p> tags.

To restate the problem, I am looking for a way to encode a citation for every div as a URN, instead of these citation headers.

If I'm interpreting the documentation right, one method would be to enclose each fragment in a

<q type="mentioned"></q>

tag, between div and p, like so:

<div type="textpart" subtype="fragment" n="3">
<q type="mentioned" corresp="URN of John 1,5">
<p>Τὸ θεολογούμενον φῶς λυτικόν ἐστι πάσης σκοτίας καὶ ἀγνοίας...</p>
</q>
</div>

Alternatively, I'm wondering if we could just add the corresp attribute to the div?

e.g.

<div type="textpart" subtype="fragment" n="3" corresp="URN of John 1,5">
<p>Τὸ θεολογούμενον φῶς λυτικόν ἐστι πάσης σκοτίας καὶ ἀγνοίας...</p>
</div>

PonteIneptique commented 8 years ago

I'd prefer <q> the way you propose, if it is compliant (otherwise put the q in the p). It should make more sense and is more correct :) So option 1!

jduff-chs commented 8 years ago

<q> is compliant between <div> and <p>! I'll add this layer to these texts this week.

Perhaps a silly question: what are the best URNs to use for biblical citations?

annettegessner commented 8 years ago

This is no silly question at all. Since there are a lot of Bible versions, it would - in theory - depend on the specific version this author chose to cite. Do we already have a solution for matters like these, @sonofmun ?

sonofmun commented 8 years ago

The URN for the New Testament text group is urn:cts:greekLit:tlg0031, for the Septuagint it is urn:cts:greekLit:tlg0527. Each biblical book, then, has its own URN, e.g., the Gospel of Matthew is urn:cts:greekLit:tlg0031.tlg001. So if you wanted to cite Matthew 1:1 in a commentary, the URN would be urn:cts:greekLit:tlg0031.tlg001:1.1. I think we decided to give work-level URNs instead of edition-level URNs to the cited texts in commentaries.
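The scheme above can be sketched as a small Python helper. The text-group IDs and the Matthew example come from this comment; the function names are hypothetical, and IDs for other biblical books would have to be looked up, not guessed.

```python
# Text-group identifiers as given in this thread.
NEW_TESTAMENT = "tlg0031"
SEPTUAGINT = "tlg0527"


def work_urn(text_group, work):
    """Work-level CTS URN, e.g. urn:cts:greekLit:tlg0031.tlg001."""
    return f"urn:cts:greekLit:{text_group}.{work}"


def passage_urn(text_group, work, chapter, verse):
    """Work-level URN plus a chapter.verse passage reference."""
    return f"{work_urn(text_group, work)}:{chapter}.{verse}"


# Matthew 1:1, using the work ID tlg001 named above.
print(passage_urn(NEW_TESTAMENT, "tlg001", 1, 1))
```

Leaving out the edition component keeps the reference at work level, as decided for cited texts in commentaries.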

jduff-chs commented 8 years ago

Great, thank you @annettegessner and @sonofmun, that should be very straightforward to accomplish. I'll start working on adding those work-level URNs tomorrow.

gcelano commented 7 years ago

I found a lot of quotations which are not encoded properly (i.e., no markup for them). See for example,

https://raw.githubusercontent.com/OpenGreekAndLatin/First1KGreek/master/data/tlg0018/tlg001/tlg0018.tlg001.opp-grc1.xml

and search for "(Gen. 1,2)"

This is a problem when trying to tokenize.
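A minimal Python sketch for listing such unmarked citations, assuming they all follow the parenthesized "(Book. chapter,verse)" shape of the example above. The pattern and the function name are assumptions and will miss other citation styles.

```python
import re

# Matches forms like "(Gen. 1,2)" or "(1 Cor. 3, 16)".
CITATION = re.compile(r"\(((?:[1-4]\s?)?[A-Z][a-z]+)\.\s*(\d+),\s*(\d+)\)")


def find_citations(text):
    """Return (book abbreviation, chapter, verse) for each parenthesized citation."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)))
        for m in CITATION.finditer(text)
    ]


sample = "some text (Gen. 1,2) more text (Joh. 1, 5) end"
print(find_citations(sample))
```

Running this over the plain text of a file gives a list of places to review; mapping each abbreviation to a work URN would still need an agreed lookup table.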

sonofmun commented 7 years ago

Hi @gcelano, when you notice things like this, it would be great if you could correct them and then do a pull request with the changes.

gcelano commented 7 years ago

@sonofmun, this problem covers a very large number of varied instances. Correcting them requires an agreed-on solution, a common strategy, and dedicated work, which I cannot provide at the moment because I am working on the many issues related to tokenization and POS-tagging in https://github.com/gcelano/LemmatizedAncientGreekXML

sonofmun commented 7 years ago

Hi @gcelano, you have already helped a lot by finding this instance. If you do find future problems along these same lines, it would be great if you could either note the files and the problems here or, even better, open a new issue for the file with a short description of what is breaking.