biblicalhumanities / greek-new-testament

Greek New Testament

Paragraph divisions #5

Open e-ehrhardt opened 9 years ago

e-ehrhardt commented 9 years ago

I'm curious if there is any interest in reflecting intermediate discourse organization between the "sentence" and "book" level, namely at the paragraph level as the SBLGNT marks it?

Such information could be useful for searches along the lines of "Find a paragraph/pericope which contains XYZ", where XYZ may be certain vocabulary characteristics, grammatical constructions, etc.

Just a suggestion. Or if such data could be obtained another way, I'd be happy to hear about that as well.

jonathanrobie commented 9 years ago

Yes, I think that would be a good idea. I'll discuss this with Randall and Andi.

You can find the paragraph boundaries in the original SBLGNT XML files: http://sblgnt.com/download/sblgnt.osis.zip

But it would be more convenient if they were preserved in our trees.

e-ehrhardt commented 9 years ago

However, places where a paragraph boundary falls inside a single syntax-tree sentence may cause problems.

E.g. 1 John 3:11 has a paragraph boundary before Ὅτι, but that is within the tree for 1Jn3:10:14-3:11:12 in the annotated trees.
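A minimal sketch of how such conflicts could be detected, assuming references are represented as (chapter, verse, word) tuples. The `Ref` type and `splits_sentence` function are hypothetical names, not part of the treebank tooling, and the word position of Ὅτι in 3:11 is assumed to be 1.

```python
from typing import NamedTuple


class Ref(NamedTuple):
    """A word position; tuples compare chapter, then verse, then word."""
    chapter: int
    verse: int
    word: int


def splits_sentence(boundary: Ref, start: Ref, end: Ref) -> bool:
    """True if a paragraph beginning at `boundary` starts strictly inside
    the sentence spanning `start`..`end` (both inclusive)."""
    return start < boundary <= end


# Sentence span taken from the example above: 1Jn 3:10:14 to 3:11:12.
sentence_start = Ref(3, 10, 14)
sentence_end = Ref(3, 11, 12)
# Assumption: the paragraph break before Ὅτι sits at word 1 of 3:11.
paragraph_start = Ref(3, 11, 1)

print(splits_sentence(paragraph_start, sentence_start, sentence_end))
```

A boundary that coincides with the first word of a sentence is not reported, since it does not cut through the tree.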

jonathanrobie commented 9 years ago

Good point - maybe Randall and Andi removed the paragraphs to have more freedom in their analysis.

Adding them back as either (1) milestones or (2) out of line markup may still be useful, but probably not in the original trees if the paragraph boundaries disagree with the sentence boundaries.

rkjtan commented 9 years ago

Because the trees are generated by parsing words, phrases, & clauses together, the longer the string of text, the more processing power is needed, since the number of possible trees multiplies. Even long sentences can take longer than is practical for tree parsing purposes. So paragraph or discourse level annotations were not incorporated into the current system for generating the syntax trees. Moreover, paragraph & discourse level annotations are very much uncertain & would require an entirely new, well-thought-out project. However, it is definitely possible to add such annotations on top of the trees, & we would like to see others build such levels of annotation on top of the trees even before we do (if we do).

e-ehrhardt commented 9 years ago

SBLGNT states that it "capitalizes (1) the first word of a paragraph; (2) the first word of direct speech; and (3) proper nouns." Even though the formatted version of SBLGNT sometimes appears to indent a new paragraph in the middle of a sentence at the beginning of some direct speech, the overall heuristic of making use of sentence-initial capitals is useful.

For my own purposes, I just added a wrapper around each series of sentence nodes, starting a new paragraph at every sentence whose first word (by morphID, not tree order) is capitalized. For the searches I was conducting, it was an improvement over searching by book chapter, and it didn't increase processing time much.
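The heuristic could be sketched roughly as follows. This is not the actual script; it assumes each sentence is a list of (morph ID, word) pairs, and the function names and toy tokens (including their capitalization) are made up for illustration. Since SBLGNT also capitalizes direct speech and proper nouns, a sentence that begins with either will start a new "paragraph" too.

```python
def first_word(sentence):
    """Return the word with the lowest morph ID (surface order, not tree order)."""
    return min(sentence, key=lambda token: token[0])[1]


def starts_capitalized(word):
    """True if the first alphabetic character of the word is uppercase."""
    first_letter = next((ch for ch in word if ch.isalpha()), "")
    return first_letter.isupper()


def group_into_paragraphs(sentences):
    """Start a new group at every sentence whose first word is capitalized."""
    paragraphs = []
    for sentence in sentences:
        if not paragraphs or starts_capitalized(first_word(sentence)):
            paragraphs.append([])
        paragraphs[-1].append(sentence)
    return paragraphs


# Toy data: tokens may be listed in tree order, so morph IDs decide the first word.
sentences = [
    [(1, "Ἐν"), (2, "ἀρχῇ")],
    [(4, "ἦν"), (3, "οὗτος")],    # first word by morph ID is "οὗτος"
    [(5, "Πάντα"), (6, "ἐγένετο")],
]

paragraphs = group_into_paragraphs(sentences)
print([len(p) for p in paragraphs])
```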

jtauber commented 9 years ago

I'd actually love to have paragraph (and pericope) boundaries in MorphGNT too, even if just described out-of-band. I guess that can be fairly easily generated from the original SBLGNT files, and I don't have to worry about concurrent, overlapping markup either.

jonathanrobie commented 9 years ago

I suspect paragraph and pericope boundaries could be done as inline milestones without much effect on processing. It's not a lot of nodes, especially compared to the number of nodes used for other things.

The real arguments for doing it out-of-band are (a) supporting multiple interpretations, and (b) allowing work on these boundaries to be done independently of work on the trees per se. Then again, you could argue that syntax trees should be done out-of-band on the same grounds. Some of this is a matter of taste.
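To illustrate argument (a), here is a rough sketch of out-of-band boundaries kept apart from the trees. It assumes words carry sequential integer IDs; the edition names and start positions are hypothetical placeholders, not real data.

```python
from bisect import bisect_right

# Hypothetical standoff data: for each interpretation, the sorted word IDs
# at which a paragraph begins. The trees themselves are never touched.
paragraph_starts = {
    "edition_a": [1, 40, 95],
    "edition_b": [1, 52, 95, 130],
}


def paragraph_index(word_id, starts):
    """Return the 0-based paragraph containing word_id, given sorted start IDs."""
    return bisect_right(starts, word_id) - 1


def same_paragraph(word_a, word_b, starts):
    """True if both words fall in the same paragraph under this segmentation."""
    return paragraph_index(word_a, starts) == paragraph_index(word_b, starts)


# The same pair of words can be grouped differently by each interpretation.
for name, starts in paragraph_starts.items():
    print(name, same_paragraph(45, 60, starts))
```

Because each segmentation is just a list of start positions, a new interpretation can be added or revised without regenerating any tree.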

jtauber commented 7 years ago

See https://github.com/biblicalhumanities/Nestle1904/issues/5 and my comment there.

I will note in response to https://github.com/biblicalhumanities/greek-new-testament/issues/5#issuecomment-78584630 that SBLGNT is slightly inconsistent with capitalization in relation to paragraph breaks.

jonathanrobie commented 7 years ago

I've done some work on this - see https://github.com/biblicalhumanities/Nestle1904/issues/5.

jonathanrobie commented 7 years ago

SBLGNT is available in XML with paragraphs. See their download page.

jtauber commented 7 years ago

Yes, that's what the MorphGNT API uses.

jonathanrobie commented 7 years ago

Nestle 1904 is also available in XML with paragraphs now:

https://github.com/biblicalhumanities/Nestle1904

I plan to maintain the paragraphs there as the upstream. In a distribution, I may well merge them into the treebanks, but I'm not sure whether that is a good idea.