Open e-ehrhardt opened 9 years ago
Yes, I think that would be a good idea. I'll discuss this with Randall and Andi.
You can find the paragraph boundaries in the original SBLGNT XML files: http://sblgnt.com/download/sblgnt.osis.zip
But it would be more convenient if they were preserved in our trees.
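A minimal sketch of pulling paragraph starts out of OSIS-style XML, assuming paragraphs are `<p>` elements and verses are marked with `<verse osisID="...">` milestones. The inline sample is hypothetical and simplified; the real SBLGNT file uses an XML namespace and may be structured differently, so treat the element names here as assumptions to check against the download.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample; not copied from the SBLGNT OSIS file.
SAMPLE = """<div type="book" osisID="1John">
  <p><verse osisID="1John.3.10"/>w1 w2 w3</p>
  <p><verse osisID="1John.3.11"/>w4 w5</p>
  <p>w6 <verse osisID="1John.3.12"/>w7</p>
</div>"""

def paragraph_first_verses(xml_text):
    """Return (first osisID inside the paragraph, starts_mid_verse) per <p>."""
    root = ET.fromstring(xml_text)
    result = []
    for p in root.iter("p"):  # a namespaced file would need "{ns}p" here
        verse = p.find(".//verse")
        osis_id = verse.get("osisID") if verse is not None else None
        # Text before the first verse marker means the paragraph
        # begins in the middle of the preceding verse.
        starts_mid_verse = bool((p.text or "").strip())
        result.append((osis_id, starts_mid_verse))
    return result

starts = paragraph_first_verses(SAMPLE)
```

The `starts_mid_verse` flag matters because a paragraph that begins inside a verse cannot be located by verse reference alone.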
However, places where a paragraph boundary falls inside a single syntax-tree sentence may cause problems.
E.g. 1 John 3:11 has a paragraph boundary before Ὅτι, but that is within the tree for 1Jn3:10:14-3:11:12 in the annotated trees.
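A rough sketch of how such conflicts could be detected automatically, assuming each tree sentence is known by its first and last word as (chapter, verse, word) and each paragraph boundary is identified by the word it precedes. The `Ref` and `Sentence` types and the function name are hypothetical, not part of the treebank.

```python
from typing import NamedTuple

class Ref(NamedTuple):
    chapter: int
    verse: int
    word: int

class Sentence(NamedTuple):
    start: Ref
    end: Ref

def conflicting_boundaries(sentences, boundaries):
    """Return (boundary, sentence) pairs where the boundary splits the sentence.

    A boundary sits *before* the word it names, so a boundary at the
    sentence's first word is fine; anything after that and up to the
    last word falls inside the tree.
    """
    conflicts = []
    for boundary in boundaries:
        for sentence in sentences:
            if sentence.start < boundary <= sentence.end:
                conflicts.append((boundary, sentence))
    return conflicts

# The 1 John example: tree 3:10:14-3:11:12, boundary before the first word of 3:11.
sentences = [Sentence(Ref(3, 10, 14), Ref(3, 11, 12))]
boundaries = [Ref(3, 11, 1), Ref(3, 10, 14)]
conflicts = conflicting_boundaries(sentences, boundaries)
```

Here only the boundary at 3:11:1 is reported; the one at 3:10:14 coincides with the start of the sentence and causes no conflict.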
Good point - maybe Randall and Andi removed the paragraphs to have more freedom in their analysis.
Adding them back as either (1) milestones or (2) out of line markup may still be useful, but probably not in the original trees if the paragraph boundaries disagree with the sentence boundaries.
Because the trees are generated by parsing words, phrases, & clauses together, the longer the string of text, the more processing power is needed, since the number of possible trees multiplies. Even long sentences can already take longer than is practical for tree parsing purposes. So, paragraph or discourse level annotations were not incorporated into the current system for generating the syntax trees. Moreover, paragraph & discourse level annotations are very much uncertain & would require an entirely new, well-thought-out project. However, it is definitely possible to add such annotations on top of the trees & we would like to see others build such levels of annotations on top of the trees even before we do (if we do).
SBLGNT states that it "capitalizes (1) the first word of a paragraph; (2) the first word of direct speech; and (3) proper nouns." Even though the formatted version of SBLGNT sometimes appears to indent a new paragraph in the middle of a sentence at the beginning of some direct speech, the overall heuristic of making use of sentence-initial capitals is useful.
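A small sketch of that heuristic, assuming the text has already been split into sentences. Since capitals also mark direct speech and proper nouns, the result is only a list of candidates to review, not a list of confirmed paragraph starts. The function names and sample strings are illustrative.

```python
def first_letter(text):
    """Return the first alphabetic character, skipping quotes, brackets, etc."""
    for ch in text:
        if ch.isalpha():
            return ch
    return None

def candidate_paragraph_starts(sentences):
    """Indexes of sentences whose first letter is a capital."""
    candidates = []
    for i, sentence in enumerate(sentences):
        letter = first_letter(sentence)
        if letter is not None and letter.isupper():
            candidates.append(i)
    return candidates

# Illustrative sentence openings only.
SAMPLE_SENTENCES = ["Ὅτι αὕτη", "καὶ οὐ", "[Ἀγαπητοί", "μὴ θαυμάζετε"]
candidates = candidate_paragraph_starts(SAMPLE_SENTENCES)
```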
For my own purposes, I just added a
I'd actually love to have paragraph (and pericope) boundaries in MorphGNT too, even if just described out-of-band. I guess that can be fairly easily generated from the original SBLGNT files, and I don't have to worry about concurrent or overlapping markup either.
I suspect paragraph and pericope boundaries could be done as inline milestones without much effect on processing. It's not a lot of nodes, especially compared to the number of nodes used for other things.
The real arguments for doing it out-of-band are (a) supporting multiple interpretations, and (b) allowing work on these boundaries to be done independently of work on the trees per se. Then again, you could argue that syntax trees should be done out-of-band on the same grounds. Some of this is a matter of taste.
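To make point (a) concrete, here is a sketch of out-of-band paragraph layers, assuming every word in the trees has a stable ID. Each layer is just a list of the word IDs that begin a paragraph, so several interpretations can coexist without touching the trees. The IDs and layer names are hypothetical.

```python
def group_into_paragraphs(word_ids, paragraph_starts):
    """Split an ordered list of word IDs into paragraphs at the given start IDs."""
    starts = set(paragraph_starts)
    paragraphs = []
    for word_id in word_ids:
        # Open a new paragraph at each marked start, and for the very
        # first word even if the layer does not mark it.
        if word_id in starts or not paragraphs:
            paragraphs.append([])
        paragraphs[-1].append(word_id)
    return paragraphs

WORD_IDS = ["w1", "w2", "w3", "w4", "w5", "w6"]

# Two independent interpretations stored apart from the trees.
LAYERS = {
    "edition_a": ["w1", "w4"],
    "edition_b": ["w1", "w3", "w5"],
}

by_layer = {name: group_into_paragraphs(WORD_IDS, starts)
            for name, starts in LAYERS.items()}
```

Because a layer only references word IDs, it can be edited or replaced independently of the trees, which is point (b).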
See https://github.com/biblicalhumanities/Nestle1904/issues/5 and my comment there.
I will note in response to https://github.com/biblicalhumanities/greek-new-testament/issues/5#issuecomment-78584630 that SBLGNT is slightly inconsistent with capitalization in relation to paragraph breaks.
I've done some work on this - see https://github.com/biblicalhumanities/Nestle1904/issues/5.
SBLGNT is available in XML with paragraphs. See their download page.
Yes, that's what morphgnt API uses.
Nestle 1904 is also available in XML with paragraphs now:
https://github.com/biblicalhumanities/Nestle1904
I plan to maintain the paragraphs there as the upstream. In a distribution, I may well merge them into the treebanks, but I'm not sure whether that is a good idea.
I'm curious if there is any interest in reflecting intermediate discourse organization between the "sentence" and "book" level, namely at the paragraph level as the SBLGNT marks it?
Such information could be useful for searches along the lines of "Find a paragraph/pericope which contains XYZ", where XYZ may be certain vocabulary characteristics, grammatical constructions, etc.
Just a suggestion. Or if such data could be obtained another way, I'd be happy to hear about that as well.
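As an illustration of the kind of search meant here, a minimal sketch assuming each paragraph is available as a list of lemmas. The data layout, the function name, and the sample lemmas are hypothetical; a real query would run against the treebank or MorphGNT data.

```python
def find_paragraphs(paragraphs, required_lemmas):
    """Indexes of paragraphs that contain every one of the required lemmas."""
    required = set(required_lemmas)
    return [i for i, lemmas in enumerate(paragraphs)
            if required <= set(lemmas)]

# Illustrative data: one list of lemmas per paragraph.
PARAGRAPHS = [
    ["ἀγαπάω", "ἀδελφός", "θεός"],
    ["κόσμος", "μισέω"],
    ["ἀγαπάω", "θεός", "γινώσκω"],
]

QUERY = [PARAGRAPHS[0][0], PARAGRAPHS[0][2]]  # ἀγαπάω and θεός
matches = find_paragraphs(PARAGRAPHS, QUERY)
```

The same shape would work for grammatical features instead of lemmas, as long as each paragraph can be reduced to a set of searchable values.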