(urn:cts:greekLit:tlg0544.tlg001) Sextus Empiricus translation ingestion

lcerrato commented 4 months ago

adding contribution from @WDRoush

lcerrato commented 4 months ago

@AlisonBabeu What should the URN be?

lcerrato commented 4 months ago

tlg0544.tlg001.1st1K-eng1 suggested

lcerrato commented 4 months ago

@WDRoush @gregorycrane

You want to get started with a header. A template is linked in the wiki.

Note: I updated the above wiki page with a better template.

[ ] assign URN to file
[ ] select default structure
[ ] complete header
[ ] refine markup (see below TBD)
[ ] request metadata creation
[ ] check validation

lcerrato commented 4 months ago

@WDRoush

Preliminary recommendations.

[ ] Look for spacing around hyphens, hyphens that should be em dashes
[ ] Straight quotes should be curly/smart quotes. (Be careful not to change those within tags)
[ ] Add resp attribute to notes <note resp="Loeb" ... >
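
As a rough illustration of the last two items (the note text is only a placeholder, and any other attributes your notes already carry are left out): the quotes that delimit an attribute value stay straight, and only quotes in the running text become curly.

```xml
<!-- Straight quotes delimit the attribute value; curly quotes appear only in the text. -->
<note resp="Loeb">... “...” ...</note>
```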

Q: Were print page numbers captured? I don't see them. Consider a line wrap within paragraphs for ease of reading.

lcerrato commented 4 months ago

Address silent changes and best practices.

WDRoush commented 4 months ago

Got these three things done, and got some headway on the header, but I might need some help completing that. I put the updated version on Box: https://tufts.box.com/s/swd6ynxe8hxjy34p1vtgyonf74a5bppd

A: I removed the print pages, but I can add them back if that is preferred. I will work on line wrapping as I work on other best practices.

lcerrato commented 4 months ago

@WDRoush
The new file has been uploaded. As the work is large, you may not find it worthwhile to add back print page notations at this phase. I have not made any changes.

You can see the differences in your versions now in the pull request: https://github.com/OpenGreekAndLatin/First1KGreek/pull/2792/commits/ee51201187b1994e3dd88ad4b38e2c0344cea28d

One suggestion: You may want to think about working on a fork of this repository to easily push your changes. (Another option is to attach files within this issue.) Depending on your project goals, that might be worth it in the long run. @msaxton @AlisonBabeu might have good pointers to some doc to get you started on that aspect of using GitHub if you are interested.

lcerrato commented 3 months ago

@WDRoush Have you had a chance to look at an existing header or the example template? I find it easier to start from the header of a file that already passes than to recreate one from scratch, since recreating inevitably leaves out information.

There are a few details about the file that we need to fill in (like a checklist) and using an existing header is easiest.

lcerrato commented 3 months ago

@WDRoush
<title xml:lang="eng">Outlines of Pyrrhonism</title>
The language attribute is not needed here, as the header is already tagged as English.
<author xml:lang="eng">Sextus Empiricus</author>
Conversely, the language attribute is wrong here, as this is a Latin name.

<funder> Did Tisch Library provide direct funding for the work?

The responsibility statement is boilerplate and should read:

                    <respStmt>
                    <resp>Published original versions of the electronic texts</resp>
                    <orgName ref="https://www.opengreekandlatin.org">Open Greek and Latin</orgName>
                    <persName role="principal">Gregory Crane</persName>
                    <persName role="principal">Leonard Muellner</persName>
                    <persName role="principal">Bruce Robertson</persName>
                    </respStmt>

Following that, you must indicate responsibility to individuals as follows:

                       <respStmt>
                        <persName>William Roush</persName>
                        <orgName>Tufts University</orgName>
                        <resp>Digital conversion and editing</resp>
                    </respStmt>

        <respStmt>
          <persName>Lisa Cerrato</persName>
          <orgName>Perseus Digital Library</orgName>
          <resp>Digital editor</resp>
        </respStmt>

No one else (Crane, Saxton, Babeu) would typically require credit here, as we did not do the digital editing.

lcerrato commented 3 months ago

The publication statement and source description are mandatory.

                   <publicationStmt>
                    <publisher>Trustees of Tufts University</publisher>
                    <publisher>Open Greek and Latin</publisher>
                    <pubPlace>Medford, MA</pubPlace>
                    <authority>Perseus Digital Library</authority>
                    <date when="2024-09-01"/>
                    <idno type="filename">tlg0544.tlg001.1st1K-eng1.xml</idno>
                    <availability>
                  <licence target="https://creativecommons.org/licenses/by-sa/4.0/">Available under a Creative Commons Attribution-ShareAlike 4.0 International License</licence>
                  </availability>
                </publicationStmt>

lcerrato commented 3 months ago

Source description. Note you should also indicate this is Volume 1 only in the title and notes.

        <sourceDesc>
                <biblStruct>
                    <monogr>
                        <author xml:lang="lat">Sextus Empiricus</author>
                        <title>Outlines of Pyrrhonism</title>
                        <editor role="translator">Robert Gregg Bury</editor>                    
                        <imprint>
                            <pubPlace>London</pubPlace>
                            <publisher>William Heinemann Ltd.</publisher>
                            <pubPlace>New York</pubPlace>
                            <publisher>G. P. Putnam's Sons</publisher>
                            <date type="printing">1933</date>
                        </imprint>
                        <biblScope unit="volume">1</biblScope>
                    </monogr> 
                    <series>
                        <title>Loeb Classical Library</title>
                    </series>
                    <ref target="https://archive.org/details/in.ernet.dli.2015.183761/page/2/mode/2up">Internet Archive</ref>
                </biblStruct>
            </sourceDesc>

lcerrato commented 3 months ago

Also needed: editorial notes, a references declaration, a profile description with language usage, and a change log.

<encodingDesc>
            <editorialDecl><p>Volume 1 only.</p></editorialDecl>

            <refsDecl n="CTS">
            <cRefPattern matchPattern="(\w+).(\w+).(\w+)" n="section" replacementPattern="#xpath(/tei:TEI/tei:text/tei:body/tei:div/tei:div[@n='$1']/tei:div[@n='$2']/tei:div[@n='$3'])">
            <p>This pointer pattern extracts book, chapter, and section.</p></cRefPattern>
                <cRefPattern matchPattern="(\w+).(\w+)" n="chapter" replacementPattern="#xpath(/tei:TEI/tei:text/tei:body/tei:div/tei:div[@n='$1']/tei:div[@n='$2'])">
                    <p>This pointer pattern extracts book and chapter.</p></cRefPattern>
                <cRefPattern matchPattern="(\w+)" n="book" replacementPattern="#xpath(/tei:TEI/tei:text/tei:body/tei:div/tei:div[@n='$1'])">
                    <p>This pointer pattern extracts book.</p></cRefPattern>
            </refsDecl>
        </encodingDesc>

Note: you appear to have "la" for Latin. Three-letter codes are suggested, so this should be "lat". You can't include foreign languages in an XML file without telling the machine what they are and how to process them. So:

        <profileDesc>
            <langUsage>
                <language ident="grc">Greek</language>
                <language ident="lat">Latin</language>
            </langUsage>
        </profileDesc>
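
Within the body, the same three-letter codes would then go on the foreign-language spans themselves. A minimal sketch, assuming the Latin and Greek are wrapped in <foreign> (the element choice is an assumption; keep whatever element your file already uses and change only the code):

```xml
<!-- "lat" and "grc" match the idents declared in <langUsage> -->
<p>... down to the <foreign xml:lang="lat">infimae species</foreign> ...</p>
<p>... <foreign xml:lang="grc">...</foreign> ...</p>
```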

Please outline what you did with the file. (OCR? proofreading? markup? etc.)

        <revisionDesc>
            <change when="2024-08" who="Lisa Cerrato">Header review, markup review, CTS and EpiDoc review for compliance.</change>
            <change when="2024-08" who="William Roush">???</change>
        </revisionDesc>

lcerrato commented 3 months ago

Missing structure.

You have no <body> tag and no top level <div>:

<text>
<body>
<div type="translation" n="urn:cts:greekLit:tlg0544.tlg001.1st1K-eng1" xml:lang="eng">

[ ] Subtypes are incorrect "Book" must be "book" and "Chapter" must be "chapter" (consistency matters with the labels)
[ ] Latin is mistagged as "la"
[ ] Silent edits? Where and what are they? I think you mentioned this was left undone. I can recommend tagging if I know the details.
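
For orientation, the overall nesting would look roughly like this once the opening tags above are in place. This is a sketch only: the text content is elided, and the Book-Chapter-Section layering is inferred from the refsDecl patterns in the header.

```xml
<text>
  <body>
    <div type="translation" n="urn:cts:greekLit:tlg0544.tlg001.1st1K-eng1" xml:lang="eng">
      <div type="textpart" subtype="book" n="1">
        <div type="textpart" subtype="chapter" n="1">
          <div type="textpart" subtype="section" n="1">
            <p>...</p>
          </div>
          <!-- further sections of chapter 1 -->
        </div>
        <!-- further chapters of book 1 -->
      </div>
      <!-- further books -->
    </div>
  </body>
</text>
```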

lcerrato commented 3 months ago

The file is throwing errors in validation: unclosed <div> and <p> tags. This is where an XML editor is valuable: it shows you these issues.

Here, for example:

<div type="textpart" subtype="Book" n="1">
<div type="textpart" subtype="Chapter" n="1">
<head>Chapter I.—Of The Main Difference Between Philosophic Systems</head>
<div type="textpart" subtype="section" n="1"><p>
The natural result of any investigation is that the
investigators either discover the object of search or
deny that it is discoverable and confess it to be
inapprehensible or persist in their search.</div>
<div type="textpart" subtype="section" n="2">
So, too, with regard to the objects investigated by 
philosophy, this is probably why some have claimed to 
have discovered the truth, others have asserted that it 
cannot be apprehended, while others again go on inquiring.</p></div>

After "search." the <p> is not closed before the </div>. Then, after a new section is started at "So, too," no new <p> is opened.

I think this is probably a text-wide issue of missing tags.
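
One way to make this passage well-formed, assuming each section simply gets its own <p>. The heading and the rest of the chapter are elided, and the subtypes are written in lowercase; treat it as a sketch rather than the required markup.

```xml
<div type="textpart" subtype="book" n="1">
  <div type="textpart" subtype="chapter" n="1">
    <head>...</head>
    <div type="textpart" subtype="section" n="1">
      <p>The natural result of any investigation is that the
        investigators either discover the object of search or
        deny that it is discoverable and confess it to be
        inapprehensible or persist in their search.</p>
    </div>
    <div type="textpart" subtype="section" n="2">
      <p>So, too, with regard to the objects investigated by
        philosophy, this is probably why some have claimed to
        have discovered the truth, others have asserted that it
        cannot be apprehended, while others again go on inquiring.</p>
    </div>
    <!-- remaining sections of the chapter -->
  </div>
  <!-- remaining chapters of the book -->
</div>
```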

WDRoush commented 3 months ago

New Link: https://tufts.box.com/s/1naat149yb5a5mo5m08r1u4n45hq8u4m

Fixed subtypes, Fixed Lat. tags.

Re: Silent edits. There are two.
1: Note at line 283 of xml, or section 15 of Book 1. Since I removed the page numbers, I changed “Cf. p. 30 note a” to “Cf. note a in §§48.”
2: Note at line 1646 of xml, or section 138 of Book I. I changed “down to the infimae species (e.g. “Negroes’’)” to “down to the infimae species (e.g. ‟Golden Retriever”).”

WDRoush commented 3 months ago

There seems to be a gap in my knowledge here. As I am using the tags in the doc, "p" denotes "new paragraph," and "div" denotes new sections following the Greek divisions. However, the paragraphs often span several sections, so I did not close a paragraph with </p> until the print paragraph was over. Do I perhaps need to choose different tags for those things?

WDRoush commented 3 months ago

Yes, that's correct, the Tisch Library directly funded the project.

lcerrato commented 3 months ago

On paragraphs that span several sections: this is a general XML requirement. Elements must nest properly, so everything opened inside a <div> has to be closed before that <div> closes, and nothing can span divs. (Even a long-running quote, such as a speech or conversation that spans several sections or chapters, has to be closed, and then we indicate that there is a continuation.)

A <p> denotes a container of text but does not have to match a print paragraph. There are different ways of representing blocks of text but for Perseus purposes a <p> tag is a basic prose container. One can create xml with much more nuanced containers.

A <p> is always within something else, and it must be closed.

In Perseus, we mark where a print paragraph begins with a new <p> tag that has an indentation attribute. Note that not all Perseus texts have indentation tagged. If the indentation attribute (rend="align(indent)") is omitted, there is no indication of where print paragraphs start.

In the following example, a new print paragraph begins at the start of section 5, at the start of section 7, and again within section 7. No other print paragraphs begin in this range. Section 8 also contains a block quote.

<div type="textpart" subtype="section" xml:base="..." n="5">
<p rend="align(indent)">...</p>
</div>
<div type="textpart" subtype="section" xml:base="..." n="6">
<p>...</p>
</div>
<div type="textpart" subtype="section" xml:base="..." n="7">
<p rend="align(indent)">...</p>
<p rend="align(indent)">...</p>   
</div>
<div type="textpart" subtype="section" xml:base="..." n="8">
<p>...<quote rend="blockquote">...</quote></p>
</div>

WDRoush commented 3 months ago

https://tufts.box.com/s/1naat149yb5a5mo5m08r1u4n45hq8u4m

I went through and added paragraph markers (using the attribute you suggested), and fixed the validation errors. I used TEI's validator, so things should be good there.

lcerrato commented 2 months ago

@WDRoush Unfortunately, the work is failing the tests due to duplicate nodes.

The start of Book 3 appears incorrect.

Structure reads:
Book 3
Section 1
Chapter 1 (not nested)
Chapter 2 (not nested)
Chapter 3 (not nested)
Section 2-12 (nested in Chapter 3)
Chapter 4 (not nested)
Section 13-16 (nested)
etc.

This results in a duplicate 3.1 container. You also have Book-Section-Chapter rather than Book-Chapter-Section.
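
To illustrate why the tests report a duplicate (a schematic reconstruction from the outline above, not copied from the file): a section and a chapter that are both direct children of Book 3 and both carry n="1" resolve to the same reference, 3.1.

```xml
<div type="textpart" subtype="book" n="3">
  <!-- Section sitting directly under the book: resolves to 3.1 -->
  <div type="textpart" subtype="section" n="1">
    <p>...</p>
  </div>
  <!-- Chapter also directly under the book: resolves to 3.1 again -->
  <div type="textpart" subtype="chapter" n="1">
    <head>...</head>
  </div>
  <!-- remaining chapters -->
</div>
```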

lcerrato commented 2 months ago

@WDRoush I noticed paragraph alignment for <p> tags in the header. These were removed.

lcerrato commented 2 months ago

Fixed header info that had volume limitations.

WDRoush commented 2 months ago

This is a tricky section. In the print edition, there is text which appears in Bk. III before the chapters begin, and the section spans two chapters after that. See here: https://dl.tufts.edu/pdfviewer/mp48st66j/7h14b4217. Is there a better way to represent that given our container structures?

lcerrato commented 2 months ago

@WDRoush

Yes, this is the result of the non-standard chapters that Loeb imposes here. Where the other versions I see use only Book/Section, this does not present an issue.

It suggests that Book-Section-Chapter, or just Book-Section, is the better hierarchy. But since we are where we are, you can either put the text into 3.1.1 and note that it falls outside of the Chapter in the print edition, or create chapter 0 and section 0.

The latter is what we would typically do (create 3.0.0) but it will not align with other versions of the work in this spot as the Greek will say that 3.0 does not exist. https://scaife.perseus.org/reader/urn:cts:greekLit:tlg0544.tlg001.1st1K-grc1:3.1

If I were adding the Loeb Greek, that would be fine, as there would be another alignment. In this case, I would move the line into 3.1.1 with a note to keep your hierarchy intact.
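
A sketch of the 3.1.1 option, with assumptions flagged: the wording of the note is only illustrative, the resp value is left open, and the text itself is elided.

```xml
<div type="textpart" subtype="book" n="3">
  <div type="textpart" subtype="chapter" n="1">
    <div type="textpart" subtype="section" n="1">
      <!-- The note records that the print edition places this text before the first chapter. -->
      <p><note resp="...">In the print edition this text precedes the first chapter heading.</note>...</p>
    </div>
    <!-- remaining sections of chapter 1 -->
  </div>
  <!-- remaining chapters -->
</div>
```

This keeps every reference three levels deep, so 3.1.1 still resolves and no 3.0 container is introduced.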

WDRoush commented 2 months ago

Update: https://tufts.box.com/s/1naat149yb5a5mo5m08r1u4n45hq8u4m

I changed the structure as you suggested and added a note (see line 6000). I put resp. attribute as "William Roush". Is that alright?

lcerrato commented 2 months ago

@WDRoush I see the changes. I moved the note to the start of the section, rather than the end and will test again shortly. Chapter 2 still requires a section. (You cannot have a Chapter without a Section or the Section will not display in the reader).

Some of the prior edits I made and noted were not in your version. Always take the last edited version from here so that the fixes can be integrated. (This is why we use GitHub.)

I also see entities in here (&lt; and &gt;); ideally these would be real Unicode. We never include entities in a finished text.

In general, in a Loeb, square brackets signify an editorial deletion, so we use <del>; angle brackets (less than / greater than) signify an addition, so that is <add>.
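
Schematically, with the text elided (a sketch of the convention just described, not a passage from the file):

```xml
<!-- print: ... [ ... ] ...   becomes an editorial deletion -->
<p>... <del>...</del> ...</p>
<!-- print: ... < ... > ...   becomes an editorial addition -->
<p>... <add>...</add> ...</p>
```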

I made changes to the title (I removed those Book 1 notations as errors that I made) and fixed the square and pointed brackets issue.

lcerrato commented 2 months ago

@WDRoush I am still seeing structural issues here. The Greek tells me there are 784 sections whereas the English is showing only 532.

tlg0544.tlg001.1st1K-eng1.xml | 69,191 | 3;57;532
tlg0544.tlg001.1st1K-grc1.xml | 52,709 | 3;784

lcerrato commented 2 months ago

@WDRoush There were missing tags in the final book that prevented the entire hierarchy from being detected. I believe this has been resolved.

WDRoush commented 2 months ago

Thank you for fixing that!

lcerrato commented 2 months ago

Current output:
tlg0544.tlg001.1st1K-eng1.xml | 69,191 | 3;88;782
tlg0544.tlg001.1st1K-grc1.xml | 52,709 | 3;784

OpenGreekAndLatin / First1KGreek

(urn:cts:greekLit:tlg0544.tlg001) Sextus Empiricus translation ingestion #2791