OpenGreekAndLatin / First1KGreek

XML files for the works in the First Thousand Years of Greek Project. Please see our Wiki on how to contribute.
https://opengreekandlatin.github.io/First1KGreek/
Creative Commons Attribution Share Alike 4.0 International

(urn:cts:greekLit:tlg0544.tlg001) Sextus Empiricus translation ingestion #2791

Open lcerrato opened 1 month ago

lcerrato commented 1 month ago

adding contribution from @WDRoush

lcerrato commented 1 month ago

@AlisonBabeu What should the URN be?

lcerrato commented 1 month ago

tlg0544.tlg001.1st1K-eng1 suggested

lcerrato commented 1 month ago

@WDRoush @gregorycrane

You want to get started with a header. A template is linked in the wiki.

Note: I updated the above wiki page with a better template.

lcerrato commented 1 month ago

@WDRoush

Preliminary recommendations.

Q: Were print page numbers captured? I don't see them. Consider a line wrap within paragraphs for ease of reading.

lcerrato commented 1 month ago

Address silent changes and best practices.

WDRoush commented 1 month ago

@WDRoush

Preliminary recommendations.

  • [ ] Look for spacing around hyphens, hyphens that should be em dashes
  • [ ] Straight quotes should be curly/smart quotes. (Be careful not to change those within tags)
  • [ ] Add resp attribute to notes <note resp="Loeb" ... >

Q: Were print page numbers captured? I don't see them. Consider a line wrap within paragraphs for ease of reading.

Got these three things done, and got some headway on the header, but I might need some help completing that. I put the updated version on Box: https://tufts.box.com/s/swd6ynxe8hxjy34p1vtgyonf74a5bppd

A: I removed the print pages, but I can add them back if that is preferred. I will work on line wrapping as I work on other best practices.
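
The straight-quote item in the checklist above can be sketched roughly as follows. This is only an illustration, not the script used for this file: the helper name curl_quotes is made up, it assumes quotes alternate open/close, and it assumes no attribute value contains a ">" character. Tags are split out first so that quotes inside them (such as resp="Loeb") are left alone.

```python
import re

# Anything between "<" and ">" is treated as a tag and left untouched.
TAG = re.compile(r"(<[^>]+>)")


def curl_quotes(xml_text: str) -> str:
    """Replace straight double quotes with curly ones in text only.

    Naive assumption: quotes in the running text alternate
    between opening and closing.
    """
    out = []
    open_next = True  # the next straight quote seen opens a quotation
    for part in TAG.split(xml_text):
        if part.startswith("<") and part.endswith(">"):
            out.append(part)  # a tag: keep attribute quotes straight
            continue
        chars = []
        for ch in part:
            if ch == '"':
                chars.append("\u201c" if open_next else "\u201d")
                open_next = not open_next
            else:
                chars.append(ch)
        out.append("".join(chars))
    return "".join(out)


sample = '<note resp="Loeb">He said "yes" today.</note>'
print(curl_quotes(sample))
```

Apostrophes and single quotes are not handled here, so the result would still need proofreading.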

lcerrato commented 1 month ago

@WDRoush
The new file has been uploaded. As the work is large, you may not find it worthwhile to add back print page notations at this phase. I have not made any changes.

You can see the differences in your versions now in the pull request: https://github.com/OpenGreekAndLatin/First1KGreek/pull/2792/commits/ee51201187b1994e3dd88ad4b38e2c0344cea28d

One suggestion: You may want to think about working on a fork of this repository to easily push your changes. (Another option is to attach files within this issue.) Depending on your project goals, that might be worth it in the long run. @msaxton @AlisonBabeu might have good pointers to some doc to get you started on that aspect of using GitHub if you are interested.

lcerrato commented 2 weeks ago

@WDRoush Have you had a chance to look at an existing header or the example template? I find it easier to use a passing file rather than trying to recreate, as this inevitably results in missing info.

There are a few details about the file that we need to fill in (like a checklist) and using an existing header is easiest.

lcerrato commented 2 weeks ago

@WDRoush

<title xml:lang="eng">Outlines of Pyrrhonism</title>
The language attribute is not needed here as the header is already tagged as English.

<author xml:lang="eng">Sextus Empiricus</author>
Conversely, the language attribute is wrong here, as this is a Latin name.

<funder> Did Tisch Library provide direct funding for the work?

The responsibility statement is boilerplate and should read:

                    <respStmt>
                        <resp>Published original versions of the electronic texts</resp>
                        <orgName ref="https://www.opengreekandlatin.org">Open Greek and Latin</orgName>
                        <persName role="principal">Gregory Crane</persName>
                        <persName role="principal">Leonard Muellner</persName>
                        <persName role="principal">Bruce Robertson</persName>
                    </respStmt>

Following that, you must indicate responsibility to individuals as follows:

                    <respStmt>
                        <persName>William Roush</persName>
                        <orgName>Tufts University</orgName>
                        <resp>Digital conversion and editing</resp>
                    </respStmt>

                    <respStmt>
                        <persName>Lisa Cerrato</persName>
                        <orgName>Perseus Digital Library</orgName>
                        <resp>Digital editor</resp>
                    </respStmt>

No one else (Crane, Saxon, Babeu) would typically require credit here as we did not do the digital editing.

lcerrato commented 2 weeks ago

The publication statement and source description are mandatory.

                <publicationStmt>
                    <publisher>Trustees of Tufts University</publisher>
                    <publisher>Open Greek and Latin</publisher>
                    <pubPlace>Medford, MA</pubPlace>
                    <authority>Perseus Digital Library</authority>
                    <date when="2024-09-01"/>
                    <idno type="filename">tlg0544.tlg001.1st1K-eng1.xml</idno>
                    <availability>
                        <licence target="https://creativecommons.org/licenses/by-sa/4.0/">Available under a Creative Commons Attribution-ShareAlike 4.0 International License</licence>
                    </availability>
                </publicationStmt>

lcerrato commented 2 weeks ago

Source description. Note that you should also indicate, in the title and notes, that this is Volume 1 only.

            <sourceDesc>
                <biblStruct>
                    <monogr>
                        <author xml:lang="lat">Sextus Empiricus</author>
                        <title>Outlines of Pyrrhonism</title>
                        <editor role="translator">Robert Gregg Bury</editor>
                        <imprint>
                            <pubPlace>London</pubPlace>
                            <publisher>William Heinemann Ltd.</publisher>
                            <pubPlace>New York</pubPlace>
                            <publisher>G. P. Putnam's Sons</publisher>
                            <date type="printing">1933</date>
                        </imprint>
                        <biblScope unit="volume">1</biblScope>
                    </monogr>
                    <series>
                        <title>Loeb Classical Library</title>
                    </series>
                    <ref target="https://archive.org/details/in.ernet.dli.2015.183761/page/2/mode/2up">Internet Archive</ref>
                </biblStruct>
            </sourceDesc>

lcerrato commented 2 weeks ago

Also needed: editorial notes, a references declaration, a profile description with language usage, and a change log.

        <encodingDesc>
            <editorialDecl><p>Volume 1 only.</p></editorialDecl>

            <refsDecl n="CTS">
                <cRefPattern matchPattern="(\w+).(\w+).(\w+)" n="section" replacementPattern="#xpath(/tei:TEI/tei:text/tei:body/tei:div/tei:div[@n='$1']/tei:div[@n='$2']/tei:div[@n='$3'])">
                    <p>This pointer pattern extracts book, chapter, and section.</p></cRefPattern>
                <cRefPattern matchPattern="(\w+).(\w+)" n="chapter" replacementPattern="#xpath(/tei:TEI/tei:text/tei:body/tei:div/tei:div[@n='$1']/tei:div[@n='$2'])">
                    <p>This pointer pattern extracts book and chapter.</p></cRefPattern>
                <cRefPattern matchPattern="(\w+)" n="book" replacementPattern="#xpath(/tei:TEI/tei:text/tei:body/tei:div/tei:div[@n='$1'])">
                    <p>This pointer pattern extracts book.</p></cRefPattern>
            </refsDecl>
        </encodingDesc>
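
For anyone unsure what the three cRefPattern entries do, here is a rough Python sketch of the idea: each matchPattern captures the parts of a reference such as 1.2.3, and the captured values are substituted for $1, $2, $3 in the XPath. The function name resolve is made up for illustration, the #xpath(...) wrapper is dropped, and Python's re module stands in for whatever regex engine the real tooling uses, so treat this as an approximation only.

```python
import re

BASE = "/tei:TEI/tei:text/tei:body/tei:div"

# (matchPattern, XPath from replacementPattern), most specific first,
# in the same order as the refsDecl above.
PATTERNS = [
    (r"(\w+).(\w+).(\w+)",
     BASE + "/tei:div[@n='$1']/tei:div[@n='$2']/tei:div[@n='$3']"),
    (r"(\w+).(\w+)",
     BASE + "/tei:div[@n='$1']/tei:div[@n='$2']"),
    (r"(\w+)",
     BASE + "/tei:div[@n='$1']"),
]


def resolve(ref):
    """Return the XPath for a reference like '1.2.3', or None."""
    for match_pattern, xpath in PATTERNS:
        m = re.fullmatch(match_pattern, ref)
        if m is None:
            continue
        # Substitute each captured group for its $N placeholder.
        for i, value in enumerate(m.groups(), start=1):
            xpath = xpath.replace(f"${i}", value)
        return xpath
    return None


print(resolve("1.2.3"))  # book 1, chapter 2, section 3
print(resolve("3"))      # book 3 only
```

One thing this sketch makes visible: the dots in the matchPattern values are unescaped, so they match any character. In this sketch resolve("123") is read as book 1, chapter 3. Whether that matters in practice depends on the tooling that consumes the refsDecl.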

Note: you appear to have "la" for Latin. Three-letter codes are suggested, so this should be "lat". Any foreign-language text in an XML file has to be tagged so that the machine knows what the language is and how to process it. So:

        <profileDesc>
            <langUsage>
                <language ident="grc">Greek</language>
                <language ident="lat">Latin</language>
            </langUsage>
        </profileDesc>
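
A quick way to catch a stray "la" is to compare the xml:lang values used in the text against the codes declared in langUsage. The sketch below is only an illustration: it assumes foreign words are wrapped in <foreign> elements (not confirmed for this file), it uses a tiny made-up sample without the TEI namespace, and the function name undeclared_languages is hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Made-up miniature file; real files also carry the TEI namespace.
SAMPLE = """<TEI>
<teiHeader><profileDesc><langUsage>
<language ident="grc">Greek</language>
<language ident="lat">Latin</language>
</langUsage></profileDesc></teiHeader>
<text><body><div xml:lang="eng">
<p>down to the <foreign xml:lang="la">infimae species</foreign>
and <foreign xml:lang="grc">...</foreign></p>
</div></body></text>
</TEI>"""


def undeclared_languages(xml_text):
    """List language codes used on <foreign> but missing from langUsage."""
    root = ET.fromstring(xml_text)
    declared = {lang.get("ident") for lang in root.iter("language")}
    used = {el.get(XML_LANG) for el in root.iter("foreign")
            if el.get(XML_LANG)}
    return sorted(used - declared)


print(undeclared_languages(SAMPLE))
```

With the sample above, the mistagged "la" is the only code reported, because "grc" is declared and "lat" is what the header expects.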

Please outline what you did with the file. (OCR? proofreading? markup? etc.)

        <revisionDesc>
            <change when="2024-08" who="Lisa Cerrato">Header review, markup review, CTS and EpiDoc review for compliance.</change>
            <change when="2024-08" who="William Roush">???</change>
        </revisionDesc>

lcerrato commented 2 weeks ago

Missing structure.

You have no <body> tag and no top level <div>:

<text>
<body>
<div type="translation" n="urn:cts:greekLit:tlg0544.tlg001.1st1K-eng1" xml:lang="eng">

lcerrato commented 2 weeks ago

The file is throwing errors in validation: unclosed <div> and <p> tags. This is where an XML editor is valuable, because it shows you these issues as you work.

Here, for example:

<div type="textpart" subtype="Book" n="1">
<div type="textpart" subtype="Chapter" n="1">
<head>Chapter I.—Of The Main Difference Between Philosophic Systems</head>
<div type="textpart" subtype="section" n="1"><p>
The natural result of any investigation is that the
investigators either discover the object of search or
deny that it is discoverable and confess it to be
inapprehensible or persist in their search.</div>
<div type="textpart" subtype="section" n="2">
So, too, with regard to the objects investigated by 
philosophy, this is probably why some have claimed to 
have discovered the truth, others have asserted that it 
cannot be apprehended, while others again go on inquiring.</p></div>

After "search." the <p> is not closed before the section's closing </div>. Then, after a new section is started at "So, too," no new <p> is opened, yet a closing </p> appears at the end of that section.

I think this is probably a text-wide issue of missing tags.
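
If you do not have an XML editor to hand, a few lines of Python can at least confirm whether a fragment is well-formed. This is a minimal sketch using the standard library only; it checks well-formedness, not TEI or EpiDoc validity, and the fragments below are shortened versions of the example above with a wrapping chapter <div> added so each has a single root. The helper name well_formed is made up.

```python
import xml.etree.ElementTree as ET

# The <p> opened in section 1 is still open when </div> arrives.
BROKEN = """<div type="textpart" subtype="chapter" n="1">
<div type="textpart" subtype="section" n="1"><p>
inapprehensible or persist in their search.</div>
<div type="textpart" subtype="section" n="2">
cannot be apprehended, while others again go on inquiring.</p></div>
</div>"""

# Each section opens and closes its own <p>.
FIXED = """<div type="textpart" subtype="chapter" n="1">
<div type="textpart" subtype="section" n="1"><p>
inapprehensible or persist in their search.</p></div>
<div type="textpart" subtype="section" n="2"><p>
cannot be apprehended, while others again go on inquiring.</p></div>
</div>"""


def well_formed(fragment):
    """Return (True, '') if the fragment parses, else (False, message)."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError as err:
        return False, str(err)
    return True, ""


for name, fragment in (("broken", BROKEN), ("fixed", FIXED)):
    ok, message = well_formed(fragment)
    print(name, "OK" if ok else "not well-formed: " + message)
```

The parser stops at the first problem it finds, so a file with many missing tags needs to be re-checked after each fix.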

WDRoush commented 1 week ago

  • [ ] Subtypes are incorrect: "Book" must be "book" and "Chapter" must be "chapter" (consistency matters with the labels)
  • [ ] Latin is mistagged as "la"
  • [ ] Silent edits? Where and what are they? I think you mentioned this was left undone. I can recommend tagging if I know the details.

New Link: https://tufts.box.com/s/1naat149yb5a5mo5m08r1u4n45hq8u4m

Fixed the subtypes and the Latin tags.

Re: silent edits. There are two:

  1. Note at line 283 of the XML (Book 1, section 15): since I removed the page numbers, I changed “Cf. p. 30 note a” to “Cf. note a in §§48.”
  2. Note at line 1646 of the XML (Book 1, section 138): I changed “down to the infimae species (e.g. “Negroes’’)” to “down to the infimae species (e.g. ‟Golden Retriever”).”

WDRoush commented 1 week ago

There seems to be a gap in my knowledge here. As I am using the tags in the doc, "p" denotes "new paragraph," and "div" denotes new sections following the Greek divisions. However, the paragraphs often span several sections, so I would not end the paragraph /p until the paragraph was over. Do I perhaps need to choose new tags for those things?

WDRoush commented 1 week ago

Yes, that's correct, the Tisch Library directly funded the project.

lcerrato commented 1 week ago

This is a general XML requirement: elements must nest properly, so everything opened inside a <div> must be closed before that <div> closes. Nothing can span divs. (Even where you might have a long-running quote, such as a speech or conversation that spans several sections or chapters, it has to be closed, and then we indicate that there is a continuation.)

A <p> denotes a container of text but does not have to match a print paragraph. There are different ways of representing blocks of text, but for Perseus purposes a <p> tag is a basic prose container. One can create XML with much more nuanced containers.

A <p> always sits inside another element, and it must be closed before that element closes.

In Perseus, we mark where the print paragraph begins or ends with a new <p> tag that has an indentation attribute. Note that not all Perseus texts have indentation tagged. If the indentation attribute (rend="align(indent)") is omitted, there is no indication of where print paragraphs start.

In the following example, new print paragraphs begin at the start of section 5, at the start of section 7, and within section 7, and nowhere else. Section 8 also contains a block quote.

<div type="textpart" subtype="section" xml:base="..." n="5">
<p rend="align(indent)">...</p>
</div>
<div type="textpart" subtype="section" xml:base="..." n="6">
<p>...</p>
</div>
<div type="textpart" subtype="section" xml:base="..." n="7">
<p rend="align(indent)">...</p>
<p rend="align(indent)">...</p>   
</div>
<div type="textpart" subtype="section" xml:base="..." n="8">
<p>...<quote rend="blockquote">...</quote></p>
</div>
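
To double-check where print paragraphs are marked once the tagging is done, something like the following could count the indented <p> elements per section. It is a sketch under assumptions: the sample mirrors the example above with the xml:base attributes left out and a wrapping chapter <div> added, there is no TEI namespace, and the function name print_paragraph_starts is made up.

```python
import xml.etree.ElementTree as ET

# Same shape as the example above, wrapped in one chapter div.
SAMPLE = """<div type="textpart" subtype="chapter" n="2">
<div type="textpart" subtype="section" n="5">
<p rend="align(indent)">...</p>
</div>
<div type="textpart" subtype="section" n="6">
<p>...</p>
</div>
<div type="textpart" subtype="section" n="7">
<p rend="align(indent)">...</p>
<p rend="align(indent)">...</p>
</div>
<div type="textpart" subtype="section" n="8">
<p>...<quote rend="blockquote">...</quote></p>
</div>
</div>"""


def print_paragraph_starts(xml_text):
    """Map each section number to its count of indented <p> elements."""
    root = ET.fromstring(xml_text)
    counts = {}
    for div in root.iter("div"):
        if div.get("subtype") != "section":
            continue
        indented = [p for p in div.findall("p")
                    if p.get("rend") == "align(indent)"]
        counts[div.get("n")] = len(indented)
    return counts


print(print_paragraph_starts(SAMPLE))
```

Sections with a count of zero simply continue the print paragraph that started in an earlier section.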
WDRoush commented 3 days ago

https://tufts.box.com/s/1naat149yb5a5mo5m08r1u4n45hq8u4m

I went through and added paragraph markers (using the attribute you suggested), and fixed the validation errors. I used TEI's validator, so things should be good there.

lcerrato commented 20 hours ago

@WDRoush Unfortunately, the work is failing the tests due to duplicate nodes.

The start of Book 3 appears incorrect.

Structure reads:

  • Book 3
  • Section 1
  • Chapter 1 (not nested)
  • Chapter 2 (not nested)
  • Chapter 3 (not nested)
  • Section 2-12 (nested in Chapter 3)
  • Chapter 4 (not nested)
  • Section 13-16 (nested)
  • etc.

This results in a duplicate 3.1 container. And you have Book-Section-Chapter rather than Book-Chapter-Section
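
To see why this produces a duplicate, here is a small sketch that builds the dotted reference for every nested textpart and reports any that occur more than once. It is not the repository's actual test suite, just an approximation of the idea: the sample is a made-up miniature of the Book 3 structure described above, without the TEI namespace, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Section 1 and Chapter 1 are siblings under Book 3, so both become 3.1.
SAMPLE = """<div type="textpart" subtype="book" n="3">
<div type="textpart" subtype="section" n="1"><p>...</p></div>
<div type="textpart" subtype="chapter" n="1"><head>...</head></div>
<div type="textpart" subtype="chapter" n="2"><head>...</head></div>
<div type="textpart" subtype="chapter" n="3"><head>...</head>
<div type="textpart" subtype="section" n="2"><p>...</p></div>
</div>
</div>"""


def citations(div, prefix):
    """Yield the dotted reference of every textpart nested under div."""
    for child in div.findall("div"):
        if child.get("type") != "textpart":
            continue
        ref = prefix + (child.get("n", "?"),)
        yield ".".join(ref)
        yield from citations(child, ref)


def duplicate_citations(xml_text):
    """Return the references that appear more than once, sorted."""
    book = ET.fromstring(xml_text)
    counts = Counter(citations(book, (book.get("n", "?"),)))
    return sorted(ref for ref, seen in counts.items() if seen > 1)


print(duplicate_citations(SAMPLE))
```

Nesting every section inside its chapter (Book-Chapter-Section) gives each container a unique three-part reference and removes the clash.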

lcerrato commented 1 hour ago

@WDRoush I noticed paragraph alignment attributes on <p> tags in the header. These were removed.

lcerrato commented 50 minutes ago

Fixed header info that had volume limitations.