benwbrum / fromthepage

FromThePage is a wiki-like application for crowdsourcing transcription of handwritten documents.
http://fromthepage.com
GNU Affero General Public License v3.0

Small TEI Export Problems #1036

Open benwbrum opened 6 years ago

benwbrum commented 6 years ago

Several problems exist in the TEI-XML export format, most of which are too small to warrant their own issue.

No URL in facs

The pb element at the beginning of each page contains a facs attribute that is supposed to point to the location of the page facsimile. Currently it is a relative URL rather than an absolute URL usable by the systems we export to:

         <pb xml:id="F7884" n="3" facs="/image-service/7884/full/full/0/native.jpg" />
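One way to produce an absolute value is to resolve the relative path against the base URL of the installation. This is only a sketch: the helper name `absolute_facs` and the host `https://example.org` are assumptions for illustration, not names from the FromThePage codebase.

```ruby
require "uri"

# Hypothetical helper: resolve a relative facs path against the base URL
# of the site doing the export. The base URL is an assumed input here.
def absolute_facs(base_url, facs_path)
  URI.join(base_url, facs_path).to_s
end

puts absolute_facs("https://example.org", "/image-service/7884/full/full/0/native.jpg")
```

Passing the base URL in as a parameter, rather than hardcoding it, keeps the export usable on self-hosted installations.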

All paragraphs have corresp

Bilingual texts are supposed to link paragraphs using a corresp attribute. We do this correctly for works which support translation, but also generate a corresp attribute on p elements when we don't support translation, leaving a dangling reference:

            <p corresp="TTP7884P0" xml:id="OTP7884P0">
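The intended behaviour could be sketched as follows: always emit xml:id, and emit corresp only when the work supports translation. The method name, the `supports_translation` keyword, and the reading of the ID pattern (prefix, page number, paragraph index) are assumptions inferred from the example above, not the actual export code.

```ruby
# Hypothetical sketch: build the attributes for an exported <p> element.
# The ID pattern mirrors the example in this issue; its meaning is assumed.
def paragraph_attributes(page_id, index, supports_translation:)
  attrs = { "xml:id" => "OTP#{page_id}P#{index}" }
  # Only link to a translated paragraph when one can exist.
  attrs["corresp"] = "TTP#{page_id}P#{index}" if supports_translation
  attrs
end

p paragraph_attributes(7884, 0, supports_translation: false)
p paragraph_attributes(7884, 0, supports_translation: true)
```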

TEI is not correctly formatted.

Simply reading the rendered TEI into Nokogiri and writing it back out as the last step of the rendering process would fix the indentation problems that come from sections of XML generated by views and helper functions.
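The parse-and-reserialize idea can be sketched like this. To keep the example runnable without extra gems it uses REXML from the standard Ruby distribution as a stand-in for Nokogiri; the method name `reindent_tei` and the sample document are assumptions. Note that a pretty-printer normalizes whitespace inside text, which may matter for transcriptions with mixed content.

```ruby
require "rexml/document"

# Hypothetical helper: parse rendered XML and write it back out with
# consistent two-space indentation. REXML stands in for Nokogiri here.
def reindent_tei(xml)
  doc = REXML::Document.new(xml)
  formatter = REXML::Formatters::Pretty.new(2)
  formatter.compact = true # keep text-only elements on a single line
  out = +""
  formatter.write(doc, out)
  out
end

ragged = "<text><body>\n        <p>Hi</p>\n</body></text>"
puts reindent_tei(ragged)
```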

Subjects within section headers are not linked

If a subject is marked up within a section header, the generated TEI seems to omit the whole rs tag replacement we use elsewhere:

               <head depth="2">69  Mark Veazey PP Bro William Dr.</head>

benwbrum commented 6 years ago

In addition to these:

Remove revisionList

The list of edits we produce in the revisionList is too long to be meaningful, so it should be removed entirely.

Move people and places out of taxonomy

The changes that added non-people/place subjects to the teiHeader also moved people and places out of personList and placeList and into taxonomy. This should be reversed, so that only the other subjects remain in taxonomy.
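The routing described here might look roughly like the sketch below. The `Subject` struct, the category names "People" and "Places", and the sample records are all assumptions for illustration; the real data model may differ.

```ruby
# Hypothetical subject record; the category names are assumed.
Subject = Struct.new(:title, :category)

# Send people and places to their own lists; everything else to taxonomy.
def partition_subjects(subjects)
  groups = { "personList" => [], "placeList" => [], "taxonomy" => [] }
  subjects.each do |subject|
    target = case subject.category
             when "People" then "personList"
             when "Places" then "placeList"
             else "taxonomy"
             end
    groups[target] << subject.title
  end
  groups
end

sample = [
  Subject.new("Mark Veazey", "People"),
  Subject.new("Sample Town", "Places"),
  Subject.new("Sample Topic", "Topics")
]
p partition_subjects(sample)
```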

bencomp commented 6 years ago

Regarding the revision list: I do like it, but it is indeed very long – mostly because even when importing a work from IIIF or PDF, each page 'import' is a change.

Another small issue I encountered is that FromThePage.com is hardcoded in the subelements of <editionStmt>: <resp>Initial upload of this work's facsimile images and metadata to FromThePage.com for editing</resp> is exported from our local FromThePage.
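A minimal sketch of what a configurable version could look like, assuming the site name is available from the installation's configuration. The method name `upload_resp` and the `site_name` parameter are hypothetical, not existing FromThePage code.

```ruby
require "cgi"

# Hypothetical helper: build the <resp> text with the local site name
# instead of a hardcoded "FromThePage.com". site_name is an assumed input.
def upload_resp(site_name)
  "<resp>Initial upload of this work's facsimile images and metadata " \
    "to #{CGI.escapeHTML(site_name)} for editing</resp>"
end

puts upload_resp("Our Local FromThePage")
```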

benwbrum commented 1 year ago

The facs issue is fixed by #3388

benwbrum commented 2 months ago

Corresp is covered by #4208