CoEDL / nyingarn-workspace

The Nyingarn data ingest and preparation application
GNU General Public License v3.0

Download TEI stub files as valid TEI documents #66

Open marcolarosa opened 2 years ago

marcolarosa commented 2 years ago

Downloading the stub files should return a valid TEI document that is constructed on the fly.

Conal-Tuohy commented 2 years ago

I have a draft XSLT from a while back for stitching the surface files together. It's not complete, but I think it needs only a little more debugging to do that part.

Also, the TEI file has a metadata header (a teiHeader element) which should be populated with inputs from other sources: the original source file (before it was split into surfaces), the describo database, the RO-Crate file, or wherever. The teiHeader is able to store a tonne of different metadata, but its absolute minimum requirements are:

Any metadata we have which corresponds to those elements should get inserted into the teiHeader.
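As a point of reference, the TEI P5 Guidelines require a teiHeader to contain a fileDesc holding a titleStmt (with at least one title), a publicationStmt and a sourceDesc. A minimal sketch of building such a header in Python follows. The function name `build_minimal_header` and its parameters are illustrative assumptions and are not taken from the Nyingarn code; in practice the values would come from whichever metadata source is available.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Serialise TEI elements without a prefix (default namespace).
ET.register_namespace("", TEI_NS)


def build_minimal_header(title, publication, source):
    """Build a teiHeader holding only the mandatory fileDesc children.

    Hypothetical helper: the three arguments are plain strings supplied
    by the caller from whatever metadata is on hand.
    """
    header = ET.Element(f"{{{TEI_NS}}}teiHeader")
    file_desc = ET.SubElement(header, f"{{{TEI_NS}}}fileDesc")

    # titleStmt must contain at least one title.
    title_stmt = ET.SubElement(file_desc, f"{{{TEI_NS}}}titleStmt")
    ET.SubElement(title_stmt, f"{{{TEI_NS}}}title").text = title

    # publicationStmt and sourceDesc may each be given as simple prose.
    pub_stmt = ET.SubElement(file_desc, f"{{{TEI_NS}}}publicationStmt")
    ET.SubElement(pub_stmt, f"{{{TEI_NS}}}p").text = publication

    source_desc = ET.SubElement(file_desc, f"{{{TEI_NS}}}sourceDesc")
    ET.SubElement(source_desc, f"{{{TEI_NS}}}p").text = source

    return header
```

Calling `ET.tostring(build_minimal_header(...), encoding="unicode")` gives the serialised header, which can then be extended with any richer metadata that is available.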

Conal-Tuohy commented 2 years ago

There are two stylesheets which reconstitute a full TEI file from the <surface> elements contained in the "stub" XML files and a <teiHeader> element copied from the originally ingested file (if the original file was TEI) or from metadata encoded in the digivol CSV (if the uploaded file was a digivol CSV):
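The general shape of that reassembly can be sketched in Python rather than XSLT. This is only an illustration under stated assumptions, not the stylesheets themselves: it assumes each stub is either a bare <surface> or a wrapper whose direct children include <surface> elements, and it places the surfaces in a <sourceDoc>, which is one of the containers TEI allows for them. The name `assemble_tei` is hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Serialise TEI elements without a prefix (default namespace).
ET.register_namespace("", TEI_NS)


def assemble_tei(header_xml, stub_xmls):
    """Combine one teiHeader and many stub documents into one TEI tree.

    header_xml: string containing a serialised teiHeader element.
    stub_xmls:  iterable of strings, one per stub file, in page order.
    """
    tei = ET.Element(f"{{{TEI_NS}}}TEI")
    tei.append(ET.fromstring(header_xml))

    # Assumption: surfaces are collected in a sourceDoc element.
    source_doc = ET.SubElement(tei, f"{{{TEI_NS}}}sourceDoc")
    surface_tag = f"{{{TEI_NS}}}surface"

    for stub_xml in stub_xmls:
        root = ET.fromstring(stub_xml)
        if root.tag == surface_tag:
            # The stub is itself a surface.
            surfaces = [root]
        else:
            # The stub wraps its surfaces as direct children.
            surfaces = root.findall(surface_tag)
        source_doc.extend(surfaces)

    return tei
```

The surfaces keep the order in which the stubs are supplied, so the caller is responsible for passing them in page order.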

marcolarosa commented 1 year ago

@Conal-Tuohy Can we make this one file?

Minimum viable metadata list as per comment 5/7.

marcolarosa commented 1 year ago

@Conal-Tuohy This is actually in reference (as you thought) to being able to download a valid TEI document for a whole item. That is, a document containing all of the page surfaces and the TEI header.

Conal-Tuohy commented 1 year ago

So this is working, but I still have a bug which affects only the reconstitution of documents with a complex structure. These are documents with a hierarchy which cross-cuts their pagination (i.e. containing logical structures which did not fall entirely within a single page). The ingestion stylesheet splits those sections at the page boundaries, and the bug in the re-assembly stylesheet means that those sections are not rejoined. I'm going to issue a PR anyway, since this is only a limitation rather than a blocker, and work on debugging it while @marcolarosa can work on integrating the reassembly with the user's workflow.
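To illustrate what rejoining involves, here is a rough Python sketch. It rests on assumptions that are not confirmed by the stylesheets: that each fragment of a split section is marked with the TEI `part` attribute ("I" for initial, "M" for medial, "F" for final), that the fragments arrive as a flat list in page order, and that they hold only child elements (no loose text). The name `rejoin_fragments` is hypothetical.

```python
import xml.etree.ElementTree as ET


def rejoin_fragments(fragments):
    """Merge runs of elements marked part="I" ... part="F" into one element.

    Elements without a part attribute are copied through unchanged.
    Assumes fragments contain only child elements, so text and tail
    handling is deliberately left out of this sketch.
    """
    joined = []
    current = None  # the section currently being reassembled, if any

    for frag in fragments:
        part = frag.get("part", "N")

        # A medial or final fragment continues the open section.
        if part in ("M", "F") and current is not None and current.tag == frag.tag:
            current.extend(list(frag))
            if part == "F":
                current = None  # the section is now complete
            continue

        # Otherwise start a new output element, dropping the part marker.
        attrs = {k: v for k, v in frag.attrib.items() if k != "part"}
        merged = ET.Element(frag.tag, attrs)
        merged.extend(list(frag))
        joined.append(merged)
        current = merged if part == "I" else None

    return joined
```

For example, three `div` fragments marked I, M and F on consecutive pages come back as a single `div` holding the children of all three, while an unmarked `div` is returned on its own.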