
Faust Edition Text Generation

This project contains the processing steps that generate the reading-text (rather than diplomatic) representations of the Faust edition, as well as most of the other generated or converted data, with the exception of the diplomatic text representations.

This is work in progress.

Usage

This is mainly used as a submodule of https://github.com/faustedition/faust-gen – the easiest way to run it is to check out that repository including all submodules and run mvn -Pxproc there.
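Assuming git and Maven are available, that boils down to something like:

git clone --recursive https://github.com/faustedition/faust-gen.git
cd faust-gen
mvn -Pxproc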

Alternatively, you need an XProc processor (e.g. XML Calabash) and a local copy of the Faust XML data.

You should then clone this repository and edit the configuration file, config.xml, as you see fit (e.g., enter the path to your copy of the Faust data). You can also leave the config file as it is and pass the relevant values as parameters to the XML processor.

To generate all data, run the pipeline generate-all, e.g., using

calabash generate-all.xpl

This will run all processing steps and generate the HTML data in subdirectories of target by default.

Source Code Details

Basically, we need to perform three steps, in order:

  1. Generate a list of witnesses from the metadata in the Faust XML's documents folder
  2. Generate the HTML fragments that form the apparatus
  3. For each text to render, generate the HTML representation (in split and unsplit form).

All steps read config.xml, and all XSLT stylesheets have the parameters defined there available. All parameters from config.xml can also be passed to the pipelines in the usual way (e.g., via calabash's -p option).
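For example, to override a single configuration value on the command line (the parameter name here is only illustrative – use the names actually defined in config.xml):

calabash -p faust=file:///path/to/faust-xml generate-all.xpl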

List Witnesses: collect-metadata.xpl

The output is a list of <textTranscript> elements; here is an example:

<textTranscript xmlns="http://www.faustedition.net/ns"
        uri="faust://xml/transcript/gsa/391083/391083.xml"
        href="https://github.com/faustedition/faust-gen-html/blob/master/file:/home/vitt/Faust/transcript/gsa/391083/391083.xml"
        document="document/maximen_reflexionen/gsa_391083.xml"
        type="archivalDocument"
        f:sigil="H P160">
   <idno type="bohnenkamp" uri="faust://document/bohnenkamp/H_P160" rank="2">H P160</idno>
   <idno type="gsa_2" uri="faust://document/gsa_2/GSA_25/W_1783" rank="28">GSA 25/W 1783</idno>
   <idno type="gsa_1"
     uri="faust://document/gsa_1/GSA_25/XIX,2,9:2"
     rank="50">GSA 25/XIX,2,9:2</idno>
</textTranscript>

href is the local path to the actual transcript; document is the relative URL of the metadata document. type is either archivalDocument or print. The <idno> elements are ordered by an order of preference defined in the pipeline (depending on type), which is recorded in the respective rank attribute.

Generate Apparatus: collate-variants.xpl

This step performs three substeps that are controlled by additional files:

  1. apply-edits.xpl (for each transcript) – TEI preprocessing, see separate section
  2. extract-lines.xsl (for each transcript) – filters out just those TEI elements that represent the lines used for the apparatus (including their descendant nodes) and augments them with provenance attributes (see the illustrative fragment after this list)
  3. variant-fragments.xsl – sorts and groups the lines, and transforms them to HTML.
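Purely for illustration – element and attribute names in this fragment are assumptions, not the actual output of extract-lines.xsl – an extracted line might look roughly like this, carrying its witness's sigil and source URI as provenance attributes:

<l xmlns="http://www.tei-c.org/ns/1.0"
   xmlns:f="http://www.faustedition.net/ns"
   n="1"
   f:sigil="H P160"
   f:uri="faust://xml/transcript/gsa/391083/391083.xml">…line text…</l>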

Preprocessing the TEI files: apply-edits.xpl

This removes the genetic markup from the textual transcripts by applying the edits indicated by the markup. Thus, the result represents the last state of the text in the input document.
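As a rough sketch of the principle (this is not one of the actual stylesheets used here, just a minimal illustration), applying the edits essentially amounts to an identity transform that drops deleted material and keeps corrected readings:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:tei="http://www.tei-c.org/ns/1.0"
                version="2.0">
  <!-- copy everything by default -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- deleted text is not part of the last state, so drop it -->
  <xsl:template match="tei:del"/>

  <!-- of a sic/corr pair, keep only the corrected reading -->
  <xsl:template match="tei:choice[tei:corr]">
    <xsl:apply-templates select="tei:corr/node()"/>
  </xsl:template>

  <!-- additions stay, but the wrapper element is unwrapped -->
  <xsl:template match="tei:add">
    <xsl:apply-templates/>
  </xsl:template>
</xsl:stylesheet>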

The document is passed through the following steps:

  1. textTranscr_pre_transpose.xsl normalizes references inside ge:transpose
  2. textTranscr_transpose.xsl applies transpositions
  3. emend-core.xsl (previous name: textTranscr_fuer_Drucke.xsl) applies genetic markup (del, corr etc.) and performs character normalizations as well as a number of other normalizations. This also includes the rules from harmonize-antilabes.xsl, which transform antilabe encodings from the join form to the part form, so that only one form needs to be handled in further processing.
  4. text-emend.xsl applies genetic markup that uses spanTo etc. Note that this step will remove text if the input contains delSpan elements pointing to a non-existent anchor; the script prints a warning when it detects such a case.
  5. clean-up.xsl removes TEI containers that are empty after the steps above.
  6. prose-to-lines.xsl transforms the <p>-based markup of Trüber Tag. Feld. into <lg>/<l>-based markup like that of the verse parts, to ease collation (sketched below).
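Schematically (the content and segmentation here are placeholders; the actual logic is more involved), a prose paragraph like

<p xmlns="http://www.tei-c.org/ns/1.0">…prose text…</p>

ends up along the lines of

<lg xmlns="http://www.tei-c.org/ns/1.0">
  <l>…prose text…</l>
</lg>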

Generate the master HTML files: print2html.xpl

Steps:

  1. apply-edits.xpl, TEI normalization, see above
  2. resolve-pbs.xsl augments <pb> elements with a normalized page number (see the illustrative example after this list)
  3. print2html.xsl, the actual transformation to HTML
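For illustration only – the added attribute is an assumption, not necessarily what resolve-pbs.xsl actually produces – a page break such as

<pb xmlns="http://www.tei-c.org/ns/1.0" n="[23]"/>

might be augmented along the lines of

<pb xmlns="http://www.tei-c.org/ns/1.0"
    xmlns:f="http://www.faustedition.net/ns"
    n="[23]" f:page="23"/>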

The page map

When generating HTML from longer documents, these are split into multiple HTML files along TEI <div> elements. This behaviour can be configured in the configuration file.

To find out which page is where, we generate an index that maps faust:// URIs and pages to HTML file names. This is a two-step process: the print2html.xpl pipeline generates an XML summary outlining the files and pages of a single document (see pagemap.xsl for details), and pagelist2json.xsl then converts the information from all of these documents into a single JSON file. You can then generate links of the form filename#dtpagenumber to link to the individual files.
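For example (file name and page number are purely illustrative), a link like

print_v1.html#dt23

would point to page 23 within that HTML file.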

Additional source files

Experimental stuff

Einblendungsapparat

There is experimental code to generate an Einblendungsapparat as well. This kind of apparatus is based on the first state of the text, not the last, and it marks later edits to the text with special markup, using editorial notes in 〈angled brackets〉. The current implementation is still unfinished and renders only the most frequent kinds of edits.

The CSS rules required for the apparatus are currently at the end of lesetext.css. Please again note that this is a moving target.