Declaring character entities with CETEIcean?

D-Groenewegen commented 2 years ago

First off, thank you for this wonderful, rather useful piece of software. The relative ease of setting up a TEI project that it affords seems like a great step in making TEI XML more accessible.

When I'm working or just experimenting with TEI documents, I find that many of them have their character entities declared in a separate DTD or .ent file, usually referenced, if I understand things correctly, through a relative link.

For instance, it is quite common for documents from celt.ucc.ie to encode accented characters (e.g. ó, &amacron;) and rarer glyphs such as Tironian et (⁊) as named entities and map them to the corresponding characters.

Without the declarations for those entities, the document fails to render in CETEIcean (XML parsing error).

In these cases, the DTD and ENT files are not always publicly accessible, but I've compiled a list, currently up to 40 entries long, of entity declarations that I can insert manually at the top of the TEI XML document, and that usually does the job.

However, that approach isn't exactly efficient or practical if it needs to be repeated for numerous documents, let alone if the list ever has to be updated. It also requires one to modify the original documents, which detracts from the plug-and-play experience.
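To make the manual workaround concrete, here is a minimal sketch of generating the block of declarations from a lookup table rather than maintaining it by hand. The function name `buildDoctype`, the table `sampleEntities`, and the entity name `et` are assumptions made up for illustration; the real names depend on the documents in question.

```typescript
// Hypothetical entity table: entity names mapped to the characters they stand for.
// The entries are illustrative only; "et" as a name for the Tironian et is an assumption.
const sampleEntities: Record<string, string> = {
  oacute: "\u00F3",  // ó
  amacron: "\u0101", // ā
  et: "\u204A",      // ⁊
};

// Build a DOCTYPE whose internal subset declares every entity in the table.
// Values are written as hexadecimal character references so the output stays ASCII.
function buildDoctype(rootName: string, table: Record<string, string>): string {
  const lines = Object.keys(table).map((name) => {
    const refs = Array.from(table[name])
      .map((ch) => `&#x${ch.codePointAt(0)!.toString(16).toUpperCase()};`)
      .join("");
    return `  <!ENTITY ${name} "${refs}">`;
  });
  return `<!DOCTYPE ${rootName} [\n${lines.join("\n")}\n]>`;
}
```

Calling `buildDoctype("TEI", sampleEntities)` yields a DOCTYPE with one `<!ENTITY ...>` line per table entry, which is the kind of block that currently has to be pasted into each file.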

It would be great if CETEIcean could be told to read from a file containing character references before moving on to transform the TEI document.

hcayless commented 2 years ago

This is an interesting question. Entity resolution is something that happens when the XML document is parsed, so in general, CETEIcean comes in too late to do anything about it: it will fail when trying to read the XML document, because the document won't be well-formed without declarations for the entities it references. Crucially, the browser's XML parser will only load certain types of external entity (those related to HTML or other web stuff). BUT, maybe there's some scope for inserting entity definitions via string manipulation. It would certainly be useful for some older documents.
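A rough sketch of what inserting entity definitions via string manipulation might look like, assuming the XML is available as text before it is parsed. This is not CETEIcean code; `injectDoctype` is a hypothetical helper, and it assumes the document has no DOCTYPE of its own.

```typescript
// Insert a DOCTYPE (carrying entity declarations) into raw XML text before parsing.
// Assumption: if the document already has a DOCTYPE, it is returned unchanged,
// since merging two DOCTYPEs is beyond the scope of this sketch.
function injectDoctype(xml: string, doctype: string): string {
  if (/<!DOCTYPE\s/.test(xml)) return xml;
  // An XML declaration, if present, has to remain the first thing in the file,
  // so the DOCTYPE goes immediately after it.
  const decl = xml.match(/^\s*<\?xml\s[^>]*\?>/);
  if (decl) {
    const end = decl[0].length;
    return xml.slice(0, end) + "\n" + doctype + xml.slice(end);
  }
  return doctype + "\n" + xml;
}
```

The resulting string could then be handed to whatever parses the XML. Whether a given parser actually honours entities declared in an internal subset would need checking in each target browser.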

The other option would be to pre-process the files, resolving the entity references, and use the results instead. But I'll play around with the idea of adding entities. I actually ran across a case where that would have been useful earlier this week...
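For comparison, the pre-processing route could be sketched as a direct substitution over the raw text, so that no entity references are left by the time the XML is parsed. The helper name `resolveEntities` and its table argument are hypothetical, and the sketch deliberately ignores comments and CDATA sections, where references would be replaced as well.

```typescript
// The five entities predefined by XML are left alone; expanding them would put
// bare "<" or "&" characters into the text and break well-formedness.
const PREDEFINED = new Set(["amp", "lt", "gt", "quot", "apos"]);

// Replace named entity references with the characters from the table.
// Numeric references (&#...;) do not match the pattern and are left as they are,
// as are names that the table does not know about.
function resolveEntities(xml: string, table: Record<string, string>): string {
  return xml.replace(/&([A-Za-z_][\w.-]*);/g, (match: string, name: string) => {
    if (PREDEFINED.has(name)) return match;
    return Object.prototype.hasOwnProperty.call(table, name) ? table[name] : match;
  });
}
```

A table value containing `<` or `&` would itself need escaping first; the sketch assumes values are ordinary text characters.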

raffazizzi commented 2 years ago

I would lean towards preprocessing: not all documents are going to be easily plug-and-play. Examples are milestones and stand-off encoding such as a <delSpan> to <anchor>. Depending on rendering goals, some operations may be addressable via CETEIcean behaviors, but others will require some sort of pre-processing of the XML data (or Custom Elements data).

On the other hand, entities are a feature of XML, which makes them more urgent to support than an arbitrary stand-off encoding model. I look forward to seeing what you'll come up with @hcayless; I wonder if having a somewhat standardized way of injecting pre-processing functions may be useful here. I ended up taking that approach in gatsby-transformer-ceteicean.
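One possible shape for such a mechanism, purely as a sketch: treat each pre-processor as a function from XML text to XML text and run a list of them in order. This is an assumption for illustration and does not describe how gatsby-transformer-ceteicean or CETEIcean actually work; all names below are hypothetical.

```typescript
// A pre-processor is modelled here as a plain function from XML text to XML text.
type Preprocessor = (xml: string) => string;

// Run the steps left to right, feeding each step's output into the next.
function runPreprocessors(xml: string, steps: Preprocessor[]): string {
  return steps.reduce((text, step) => step(text), xml);
}

// Two illustrative steps (made up for this sketch).
const stripBom: Preprocessor = (xml) => xml.replace(/^\uFEFF/, "");
const resolveAmacron: Preprocessor = (xml) => xml.replace(/&amacron;/g, "\u0101");
```

With this shape, entity handling becomes one step among others, e.g. `runPreprocessors(source, [stripBom, resolveAmacron])`. Steps that need the DOM rather than text would call for a different signature.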

D-Groenewegen commented 2 years ago

Thanks for your replies and input! I'm currently having ill-timed problems with my laptop so apologies for the delay.

Am I correct in assuming that both approaches would require a preliminary stage of intervention using both DOMParser and XMLSerializer, and that pre-processing would mean that entity references are inserted into the resulting XML, as opposed to direct substitutions of character entities using string manipulation?

Admittedly, I know too little of either process and what each involves to have a strong preference for one or the other. Some criteria that I imagine are worth considering in deciding between the two are:

- The approach to accommodating custom entity references, whether standardised lists like isolat1 or more idiosyncratic additions. Should they be listed in a single .ent file that users can customise (which may be a better fit for pre-processing) or are you envisaging a different format?
- Performance, especially which one scales better with bulkier documents, unless the difference is marginal.
- Potentially, the broader architecture for supporting pre-processing functions, if that's the intention.
- The amount of effort that goes into writing and maintaining code.
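On the first criterion, a user-editable .ent file could in principle be read into a lookup table with a few lines of code. The sketch below assumes the file contains only simple declarations of the form `<!ENTITY name "value">`; the function name `parseEntityFile` is hypothetical, and SGML-style or external declarations are simply skipped.

```typescript
// Parse declarations of the form <!ENTITY name "value"> into a lookup table.
// Parameter entities (<!ENTITY % ...>) and external entities (SYSTEM/PUBLIC) are
// skipped, because the pattern requires a name followed directly by a quoted literal.
function parseEntityFile(text: string): Record<string, string> {
  const table: Record<string, string> = {};
  const pattern = /<!ENTITY\s+([A-Za-z_][\w.-]*)\s+(?:"([^"]*)"|'([^']*)')\s*>/g;
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(text)) !== null) {
    const raw = m[2] !== undefined ? m[2] : m[3];
    // Decode numeric character references such as &#x101; or &#257;.
    table[m[1]] = raw.replace(/&#(x[0-9A-Fa-f]+|[0-9]+);/g, (_all: string, ref: string) =>
      ref[0] === "x"
        ? String.fromCodePoint(parseInt(ref.slice(1), 16))
        : String.fromCodePoint(parseInt(ref, 10))
    );
  }
  return table;
}
```

The same table could feed either approach: generating declarations to insert, or substituting characters directly.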

Anyway, looking forward to your solution!

TEIC / CETEIcean

Declaring character entities with CETEIcean? #55