Simon-Initiative / course-digest

Tool to produce a summary or digest of OLI course package contents
MIT License

[BUGFIX] preserve numerically encoded nbsp characters [MER-2834] #227

Closed andersweinstein closed 10 months ago

andersweinstein commented 11 months ago

Logic and Proofs includes Unicode no-break space characters written as the numeric character reference `&#160;`. It uses sequences of these with strikethrough styling to draw the horizontal line that separates premises from conclusion in multiline presentations of an argument structure. These nbsp characters were not getting through the migration tool.

These character references should be understood by any XML processor, as should the hex form `&#xA0;` or the unencoded Unicode nbsp character embedded directly in the UTF-8 character stream. However, the cheerio parser appears to have a bug in whitespace normalization: it applies normalization to these nbsp characters and collapses sequences of them into a single normal space, at least in the XML mode we are using. (HTML mode with entity decoding might rewrite them to the `&nbsp;` HTML entity reference in the output HTML, but `&nbsp;` is not automatically defined as a named entity in XML.)
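As an illustration of how this kind of collapse can happen, here is a minimal sketch. It is not cheerio's actual code, and the function names `naiveNormalize` and `safeNormalize` are hypothetical. It relies only on the fact that JavaScript's `\s` character class matches U+00A0, so a normalizer built on `\s+` treats nbsp as collapsible whitespace, while one restricted to the four XML whitespace characters does not.

```typescript
// Illustration only: a naive normalizer, NOT cheerio's real implementation.
const NBSP = "\u00A0";

// Collapses every run matched by \s. In JavaScript, \s includes U+00A0,
// so a run of nbsp characters becomes one ordinary space.
function naiveNormalize(text: string): string {
  return text.replace(/\s+/g, " ");
}

// Collapses only the four XML whitespace characters
// (space, tab, CR, LF), so U+00A0 is left untouched.
function safeNormalize(text: string): string {
  return text.replace(/[ \t\r\n]+/g, " ");
}

const rule = NBSP.repeat(5); // stand-in for the strikethrough "line"
console.log(naiveNormalize(rule).length); // 1
console.log(safeNormalize(rule).length); // 5
```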

This PR preserves them by rewriting any nbsp characters to `&nbsp;` before the whitespace-normalizing parse. `&nbsp;` references were defined in legacy OLI XML, and the migration tool already includes special code to decode them properly when converting text to JSON.
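The pre-parse rewrite can be sketched roughly as follows. This is an assumption about the shape of the fix, not the code in this PR. The helper name `protectNbsp` is hypothetical, and the sketch operates on the raw XML string before it is handed to any parser.

```typescript
// Hypothetical helper: rewrite every encoding of U+00A0 to the named
// reference &nbsp; so a whitespace-normalizing parse cannot collapse it.
const NBSP_CHAR = "\u00A0";

function protectNbsp(xml: string): string {
  return xml
    // decimal numeric reference, allowing leading zeros: &#160; &#0160;
    .replace(/&#0*160;/g, "&nbsp;")
    // hex numeric reference, either digit case: &#xA0; &#xa0; &#x00A0;
    .replace(/&#x0*[aA]0;/g, "&nbsp;")
    // literal no-break space characters in the character stream
    .split(NBSP_CHAR)
    .join("&nbsp;");
}

const input = "<em>&#160;&#xA0;" + NBSP_CHAR + "</em>";
console.log(protectNbsp(input)); // <em>&nbsp;&nbsp;&nbsp;</em>
```

A plain string rewrite like this is blind to document structure, so it would also touch nbsp characters inside comments or CDATA sections. Whether that matters depends on the course content being migrated.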