Simon-Initiative / course-digest

Tool to produce a summary or digest of OLI course package contents
MIT License

[FEATURE] Handle html5 entities #73

Closed darrensiegel closed 2 years ago

darrensiegel commented 2 years ago

This PR adds more robust support for converting encoded entities from legacy content into their literal characters within the Torus JSON data models.

In theory, only XML entities are supported by the legacy content model (because these files are XML). That means only the named escapes such as &amp;, &apos; and &lt;, plus the decimal and Unicode character references such as &#169; and &#x2206;. XML does not support the extended collection of named entities such as &times;.

But it appears that the OLI runtime supports this more complete set of HTML5 entities, as &times; is found several hundred times throughout the chemistry courses.

When mutating an XML document, Cheerio only supports the XML subset, so it thinks that the ampersand in &times; needs to be escaped and ends up writing it out as &amp;times;.
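
For illustration, here is a minimal sketch of the kind of decoding described above. It is not the code from this PR. The function name decodeEntities and the small NAMED_ENTITIES table are hypothetical, and a real table would need to cover the full HTML5 named entity list. It shows a single regex pass that turns named, decimal and hexadecimal references into literal characters and leaves unknown names alone.

```typescript
// Hypothetical, deliberately tiny lookup table (an assumption for this sketch).
// A real implementation would need the full HTML5 named entity list.
const NAMED_ENTITIES = new Map<string, string>([
  ['amp', '&'],
  ['apos', "'"],
  ['lt', '<'],
  ['gt', '>'],
  ['quot', '"'],
  ['times', '\u00D7'],
  ['copy', '\u00A9'],
]);

// Matches &name;, &#NNN; (decimal) and &#xHHH; (hexadecimal) references.
const ENTITY_PATTERN = /&(#x[0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);/g;

// Replace entity references with their literal characters in one pass.
// Unknown names and out-of-range code points are left untouched.
function decodeEntities(text: string): string {
  return text.replace(ENTITY_PATTERN, (match: string, body: string): string => {
    if (body.startsWith('#')) {
      const isHex = body.startsWith('#x');
      const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // String.fromCodePoint throws above U+10FFFF, so guard first.
      return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
    }
    return NAMED_ENTITIES.get(body) ?? match;
  });
}
```

Because the replacement runs in a single pass, the double-escaped form &amp;times; decodes only one level, to the text &times;, rather than all the way to the multiplication sign. Whether that is the desired behavior for already double-escaped content is a design choice this sketch does not settle.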