Open RumovZ opened 3 years ago
html5ever (which is already used in the codebase) can handle broken HTML and arbitrary entities:
echo 'hello & also ྷ world' | cargo run
Finished dev [unoptimized + debuginfo] target(s) in 0.02s
Running `/Users/dae/Local/rust/debug/tmpa`
<!DOCTYPE html>
<html><head></head><body>hello & also ྷ world
</body></html>
It's not a lightweight solution, though, and I'd probably want it in use for a while when saving notes before feeling confident enough to expose it to bulk operations like a DB check, as there's always a risk that normalizing content will cause changes users don't want.
(For solving the more immediate problem of stripping HTML, we could potentially run it when we get an error back from decode_entities, without persisting changes to the DB.)
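A rough sketch of that fallback idea, using only the standard library. Everything here is an assumption made for illustration: `strict_decode` is a toy stand-in for decode_entities (whose real signature and error type aren't shown here), `lenient_decode` stands in for the html5ever pass, and `KNOWN` is a deliberately tiny entity table. The point is only the control flow: try the strict path first, fall back to the tolerant one on error, and work on a borrowed `&str` so nothing is written back to the note.

```rust
// Hypothetical, deliberately tiny entity table used by both decoders.
const KNOWN: [(&str, char); 4] = [("&amp;", '&'), ("&lt;", '<'), ("&gt;", '>'), ("&quot;", '"')];

/// Toy stand-in for a strict decoder: fails on a bare `&` or an unknown entity.
fn strict_decode(text: &str) -> Result<String, String> {
    let mut out = String::with_capacity(text.len());
    let mut rest = text;
    while let Some(pos) = rest.find('&') {
        out.push_str(&rest[..pos]);
        let tail = &rest[pos..];
        // An entity must be terminated by `;` somewhere after the `&`.
        let end = tail.find(';').ok_or("unterminated entity")?;
        let candidate = &tail[..=end];
        match KNOWN.iter().find(|(name, _)| *name == candidate) {
            Some((_, ch)) => out.push(*ch),
            None => return Err(format!("unknown entity {}", candidate)),
        }
        rest = &tail[end + 1..];
    }
    out.push_str(rest);
    Ok(out)
}

/// Toy stand-in for the tolerant pass: anything unrecognised after `&`
/// is kept as a literal ampersand instead of raising an error.
fn lenient_decode(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    let mut rest = text;
    while let Some(pos) = rest.find('&') {
        out.push_str(&rest[..pos]);
        let tail = &rest[pos..];
        match KNOWN.iter().find(|(name, _)| tail.starts_with(*name)) {
            Some((name, ch)) => {
                out.push(*ch);
                rest = &tail[name.len()..];
            }
            None => {
                out.push('&');
                rest = &tail[1..];
            }
        }
    }
    out.push_str(rest);
    out
}

/// Strict first, tolerant only on error. The field is borrowed, so the
/// stored note content is never modified by this function.
fn decode_for_stripping(field: &str) -> String {
    strict_decode(field).unwrap_or_else(|_| lenient_decode(field))
}
```

With this shape, `decode_for_stripping("hello & also world")` takes the tolerant path and returns the text with the ampersand kept as-is, while well-formed input such as `"a &amp; b"` never touches the fallback.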
Notes can contain invalid HTML entities, mainly because the user imported unencoded text as HTML. This doesn't pose a problem for Anki's webviews, but the Rust part doesn't handle it that well, unfortunately. It would be nice if checking the database:
In fact, this is exactly what happens when a field is opened in the HTML editor. Since there doesn't seem to be a Rust crate that supports advanced decoding, maybe the webtools can somehow be utilised here?