Fix entities when checking database

ankitects / anki

Anki's shared backend and web components, and the Qt frontend

https://apps.ankiweb.net

Fix entities when checking database #1087

Open RumovZ opened 3 years ago

RumovZ commented 3 years ago

Notes can contain invalid HTML entities, mainly because the user imported unencoded text as HTML. This doesn't pose a problem for Anki's webviews but the Rust part doesn't handle it that well, unfortunately. It would be nice if checking the database:

converted entities to Unicode where that's safe (as the Rust crate doesn't recognise the more exotic entities),
encoded &s that don't belong to any valid entity.

In fact, this is exactly what happens when a field is opened in the HTML editor. Since there doesn't seem to be a Rust crate that supports advanced decoding, maybe the webtools can somehow be utilised here?
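To make the two requested behaviours concrete, here is a rough, std-only sketch of what such a pass might do. Everything in it is an assumption for illustration: `normalize_entities`, `classify`, `Action` and the tiny `NAMED` table are hypothetical names, not Anki code, and a real pass would need the full HTML5 entity list. Unknown named entities are left untouched because they may be valid "exotic" ones, and references that decode to markup-significant characters stay encoded.

```rust
/// Hypothetical, deliberately tiny table of named entities. A real pass
/// would need the full HTML5 list.
const NAMED: &[(&str, char)] = &[("nbsp", '\u{a0}'), ("copy", '\u{a9}'), ("eacute", '\u{e9}')];

/// What to do with one `&name;` candidate.
enum Action {
    Decode(char), // safe to replace with the Unicode character
    Keep,         // leave the text exactly as written
    Escape,       // not a valid reference: encode the `&`
}

fn classify(name: &str) -> Action {
    // Numeric references: `#123` (decimal) or `#x7b` (hexadecimal).
    let number = if let Some(hex) = name.strip_prefix("#x").or_else(|| name.strip_prefix("#X")) {
        Some(u32::from_str_radix(hex, 16).ok())
    } else if let Some(dec) = name.strip_prefix('#') {
        Some(dec.parse::<u32>().ok())
    } else {
        None
    };
    match number {
        Some(code) => match code.and_then(char::from_u32) {
            // Decoding these would change the markup, so keep them encoded.
            Some(c) if "&<>\"'".contains(c) || c.is_control() => Action::Keep,
            Some(c) => Action::Decode(c),
            None => Action::Escape,
        },
        None => match NAMED.iter().find(|(n, _)| *n == name) {
            Some(&(_, c)) => Action::Decode(c),
            // `amp`, `lt`, ... and unknown names are left alone.
            None => Action::Keep,
        },
    }
}

/// Decode entities where that is safe and encode stray `&`s.
pub fn normalize_entities(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(pos) = rest.find('&') {
        out.push_str(&rest[..pos]);
        let after = &rest[pos + 1..];
        // A candidate is a short run of alphanumerics or `#` ending in `;`.
        let end = after.find(';').filter(|&e| {
            e > 0 && e <= 32 && after[..e].chars().all(|c| c.is_ascii_alphanumeric() || c == '#')
        });
        match end {
            None => {
                // Stray ampersand: encode it and continue after it.
                out.push_str("&amp;");
                rest = after;
            }
            Some(e) => {
                let name = &after[..e];
                match classify(name) {
                    Action::Decode(c) => out.push(c),
                    Action::Keep => {
                        out.push('&');
                        out.push_str(name);
                        out.push(';');
                    }
                    Action::Escape => {
                        out.push_str("&amp;");
                        out.push_str(name);
                        out.push(';');
                    }
                }
                rest = &after[e + 1..];
            }
        }
    }
    out.push_str(rest);
    out
}
```

With this sketch, `normalize_entities("AT&T")` yields `AT&amp;T`, while `&lt;` and `&#60;` pass through unchanged so that text which looks like a tag is not turned into one.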

dae commented 3 years ago

html5ever (which is already used in the codebase) can handle broken HTML and arbitrary entities:

echo 'hello & also &#4023; world' | cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.02s
     Running `/Users/dae/Local/rust/debug/tmpa`
<!DOCTYPE html>
<html><head></head><body>hello &amp; also ྷ world
</body></html>

It's not a lightweight solution though, and I'd probably want it to be used when saving notes for a while before feeling confident enough to expose it to bulk operations like a DB check, as there's always a risk that normalizing content will cause changes users don't want.

dae commented 3 years ago

(For solving the more immediate problem of stripping HTML, we could potentially run it when we get an error back from decode_entities, without persisting changes to the DB.)
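
The fallback idea could look roughly like the following sketch. It is only an outline under assumptions: `decode_with_fallback` is a hypothetical helper, `decode` stands in for the strict decoder (such as decode_entities), and `repair` stands in for an html5ever-based cleanup. Neither real signature is taken from the codebase.

```rust
/// Try the strict decoder first. Only if it fails, repair a temporary copy
/// and retry. `stored` is borrowed immutably, so nothing is written back.
pub fn decode_with_fallback<D, R, E>(stored: &str, decode: D, repair: R) -> Result<String, E>
where
    D: Fn(&str) -> Result<String, E>, // stand-in for the strict decoder
    R: Fn(&str) -> String,            // stand-in for the lenient repair pass
{
    match decode(stored) {
        Ok(text) => Ok(text),
        // The repaired string lives only for this call; the caller's
        // (and the database's) value stays as it was.
        Err(_) => decode(&repair(stored)),
    }
}
```

The common case pays nothing extra, since the repair closure runs only after a decode error, and the stored field content is never modified.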