edgi-govdata-archiving / web-monitoring-diff

Tools for diffing and comparing web content. Also includes a web server that makes diffs available as an HTTP service.
https://web-monitoring-diff.readthedocs.io/
GNU General Public License v3.0

Replace cChardet with something compatible with current Python versions #165

Open Mr0grog opened 8 months ago

Mr0grog commented 8 months ago

cChardet is no longer maintained and is not readily compatible with the last two major releases of Python (3.11 and 3.12), so we probably need to replace it: https://github.com/PyYoshi/cChardet/issues/81

I did a bunch of research and testing a few weeks ago on alternatives that I still need to write up, but the bottom line is that there aren’t really any good options. What’s on the table:

  1. Go back to chardet. It’s pure Python and still works, but it is less accurate than the other options, really slow, and blocking, which is not great.

  2. Switch to charset-normalizer. It is also pure Python and claims drastically improved accuracy and performance over chardet, but this isn’t actually consistent or broadly true in my testing. It’s highly dependent on having encoding declarations in the content being sniffed as a shortcut, and in all other cases is much slower and has similar accuracy to chardet. Since we already check for declarations, we’ll only see the slowest cases here.

    (OTOH, it is sometimes more accurate if the declaration is wrong, since it only treats the declaration as a hint. But there’s some reasonable debate over whether that’s the right thing to do, since it differs from how browsers behave. During testing I also learned a lot about how browsers treat declarations, which is much more complicated and nuanced than I’d realized, and charset-normalizer doesn’t leverage the hints as well as I now understand it could — I should probably file some issues.)

  3. Switch to faust-cchardet, which is a fork of cChardet with patches to make it work in modern Pythons. Unfortunately, it uses problematic naming that could break things in an environment with other packages that rely on cChardet, since it takes over the cchardet import name rather than using its own. The author has expressed some vague interest in taking over cchardet, which would solve the issue, but doesn’t seem to actually be moving forward on it (https://github.com/faust-streaming/cChardet/issues/32). Absent that, I worry this creates complex dependency issues in any situation where someone installs web-monitoring-diff as a library, or where it is installed in a Python environment alongside other CLI tools.

    I’m also a little concerned that there’s not any strong energy for long-term maintenance on this one, and switching to it could just land us in the same situation as we are currently in.

  4. Switch to chardetng-py, a Python wrapper around chardetng, which is written in Rust and used in Firefox. It is much more accurate than chardet or charset-normalizer, and also much faster (between half as fast and just as fast as cChardet). It supports a much more limited set of encodings, though. These days, browsers generally support a more constrained set of encodings and have a dedicated spec all about it, so to the extent that we want to act like a browser does, that’s fine.

    One complex downside here is that this requires much more careful handling of the encodings it finds, because the names it returns don’t always indicate the same decoders that Python uses for those names: https://github.com/john-parton/chardetng-py/issues/11

    I’m also a little concerned that there’s not strong energy for long-term maintenance on this one, and switching to it could just land us in the same situation as we are currently in. There’s definitely not much intent to update except for really serious bugs in the underlying chardetng library, it seems: https://github.com/hsivonen/chardetng/issues/13, https://github.com/hsivonen/chardetng/pulls
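
For context on the point in (2) about declarations being checked before any sniffing happens, here is a minimal sketch of that kind of pre-check. It is not the project's actual code: `declared_encoding`, `pick_encoding`, the 1024-byte scan limit, and the regular expression are all assumptions made for illustration, and the check is far simpler than what browsers really do.

```python
import codecs
import re
from typing import Callable, Optional

# Byte-order marks are checked first. UTF-32 is left out to keep the sketch short.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)

# Loose match for <meta charset="..."> and for the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def declared_encoding(raw: bytes, scan_limit: int = 1024) -> Optional[str]:
    """Return the encoding named by a BOM or <meta> tag, or None (hypothetical helper)."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    match = _META_CHARSET.search(raw[:scan_limit])
    if match:
        return match.group(1).decode("ascii").lower()
    return None


def pick_encoding(
    raw: bytes,
    sniff: Callable[[bytes], Optional[str]],
    default: str = "utf-8",
) -> str:
    """Use the declaration if there is one; only call the detector otherwise."""
    return declared_encoding(raw) or sniff(raw) or default
```

Here `sniff` stands in for whichever detector ends up being chosen. Because it only runs when no declaration is found, the detector only ever sees undeclared content, which is the slow case for charset-normalizer described above.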

Of the available options, I think (4) is probably best, followed by (2). The biggest problems with (4) are the maintenance concerns and the special handling needed for the encoding names it returns. I’m not super happy with any of these, though. 😞
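
As a rough illustration of the special handling that option (4) would need, the sketch below translates browser-style encoding names into the Python codec that seems closest in behavior. The table is an assumption based on a reading of the WHATWG Encoding Standard, not something taken from chardetng-py or the linked issue. It is deliberately incomplete, the Big5 entry in particular is only approximate, and `python_codec_for` is a hypothetical helper name.

```python
import codecs

# Hypothetical, non-exhaustive mapping from WHATWG-style encoding names to the
# Python codec that decodes most like a browser would. Python's codecs with the
# same names are stricter, so a direct lookup can fail on real-world content.
WHATWG_TO_PYTHON = {
    "shift_jis": "cp932",   # browsers accept the Windows extensions
    "euc-kr": "cp949",      # browsers treat EUC-KR as the Windows-949 superset
    "gbk": "gb18030",       # browsers decode GBK with the gb18030 decoder
    "big5": "big5hkscs",    # approximate: browser Big5 includes HKSCS extensions
}


def python_codec_for(name: str) -> str:
    """Return the canonical Python codec name to use for a detected encoding.

    Raises LookupError if Python has no codec for the (possibly remapped) name.
    """
    key = name.strip().lower()
    candidate = WHATWG_TO_PYTHON.get(key, key)
    return codecs.lookup(candidate).name
```

For example, the character `①` encodes in `cp932` but is not part of Python's stricter `shift_jis` codec, so bytes that a browser would decode as Shift_JIS can raise `UnicodeDecodeError` unless the detected name is remapped first with something like `raw.decode(python_codec_for("Shift_JIS"))`.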

This is a blocking issue for #128.