amauryfa / lxml

lxml-cffi is a PyPy-friendly port of lxml, based on cffi

Add a distinct check_string_utf8 for unicode strings. #1

Open gsnedders opened 11 years ago

gsnedders commented 11 years ago

The existing check proved to be a performance bottleneck for html5lib, and it can trivially be reimplemented entirely in Python, without ever calling into lxml. (The bytes case should probably be moved into pure Python too.)

Note that I haven't actually checked what libxml2 does here, or whether it has a flag that switches between XML 1.0 and XML 1.1 validation; the Python code here implements the XML 1.0 validation.
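For reference, here is a minimal sketch of what XML 1.0 validation of a unicode string can look like in pure Python. It is not the code from this pull request: the names `_INVALID_XML10_CHAR` and `is_valid_xml10_text` are made up for illustration, and it assumes Python 3 `str` input. The character ranges are those of the `Char` production in the XML 1.0 specification.

```python
import re

# Complement of the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# Anything matched by this pattern is NOT allowed in XML 1.0 text.
_INVALID_XML10_CHAR = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def is_valid_xml10_text(text):
    """Return True if every code point in `text` is an XML 1.0 Char.

    Hypothetical helper: a single regex search over the string, with no
    call into libxml2.
    """
    return _INVALID_XML10_CHAR.search(text) is None


print(is_valid_xml10_text("plain text\n"))   # ordinary text with a newline
print(is_valid_xml10_text("a\x00b"))         # NUL is not an XML 1.0 Char
print(is_valid_xml10_text("\uFFFE"))         # noncharacter outside Char
```

Compiling the pattern once at module level means each call costs a single regex scan, which is the kind of thing that matters when the check runs for every string a parser hands over.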

It's also worth noting that this check is stricter than the bytes version: that one rejects a string only if it contains invalid ASCII characters and ignores everything else.
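To make the difference concrete, here is a rough sketch of an ASCII-only bytes check of the kind described. It is an assumption about the behaviour, not a copy of the actual lxml code, and `bytes_check_ascii_only` is a hypothetical name. It rejects the ASCII control bytes that XML 1.0 disallows and never looks at bytes above 0x7F, so a UTF-8 encoded U+FFFE (which is outside the XML 1.0 `Char` production) slips through.

```python
import re

# ASCII control bytes that XML 1.0 does not allow: everything below 0x20
# except tab (0x09), line feed (0x0A) and carriage return (0x0D).
_INVALID_ASCII_CONTROL = re.compile(rb"[\x00-\x08\x0B\x0C\x0E-\x1F]")


def bytes_check_ascii_only(data):
    """Hypothetical lenient check: only invalid ASCII bytes cause rejection.

    Bytes >= 0x80 (i.e. every multi-byte UTF-8 sequence) are ignored.
    """
    return _INVALID_ASCII_CONTROL.search(data) is None


vertical_tab = b"a\x0bb"                  # invalid ASCII control character
noncharacter = "\uFFFE".encode("utf-8")   # encodes to b'\xef\xbf\xbe'

print(bytes_check_ascii_only(vertical_tab))   # rejected by the ASCII check
print(bytes_check_ascii_only(noncharacter))   # accepted, despite U+FFFE
```

A check that decodes the bytes and tests every code point against the `Char` production would reject the second input as well.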