Closed: kmccurley closed this 1 year ago
This is very annoying. I will have a closer look at this as well.
I have a fix for this: first try reading as UTF-8 in a try...except, and then try reading as ISO-8859-1 with errors='ignore'. This would almost always give the correct result, and the only failure mode would be to drop characters if there are mixed encodings from pdflatex.
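A minimal sketch of that fallback, for illustration only. The helper name read_log_text is hypothetical and this is not the actual code in routes.py; it only assumes the log is read through pathlib.

```python
from pathlib import Path


def read_log_text(path: Path) -> str:
    """Read a LaTeX log file, preferring UTF-8 (hypothetical helper)."""
    try:
        # lualatex writes UTF-8 logs, so this succeeds in the common case.
        return path.read_text(encoding="utf-8")
    except UnicodeDecodeError:
        # pdflatex may emit single-byte encodings; fall back to ISO-8859-1
        # and ignore anything the codec cannot decode.
        return path.read_text(encoding="iso-8859-1", errors="ignore")
```

As far as I know, Python's iso-8859-1 codec maps every byte value to a code point, so in this sketch bytes from a mixed-encoding log would more likely come out as the wrong character than be dropped.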
I suspect that it's possible to form a LaTeX file that uses multiple character encodings with something like \usepackage[T1,T2,TS1]{fontenc} and then end up with a main.log that has multiple encodings and is therefore only readable as binary. This is the same thing as the example in our paper where \write generates mixed character encodings. This would be caused by pdflatex, which can only handle single byte encodings. I'm also pretty sure that there is no well-defined character encoding for T1 or T2 in Python, but since we might get mixed encodings, I don't think it matters.
There are specific characters in the T1/cork encoding that are different from iso-8859-1. An example is the Hungarian letter ű which is 0xB6 in T1/cork, but this byte represents the paragraph symbol ¶ in ISO-8859-1. This byte does not represent the start of any character in UTF-8.
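To make the 0xB6 case concrete, here is a small check using only the standard codecs. It is a sketch: the variable names are made up for illustration, and it does not try to decode T1/cork itself.

```python
raw = b"\xb6"  # the byte T1/cork uses for the Hungarian letter mentioned above

# ISO-8859-1 accepts the byte, but yields the paragraph sign instead.
as_latin1 = raw.decode("iso-8859-1")

# UTF-8 rejects it: a lone 0xB6 is not a valid sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# 0xB6 is 0b10110110; the 10xxxxxx pattern marks a UTF-8 continuation byte,
# which can never start a character.
is_continuation = (raw[0] & 0b11000000) == 0b10000000
```

So a log containing this byte fails the UTF-8 attempt and then decodes "successfully" as ISO-8859-1, just with the wrong glyph.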
This was fixed by first trying to decode as UTF-8, and then trying to decode as iso-8859-1 but handle errors. Some characters might be replaced with placeholders in the error reporting, but given that we will have much better reporting of files, line numbers, and page numbers, this shouldn't be a problem.
I managed to discover a situation in which the input encoding on a paper and the bibtex is UTF-8, but the encoding on main.log was ISO-8859-1 if you use pdflatex but UTF-8 if you use lualatex. The example is IACR/latex/iacrcc/tests/test3 that uses a fairly large bibtex file with UTF-8 in it. This causes the server code to bomb in routes.py where it reads main.log with:
The error message is:
The solution to this is either to never use pdflatex (we're back to that again), or wrap the read_text with a try ... except to make sure that it can read the file. I'm not sure if it matters, but the offending character is ö which is 0xF6 in iso-8859-1. This is one of many examples where ISO-8859-1 cannot be read as utf-8.
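For anyone who wants to reproduce the failure without running pdflatex, the following sketch writes a stand-in main.log containing the 0xF6 byte to a temporary directory and reads it back. The file contents are made up for illustration; this is not the test3 log or the code from routes.py.

```python
import tempfile
from pathlib import Path

failed_at = None
bad_byte = None

with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "main.log"
    # "Erdös" encoded as ISO-8859-1 is b"Erd\xf6s".
    log_path.write_bytes("Erd\u00f6s".encode("iso-8859-1"))

    try:
        log_path.read_text(encoding="utf-8")
    except UnicodeDecodeError as exc:
        # Record where decoding stopped and which byte caused it.
        failed_at = exc.start
        bad_byte = exc.object[exc.start]

    # The same file reads cleanly as ISO-8859-1.
    as_latin1 = log_path.read_text(encoding="iso-8859-1")
```

Here read_text with UTF-8 raises UnicodeDecodeError at the 0xF6 byte, which is why the read needs to be wrapped.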