IACR / latex-submit

Web server to receive uploaded LaTeX and execute it in a docker container.
GNU Affero General Public License v3.0

main.log turned out to be ISO-8859-1 #26

Closed kmccurley closed 1 year ago

kmccurley commented 1 year ago

I managed to discover a situation in which the input encoding of a paper and its bibtex file is UTF-8, but the encoding of main.log is ISO-8859-1 if you use pdflatex and UTF-8 if you use lualatex. The example is IACR/latex/iacrcc/tests/test3, which uses a fairly large bibtex file containing UTF-8. This causes the server code to bomb in routes.py where it reads main.log with:

data['latexlog'] = log_file.read_text(encoding='UTF-8')

The error message is:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 51523: invalid start byte

The solution is either to never use pdflatex (we're back to that again), or to wrap the read_text call in a try ... except to make sure the file can always be read.

I'm not sure if it matters, but the offending character is ö, which is 0xF6 in ISO-8859-1. This is one of many cases where ISO-8859-1 cannot be read as UTF-8.
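For the record, the mismatch is easy to reproduce in a Python shell; this is just an illustration of the decode behavior, not code from the server:

```python
# The byte 0xF6 is 'ö' in ISO-8859-1, but UTF-8 rejects it:
# start bytes above 0xF4 are invalid in UTF-8.
raw = b'\xf6'

print(raw.decode('iso-8859-1'))   # prints 'ö'

try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    print(exc.reason)             # prints 'invalid start byte'
```

That reason string is exactly the one in the traceback above.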

jwbos commented 1 year ago

This is very annoying. I will have a closer look at this as well.

kmccurley commented 1 year ago

I have a fix for this: first try reading as UTF-8 inside a try ... except, and then fall back to reading as ISO-8859-1 with errors='ignore'. This would almost always give the correct result, and the only failure mode would be dropped characters if pdflatex produces mixed encodings.
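The fallback could be sketched like this; `read_log` is a hypothetical helper name for illustration, not the actual code in routes.py:

```python
from pathlib import Path

def read_log(log_file: Path) -> str:
    """Read a LaTeX log file whose encoding depends on the engine.

    lualatex writes UTF-8, so try that first; pdflatex writes a
    single-byte encoding, so fall back to ISO-8859-1 and ignore any
    bytes that still fail to map (only possible with mixed encodings).
    """
    try:
        return log_file.read_text(encoding='UTF-8')
    except UnicodeDecodeError:
        return log_file.read_text(encoding='ISO-8859-1', errors='ignore')
```

A file that is valid UTF-8 is returned unchanged, so the fallback only kicks in for pdflatex-style logs.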

I suspect it's possible to write a LaTeX file that uses multiple character encodings, with something like \usepackage[T1,T2,TS1]{fontenc}, and end up with a main.log that mixes encodings and is therefore only readable as binary. This is the same phenomenon as the example in our paper where \write generates mixed character encodings. It would be caused by pdflatex, which can only handle single-byte encodings. I'm also pretty sure that there is no well-defined character encoding for T1 or T2 in Python, but since we might get mixed encodings anyway, I don't think it matters.

kmccurley commented 1 year ago

There are specific characters in the T1/cork encoding that differ from ISO-8859-1. An example is the Hungarian letter ű, which is 0xB6 in T1/cork, but that byte represents the paragraph symbol ¶ in ISO-8859-1. The byte does not start any valid character in UTF-8.
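This can be checked directly (Python has no codec for T1/cork, so the ű reading is unrecoverable; this snippet is only an illustration):

```python
# 0xB6 is ű in T1/cork, but Python can only interpret the byte via
# a codec it knows: ISO-8859-1 yields the pilcrow, and UTF-8 fails
# because 0xB6 is a continuation byte, never a start byte.
raw = b'\xb6'

print(raw.decode('iso-8859-1'))   # prints '¶', not 'ű'

try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    print(exc.reason)             # prints 'invalid start byte'
```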

kmccurley commented 1 year ago

This was fixed by first trying to decode as UTF-8, and then falling back to decoding as ISO-8859-1 while handling errors. Some characters might be replaced with placeholders in the error reporting, but given that we will have much better reporting of files, line numbers, and page numbers, this shouldn't be a problem.