RagnarGrootKoerkamp / BAPCtools

Tools for developing ICPC-style programming contest problems.
GNU General Public License v3.0

`bt pdf` encoding issue #367

Closed chrisgdt closed 3 months ago

chrisgdt commented 3 months ago

Hello!

After updating BAPCtools, I compiled my contest and encountered an error:

contest.fr.pdf: Building PDF for language fr
Traceback (most recent call last):
  File "[...]/BAPCtools/bin/tools.py", line 1050, in <module>
    main()
  File "[...]/BAPCtools/bin/tools.py", line 1046, in main
    run_parsed_arguments(parser.parse_args())
  File "[...]/BAPCtools/bin/tools.py", line 965, in run_parsed_arguments
    success &= latex.build_contest_pdfs(contest, problems, tmpdir, web=config.args.web)
  File "[...]/BAPCtools/bin/latex.py", line 430, in build_contest_pdfs
    [build_contest_pdf(contest, problems, tmpdir, lang, solutions, web) for lang in languages]
  File "[...]/BAPCtools/bin/latex.py", line 430, in <listcomp>
    [build_contest_pdf(contest, problems, tmpdir, lang, solutions, web) for lang in languages]
  File "[...]/BAPCtools/bin/latex.py", line 400, in build_contest_pdf
    return build_latex_pdf(builddir, Path(main_file), language, bar)
  File "[...]/BAPCtools/bin/latex.py", line 199, in build_latex_pdf
    ret.out = outfile.read_text(encoding=None)
  File "/usr/lib/python3.10/pathlib.py", line 1135, in read_text
    return f.read()
  File "/usr/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 19677: invalid continuation byte

I investigated and found that this issue first appeared around the commit "fix hang", two weeks ago. More precisely, it comes from line 199 of latex.py.

The PDF fails to build with this error whenever the contest contains a non-ASCII character such as é or è (common in French). I found a workaround, which is to change the encoding used around line 199 from utf-8 to something else, e.g.,

ret.err = errfile.read_text(encoding='Latin-1')  # not used
ret.out = outfile.read_text(encoding='Latin-1')

but since I am not sure whether changing the encoding is a good idea, or whether the root cause lies somewhere else, I prefer to open an issue rather than a PR to discuss it. Note that replacing every é with \'e does not work either.
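The error itself can be reproduced without latexmk. A minimal sketch, assuming the log contains "è" stored as the single byte 0xe8 (the sample bytes below are made up for illustration, not taken from the real log):

```python
# Hypothetical log fragment: "problème" with è stored as the single byte 0xe8.
raw = b"probl\xe8me"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    # In UTF-8, 0xe8 starts a multi-byte sequence, but the next byte ("m")
    # is not a continuation byte, so decoding fails.
    utf8_ok = False
    print(exc)

# Latin-1 maps every byte to a code point, so this decode cannot fail.
latin1_text = raw.decode("latin-1")
print(latin1_text)
```

Since Latin-1 accepts any byte sequence, a successful decode does not prove that Latin-1 is the encoding that was actually used.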

RagnarGrootKoerkamp commented 3 months ago

cc @mzuenni

So it seems that latexmk encodes its terminal output in Latin-1 instead of UTF-8. Weird/annoying. A simple fix is probably to change read_text to read_bytes, and do the conversion to string 'manually' in Python, where we can catch errors or try multiple encodings. Could you try that?
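A rough sketch of that idea, reading the raw bytes and trying several encodings in turn. The helper name `read_log`, the file name, and the list of encodings are assumptions for illustration and not part of BAPCtools:

```python
from pathlib import Path
import tempfile


def read_log(path, encodings=("utf-8", "latin-1")):
    """Hypothetical helper: decode a log file, trying each encoding in turn."""
    data = Path(path).read_bytes()
    for enc in encodings:
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    # Only reached if every listed encoding failed (Latin-1 itself never
    # fails): decode anyway and mark undecodable bytes with U+FFFD.
    return data.decode("utf-8", errors="replace")


# Usage on a made-up log file containing one non-UTF-8 byte.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "latexmk.out"  # hypothetical file name
    log.write_bytes(b"probl\xe8me")
    text = read_log(log)

print(text)
```

The order of the encodings matters: UTF-8 is strict and should come first, while Latin-1 accepts everything and therefore only makes sense as the last candidate.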

What platform are you running on? Just some Linux?

I am curious where the non-utf8 encoding comes from. You could try doing everything as utf-8 always: https://stackoverflow.com/a/1253024/2716069

chrisgdt commented 3 months ago

Yes, I forgot to mention my configuration, my apologies:

mzuenni commented 3 months ago

It seems that the output printed by pdflatex is not actually encoded in Latin-1 but in whatever encoding LaTeX is using internally, see https://tex.stackexchange.com/questions/131238/what-controls-the-encoding-of-the-latex-log-file-and-how-to-change-it. Unfortunately, that depends on things like the LaTeX font in use at the place where the error occurred...

Anyway, back on topic. Why does Latin-1 seem to fix this? Well... the é or è likely appears in regular text and not in something like mathcal, and in such a place you likely use a T1 font, and luckily enough T1 agrees with Latin-1 for most characters. However, it's not really the right encoding... People not using T1 fonts would need a different fix. And in fact the right encoding doesn't even exist, because something like mathcal does not use a "real" encoding at all... and if multiple fonts are used, the log file can contain errors for all of them. So I guess the fix here is to ignore encoding errors and live with weird-looking error messages in such places.
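A minimal sketch of what ignoring encoding errors could look like with pathlib, using a made-up log line and a hypothetical file name. This only illustrates the idea and may differ from the fix that ends up in latex.py:

```python
from pathlib import Path
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    outfile = Path(tmp) / "latexmk.out"  # hypothetical file name
    # Made-up log line containing one byte that is not valid UTF-8.
    outfile.write_bytes(b"! Error near probl\xe8me\n")

    # errors="ignore" silently drops bytes that cannot be decoded.
    ignored = outfile.read_text(encoding="utf-8", errors="ignore")
    # errors="replace" keeps a visible U+FFFD marker in their place.
    replaced = outfile.read_text(encoding="utf-8", errors="replace")

print(ignored)
print(replaced)
```

Both variants never raise UnicodeDecodeError. "replace" makes it obvious in the error message that a character was lost, while "ignore" keeps the output free of replacement markers.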

mzuenni commented 3 months ago

fixed with 9ca3e8b ?

chrisgdt commented 3 months ago

> fixed with 9ca3e8b ?

I confirm, thank you!