UTF8 encoding problems in minimal Ubuntu for CI

cpitclaudel / alectryon

A collection of tools for writing technical documents that mix Coq code and prose.

MIT License

UTF8 encoding problems in minimal Ubuntu for CI #11

Closed (palmskog closed 3 years ago)

palmskog commented 3 years ago

I set up a custom Docker container with Ubuntu (Dockerfile) to be able to run Alectryon with coqdoc on every master branch push for a Coq project. However, I quickly ran into UTF8 encoding issues like this:

'ascii' codec can't encode character '\u2191' in position 6443: ordinal not in range(128)

Note that \u2191 is the "uparrow" Unicode symbol, so the problem came from the use of HEADER in alectryon/html.py.

Even after reading up on Python3 encoding issues, I couldn't figure out exactly where there might be a .encode("utf-8") missing, so I opted to simply remove all UTF8 from all output by Alectryon and coqdoc. However, since the --utf8 option to coqdoc is hardcoded, I had to use a fork of Alectryon (commit). Also, I believe this means the build will break anytime anyone uses a UTF8 character in a Coq file.

Is there a better way to solve this issue? I theorize that one more complete workaround would be to set up a locale (e.g., en_US.UTF8) in the Docker container, but this seems like a cumbersome thing to do in every Docker image where one wants to run Alectryon.

cpitclaudel commented 3 years ago

Thanks a lot for the report. Do you have a complete backtrace? (you can get one by passing --traceback to Alectryon). The reason I'm asking is that Alectryon doesn't really print much to stdout, so this error seems to mean that programs in that docker container can't even write files that contain non-ascii characters.

I think the solution is here: https://stackoverflow.com/questions/52065842/python-docker-ascii-codec-cant-encode-character (ignore the incorrect duplicate banner)

I wonder if this is the same problem as the one that forced @jfehrle to catch encoding exceptions in https://github.com/coq/coq/pull/13564/files#diff-99858e5d76716d34bcaf9ad38b8d67f05a7a8849e7969faa8b2318805d94f223R219 .

> Also, I believe this means the build will break anytime anyone uses a UTF8 character in a Coq file. […] I theorize that one more complete workaround would be to set up a locale (e.g., en_US.UTF8) in the Docker container, but this seems like a cumbersome thing to do

I think that's the right solution, precisely because of your point on non-ascii characters in Coq files. Fortunately it looks easy (ENV LANG en_US.utf8); once we confirm that this works, I'll add a note in the readme.
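
To check whether a given container is affected before and after changing LANG, a small diagnostic along these lines may help. It is only a sketch: the function name describe_text_encoding is made up for illustration and is not part of Alectryon. It relies on the fact that, on the Python 3 versions current when this thread was written, open() without an encoding= argument falls back to locale.getpreferredencoding(False).

```python
import locale
import os
import sys


def describe_text_encoding():
    """Collect the settings that decide how Python encodes text by default.

    Hypothetical helper for illustration; not part of Alectryon.
    """
    return {
        # Encoding used by open() when no encoding= argument is given.
        "open_default": locale.getpreferredencoding(False),
        # Encoding of standard output (None if stdout was replaced by an
        # object without an encoding attribute).
        "stdout": getattr(sys.stdout, "encoding", None),
        # Environment variables that influence the two values above.
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "PYTHONIOENCODING": os.environ.get("PYTHONIOENCODING"),
    }


report = describe_text_encoding()
for name, value in report.items():
    print(f"{name}: {value}")
```

If open_default reports an ASCII-only encoding, writing a file that contains non-ASCII characters will most likely fail in the way described in this issue.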

palmskog commented 3 years ago

Complete command and backtrace from inside the container:

user@eaac613822d7:~/casper-cbc-proofs$ ~/alectryon/alectryon.py --frontend coqdoc --webpage-style windowed --traceback -Q . CasperCBC --output-directory tmp Lib/Classes.v
Traceback (most recent call last):
  File "/home/user/alectryon/alectryon.py", line 26, in <module>
    main()
  File "/home/user/alectryon/alectryon/cli.py", line 631, in main
    process_pipelines(args)
  File "/home/user/alectryon/alectryon/cli.py", line 623, in process_pipelines
    raise e
  File "/home/user/alectryon/alectryon/cli.py", line 620, in process_pipelines
    state = call_pipeline_step(step, state, ctx)
  File "/home/user/alectryon/alectryon/cli.py", line 589, in call_pipeline_step
    return step(state, **{p: ctx[p] for p in params})
  File "/home/user/alectryon/alectryon/cli.py", line 326, in <lambda>
    write_output(ext, contents, fname, output, output_directory)
  File "/home/user/alectryon/alectryon/cli.py", line 322, in write_output
    f.write(contents)
UnicodeEncodeError: 'ascii' codec can't encode character '\u2191' in position 6441: ordinal not in range(128)
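
The failure in this traceback can be reproduced outside the container with a few lines of Python. The sketch below passes encoding="ascii" explicitly to stand in for the container's default under LANG=C (an assumption about what the container picked); the helper names and the file name out.html are invented for the example.

```python
import os
import tempfile


def write_text(path, contents, encoding):
    """Write contents to path as text, using the given encoding."""
    with open(path, mode="w", encoding=encoding) as f:
        f.write(contents)


def try_write(contents, encoding):
    """Return None on success, or the UnicodeEncodeError that was raised."""
    with tempfile.TemporaryDirectory() as tmp:
        try:
            write_text(os.path.join(tmp, "out.html"), contents, encoding)
        except UnicodeEncodeError as err:
            return err
    return None


# "\u2191" is the up-arrow character from the error message above.
error = try_write("fold \u2191", "ascii")
if error is not None:
    print(f"{type(error).__name__}: {error}")
```

The exception is raised by f.write itself, which matches the last frame of the traceback: the text is fine as a Python string and only fails when it has to be turned into bytes for the file.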

palmskog commented 3 years ago

@cpitclaudel it actually seems as though the following diff for cli.py solves the issue completely, even with LANG=C:

@@ -318,7 +318,7 @@ def write_output(ext, contents, fname, output, output_directory):
     else:
         if not output:
             output = os.path.join(output_directory, strip_extension(fname) + ext)
-        with open(output, mode="w") as f:
+        with open(output, mode="w", encoding="utf-8") as f:
             f.write(contents)

 def write_file(ext):

Since the whole project is supposed to be UTF8 anyway, would a PR with this change be welcome? To me, this would be a better fix than remembering to change LANG everywhere.
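
The effect of the one-line change can be seen in isolation. The following is a minimal sketch rather than Alectryon's actual code: write_utf8 is a simplified, hypothetical stand-in for write_output, and the file name is arbitrary. With the encoding given explicitly, the bytes on disk no longer depend on LANG or on any other locale setting.

```python
import os
import tempfile


def write_utf8(contents, path):
    """Write contents as UTF-8 text, independent of the process locale.

    Simplified, hypothetical stand-in for write_output in cli.py.
    """
    with open(path, mode="w", encoding="utf-8") as f:
        f.write(contents)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Classes.html")
    write_utf8("\u2191", path)
    # Read the file back as raw bytes to see what was actually stored.
    with open(path, mode="rb") as f:
        raw = f.read()

print(raw)  # the UTF-8 encoding of U+2191: b'\xe2\x86\x91'
```

Any code that later reads these files should pass encoding="utf-8" to open() as well, otherwise the same locale dependence reappears on the reading side.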

jfehrle commented 3 years ago

I was looking at export PYTHONIOENCODING=utf8 which is described here: https://stackoverflow.com/questions/2276200/changing-default-encoding-of-python. That could be added to the makefile and Dune. (I just did the bandaid fix of catching the encoding exception because my change is temporary.)

I recall thinking that maybe it should be "utf-8" but didn't figure out if that's correct.

Jim
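
On the "utf8" versus "utf-8" question: Python's codec registry treats these spellings as aliases of the same codec, which can be checked with codecs.lookup. The helper name canonical_codec_name below is made up for the example. Separately, according to the Python documentation, PYTHONIOENCODING only overrides the encoding of stdin, stdout and stderr, so it would probably not change the behaviour of the open() call in the traceback above.

```python
import codecs


def canonical_codec_name(label):
    """Return the canonical name Python uses for an encoding label."""
    return codecs.lookup(label).name


# Several common spellings of the same encoding.
labels = ("utf8", "utf-8", "UTF-8", "utf_8")
names = {label: canonical_codec_name(label) for label in labels}

for label, name in names.items():
    print(f"{label!r} -> {name!r}")
```

All four labels resolve to the same codec, so either spelling should work wherever Python accepts an encoding name.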

jfehrle commented 3 years ago

> would a PR with this change be welcome?

I think that would be a good idea.
