Failure to decode cut-off UTF-8

Janno commented 10 months ago

I can't post the original example because the sources involved are not open source but I suspect the problem is easy enough to find from this part here:

    statement = b'ome x.\n      Axiom to_of_N : \xe2\x88\x80 n x, of_N n = Some x \xe2\x86'
    statement_str = ' x) = S'
    ty = 'lib'
    types = ('lib',)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 54-55: unexpected end of data

The code around this excerpt looks like this:

      Axiom of_to_N : ∀ x, of_N (to_N x) = Some x.
      Axiom to_of_N : ∀ n x, of_N n = Some x → to_N x = n.

The UTF-8 representation of → is 0xE2 0x86 0x92.

I suspect something is splitting raw bytes instead of unicode code points.

I cannot easily reproduce the problem with smaller files so there might be some buffering/chunking going on? Maybe a chunk's border lands between the three bytes required for →?

JasonGross commented 10 months ago

I suspect something is splitting raw bytes instead of unicode code points.

As far as I'm aware, Coq error message locations and coqc -time character counts and .glob file locations all use raw bytes rather than Unicode code points.

The ty = 'lib' bit suggests that this is the code that is processing .glob files to absolutize Require statements. Probably you're looking at this line? https://github.com/JasonGross/coq-tools/blob/268f55e5ba380a6bfcc3dfad246dcc656db8642a/coq_tools/import_util.py#L610

Unfortunately, what you posted isn't enough for me to determine what's going wrong, but I can direct you in debugging:

If you're trying to find a smaller file that reproduces this bug in the minimizer, you'll probably need to keep the Require statements and the Unicode characters and not much else. The file still needs to be valid Coq (we need to get a glob file from coqc), but you can replace the error you're trying to minimize with Fail Check Type. or similar. And while the Require statements may need to stay the same, the contents of the files being Required can be empty.
To dig into this by hand, take a look at the generated .glob file, in particular lines with lib. The Rnnn:mmm at the beginning of the line are supposed to be byte ranges, I believe, and this chunk of Python code is supposed to be splitting statements around Require and libraries, so for it to identify an Axiom line suggests the counts are way off (more so the further this line is from Require)

Janno commented 10 months ago

https://github.com/JasonGross/coq-tools/blob/268f55e5ba380a6bfcc3dfad246dcc656db8642a/coq_tools/import_util.py#L610

This is indeed the correct line. Sorry for not including the backtrace. I didn't realize that the tool actually outputs the backtrace without all the debug info (including the input file) at the very end. So here's the backtrace (for a different unicde symbol):

Traceback (most recent call last):
  File "/home/janno/src/coq-tools/find-bug.py", line 5, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/janno/src/coq-tools/coq_tools/find_bug.py", line 1479, in main
    minimize_file(output_file_name, **env)
  File "/home/janno/src/coq-tools/coq_tools/find_bug.py", line 1149, in minimize_file
    try_split_requires(output_file_name, **env)
  File "/home/janno/src/coq-tools/coq_tools/find_bug.py", line 1022, in try_split_requires
    new_contents = ''.join(v[0].decode('utf-8') for v in annotated_contents)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/janno/src/coq-tools/coq_tools/find_bug.py", line 1022, in <genexpr>
    new_contents = ''.join(v[0].decode('utf-8') for v in annotated_contents)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/janno/src/coq-tools/coq_tools/import_util.py", line 610, in split_requires_of_statements
    statement_str = statement.decode('utf-8')
                    ^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 0: unexpected end of data

As far as I'm aware, Coq error message locations and coqc -time character counts and .glob file locations all use raw bytes rather than Unicode code points.

That makes sense, I suppose. But it seems to me that annotated_contants contains pretty chaotic splits so it is definitely not safe to call decode on the items inside. At the very start of the file they seem to somewhat correlate to statements but it quickly drifts off. The backtrace above corresponds to the following entry in annotated_contents:

, (b'      split. {\n        intros r; funext; case; [|done..]; intros c.\n        apply (f_', ())
, (b'equal (', ())
, (b'\xce', ())
, (b'\xbb r, r ', ())

which comes from these lines in the input file:

      split. {
        intros r; funext; case; [|done..]; intros c.
        apply (f_equal (λ r, r c) (iso.f_g _ cpu_isa_regs.fun_iso_laws (r `comp` CPU_REG))).
      }

(Note the λ unicode symbol.)

The splits seem to first break around Require lines:

, (b'Require Import bedrock.hw_models.cpu bedrock.hw_models.cpu.aa', ((42, 60, 'bedrock.hw_models.cpu.aarch64_fault_regs', '<>', 'lib'), (38, 41, 'bedrock.hw_models.cpu', '<>', 'lib'))), (b'r', ())
, (b'ch64_fault_regs.\nRequire Import bedrock.nova_inte', ((43, 48, 'bedrock.nova_interface.model', '<>', 'lib'),)), (b'r', ())
, (b'face.model.\nRequire Import bedrock.prelude.axioms.', ((43, 49, 'bedrock.prelude.axioms.funext', '<>', 'lib'),)), (b'fu', ())
, (b'next.\n\nRequire Import bedrock.base_logic.l', ((15, 41, 'bedrock.base_logic.lib.sts', '<>', 'lib'),)), (b'ib', ())
, (b'.sts.\n\nInfix "`union`" := union (at lev', ())

which represents (modulo some rewriting of imports)

From bedrock.hw_models Require Import cpu aarch64_fault_regs.
From bedrock.nova_interface Require Import model.
From bedrock.prelude.axioms Require Import funext.

Require Import bedrock.base_logic.lib.sts.

Infix "`union`" := union (at level 12).

It seems that the lib annotations pointing into the middle of the module name offset all later splits. Note how the Infix line is broken up in the middle of level.

What's really weird is the following: the output file produced contains the original formulation of the import:

From bedrock.hw_models Require Import cpu aarch64_fault_regs.

whereas annotated_contents (see above) seems to operate on

Require Import bedrock.hw_models.cpu bedrock.hw_models.cpu.aarch64.

It seems as if the minimizer transformed a rewriting on the From .. Require statement but then used the offsets from the original glob file. Is that possible?

Janno commented 10 months ago

I checked the output again and I now see this here which might explain why the .glob file is out of date:

make -k -f Makefilessw5xofk.coq KEEP_ERROR=1 fmdeps/cpp2v/tests/bug5_minimal.glob
make: /bin/sh: Argument list too long
COQDEP VFILES
make: /bin/sh: Argument list too long
make: *** No rule to make target 'fmdeps/cpp2v/tests/bug5_minimal.glob'.

Janno commented 10 months ago

The solution seems to be passing --no-deps.

It might make sense to turn a failure to generate the .glob file into a hard error given the drift between locations and file content that happens otherwise.

JasonGross commented 10 months ago

The make: /bin/sh: Argument list too long might be a Coq bug, if you can figure out how to reproduce that a fix can probably be applied as in https://github.com/coq/coq/pull/17697

It seems as if the minimizer transformed a rewriting on the From .. Require statement but then used the offsets from the original glob file. Is that possible?

Yes, when reading in files we need to go from From ... Require ... to Require <absolute libname> to deal with the fact that we might be passing different flags to coqc (consider inlining things from user-contrib). .glob files and offsets are processed in a single pass, in reverse order, from the end of the file to the beginning, to account for this change; changes to the text at later positions will not break earlier positions.

In general the minimizer tries to be robust to unexpected failures (including outdated .glob files; I have some checks to ensure that the decoded statement is in fact a From ... Require ... or w/e, with the correct library IIRC, before blindly transforming it), so I think the thing to do here is just turn this error into a warning about possibly outdated glob files. The result should just be that some dependencies cannot be inlined, not that the resulting file is incorrect.

I can also make the minimizer emit a suggestion about --no-deps when Argument list too long and No rule to make target '.*.glob' both appear in the output of make.

JasonGross / coq-tools

Failure to decode cut-off UTF-8 #176