Closed Janno closed 10 months ago
I suspect something is splitting raw bytes instead of unicode code points.
As far as I'm aware, Coq error message locations and coqc -time
character counts and .glob file locations all use raw bytes rather than Unicode code points.
The ty = 'lib'
bit suggests that this is the code that is processing .glob files to absolutize Require
statements. Probably you're looking at this line?
https://github.com/JasonGross/coq-tools/blob/268f55e5ba380a6bfcc3dfad246dcc656db8642a/coq_tools/import_util.py#L610
Unfortunately, what you posted isn't enough for me to determine what's going wrong, but I can direct you in debugging:
Require
statements and the Unicode characters and not much else. The file still needs to be valid Coq (we need to get a glob file from coqc), but you can replace the error you're trying to minimize with Fail Check Type.
or similar. And while the Require
statements may need to stay the same, the contents of the files being Require
d can be empty.lib
. The Rnnn:mmm
at the beginning of the line are supposed to be byte ranges, I believe, and this chunk of Python code is supposed to be splitting statements around Require
and libraries, so for it to identify an Axiom
line suggests the counts are way off (more so the further this line is from Require
)This is indeed the correct line. Sorry for not including the backtrace. I didn't realize that the tool actually outputs the backtrace without all the debug info (including the input file) at the very end. So here's the backtrace (for a different unicde symbol):
Traceback (most recent call last):
File "/home/janno/src/coq-tools/find-bug.py", line 5, in <module>
sys.exit(main())
^^^^^^
File "/home/janno/src/coq-tools/coq_tools/find_bug.py", line 1479, in main
minimize_file(output_file_name, **env)
File "/home/janno/src/coq-tools/coq_tools/find_bug.py", line 1149, in minimize_file
try_split_requires(output_file_name, **env)
File "/home/janno/src/coq-tools/coq_tools/find_bug.py", line 1022, in try_split_requires
new_contents = ''.join(v[0].decode('utf-8') for v in annotated_contents)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/janno/src/coq-tools/coq_tools/find_bug.py", line 1022, in <genexpr>
new_contents = ''.join(v[0].decode('utf-8') for v in annotated_contents)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/janno/src/coq-tools/coq_tools/import_util.py", line 610, in split_requires_of_statements
statement_str = statement.decode('utf-8')
^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 0: unexpected end of data
As far as I'm aware, Coq error message locations and coqc -time character counts and .glob file locations all use raw bytes rather than Unicode code points.
That makes sense, I suppose. But it seems to me that annotated_contants
contains pretty chaotic splits so it is definitely not safe to call decode
on the items inside. At the very start of the file they seem to somewhat correlate to statements but it quickly drifts off. The backtrace above corresponds to the following entry in annotated_contents
:
, (b' split. {\n intros r; funext; case; [|done..]; intros c.\n apply (f_', ())
, (b'equal (', ())
, (b'\xce', ())
, (b'\xbb r, r ', ())
which comes from these lines in the input file:
split. {
intros r; funext; case; [|done..]; intros c.
apply (f_equal (λ r, r c) (iso.f_g _ cpu_isa_regs.fun_iso_laws (r `comp` CPU_REG))).
}
(Note the λ
unicode symbol.)
The splits seem to first break around Require
lines:
, (b'Require Import bedrock.hw_models.cpu bedrock.hw_models.cpu.aa', ((42, 60, 'bedrock.hw_models.cpu.aarch64_fault_regs', '<>', 'lib'), (38, 41, 'bedrock.hw_models.cpu', '<>', 'lib'))), (b'r', ())
, (b'ch64_fault_regs.\nRequire Import bedrock.nova_inte', ((43, 48, 'bedrock.nova_interface.model', '<>', 'lib'),)), (b'r', ())
, (b'face.model.\nRequire Import bedrock.prelude.axioms.', ((43, 49, 'bedrock.prelude.axioms.funext', '<>', 'lib'),)), (b'fu', ())
, (b'next.\n\nRequire Import bedrock.base_logic.l', ((15, 41, 'bedrock.base_logic.lib.sts', '<>', 'lib'),)), (b'ib', ())
, (b'.sts.\n\nInfix "`union`" := union (at lev', ())
which represents (modulo some rewriting of imports)
From bedrock.hw_models Require Import cpu aarch64_fault_regs.
From bedrock.nova_interface Require Import model.
From bedrock.prelude.axioms Require Import funext.
Require Import bedrock.base_logic.lib.sts.
Infix "`union`" := union (at level 12).
It seems that the lib
annotations pointing into the middle of the module name offset all later splits. Note how the Infix
line is broken up in the middle of level
.
What's really weird is the following: the output file produced contains the original formulation of the import:
From bedrock.hw_models Require Import cpu aarch64_fault_regs.
whereas annotated_contents
(see above) seems to operate on
Require Import bedrock.hw_models.cpu bedrock.hw_models.cpu.aarch64.
It seems as if the minimizer transformed a rewriting on the From .. Require
statement but then used the offsets from the original glob file. Is that possible?
I checked the output again and I now see this here which might explain why the .glob
file is out of date:
make -k -f Makefilessw5xofk.coq KEEP_ERROR=1 fmdeps/cpp2v/tests/bug5_minimal.glob
make: /bin/sh: Argument list too long
COQDEP VFILES
make: /bin/sh: Argument list too long
make: *** No rule to make target 'fmdeps/cpp2v/tests/bug5_minimal.glob'.
The solution seems to be passing --no-deps
.
It might make sense to turn a failure to generate the .glob
file into a hard error given the drift between locations and file content that happens otherwise.
The make: /bin/sh: Argument list too long
might be a Coq bug, if you can figure out how to reproduce that a fix can probably be applied as in https://github.com/coq/coq/pull/17697
It seems as if the minimizer transformed a rewriting on the
From .. Require
statement but then used the offsets from the original glob file. Is that possible?
Yes, when reading in files we need to go from From ... Require ...
to Require <absolute libname>
to deal with the fact that we might be passing different flags to coqc
(consider inlining things from user-contrib). .glob files and offsets are processed in a single pass, in reverse order, from the end of the file to the beginning, to account for this change; changes to the text at later positions will not break earlier positions.
In general the minimizer tries to be robust to unexpected failures (including outdated .glob files; I have some checks to ensure that the decoded statement is in fact a From ... Require ...
or w/e, with the correct library IIRC, before blindly transforming it), so I think the thing to do here is just turn this error into a warning about possibly outdated glob files. The result should just be that some dependencies cannot be inlined, not that the resulting file is incorrect.
I can also make the minimizer emit a suggestion about --no-deps
when Argument list too long
and No rule to make target '.*.glob'
both appear in the output of make
.
I can't post the original example because the sources involved are not open source but I suspect the problem is easy enough to find from this part here:
The code around this excerpt looks like this:
The UTF-8 representation of
→
is0xE2 0x86 0x92
.I suspect something is splitting raw bytes instead of unicode code points.
I cannot easily reproduce the problem with smaller files so there might be some buffering/chunking going on? Maybe a chunk's border lands between the three bytes required for
→
?