Closed brebs-gh closed 2 years ago
Nasty :cry: @kamahen, any clue. Considering it fails silently, a failing re_matchsub/3 is the most likely. I wonder whether it can be locale related or how Unicode has been configured for pcre on this machine.
I'm presuming that even though it's a S390, it's using ASCII (and Unicode). The PCRE2 documentation for EBCDIC says (https://www.pcre.org/current/doc/html/pcre2pattern.html#SEC3) that with EBCDIC there are no code points greater than 255, so I would presume that many other tests would fail if this were EBCDIC.
For the "how Unicode is configured" hypothesis, we could try this:
re_config(unicode_version(V))
which gives 13.0.0
on my Chromebook and 12.1.0
on my Ubuntu 20.04.4 system (those are all the systems I have available).
The obvious thing to do is to add some debug output and run ctest -V -R pcre
:
diff --git a/test_pcre.pl b/test_pcre.pl
index a508607..35ecc69 100644
--- a/test_pcre.pl
+++ b/test_pcre.pl
@@ -822,6 +822,10 @@ re_test(write, Result == re_match{0:"хелло 蛇"}) :-
% Make sure that write_pcre() works witn non-ASCII.
re_compile('хелло 蛇', Re, []),
with_output_to(string(ReStr), write(current_output, Re)),
+ format('***WRITE_REGEX1(~q)~n', [ReStr]),
+ format('***WRITE_REGEX2(~q)~n', [Re]),
+ re_compile('Hello', Re2, []),
+ format('***WRITE_REGEX3(~q)~n', [Re2]),
re_matchsub(Re, ReStr, Result),
assertion(re_match("^<regex>\\(.*\\)$", ReStr)).
On my system, this outputs:
***WRITE_REGEX1("<regex>(0x57bd8f75d0a0, /хелло 蛇/)")
***WRITE_REGEX2(<regex>(0x57bd8f75d0a0, /хелло 蛇/))
***WRITE_REGEX3(<regex>(0x57bd8f75caf0, /Hello/))
Hope it is easy enough to get the result with these debug outputs. And, it may be an s390, but it is running Alpine Linux :smile:
Besides the unicode version, it would be useful to get the PCRE2 version:
``re_config(version(V))
.
This gives 10.34 2019-11-21
on Ubuntu 20.04.0 and 10.36 2020-12-04
on my Chromebook (Debian 11).
This isn't an important test (I put it in when I fixed the bug that could show two different regex blobs as equal when they weren't), but it implies some worrisome things deeper in the system. I'm surprised that other Unicode-related tests didn't also fail.
One thing that's different about this test is that it has Cyrillic letters in it (the others have Kanji, accented Latin, and a few special symbols). If the locale is 8859-5, then code points above 127 are used for Cyrillic -- could that be an issue here (together with how the format %Ws
works)? In Unicode, Cyrillic is in the range U+0400–U+045F.
It also would be useful to know the output of the locale
command.
s390x output: https://gitlab.alpinelinux.org/brebs/aports/-/jobs/700651#L3633
Running locale
LANG=
LC_CTYPE=POSIX
LC_NUMERIC=POSIX
LC_TIME=POSIX
LC_COLLATE=POSIX
LC_MONETARY=POSIX
LC_MESSAGES=POSIX
LC_ALL=
***WRITE_REGEX1("<regex>(0x3ffbcc72b70, //)")
72: ***WRITE_REGEX2(<regex>(0x3ffbcc72b70, //))
72: ***WRITE_REGEX3(<regex>(0x3ffbcc72c50, //))
For comparison: x86_64 output: https://gitlab.alpinelinux.org/brebs/aports/-/jobs/700649#L3634
***WRITE_REGEX1("<regex>(0x7f4004f711c0, /хелло 蛇/)")
72: ***WRITE_REGEX2(<regex>(0x7f4004f711c0, /хелло 蛇/))
72: ***WRITE_REGEX3(<regex>(0x7f4004f715b0, /Hello/))
In Alpine Edge running swi-prolog 8.5.10:
?- re_config(version(V)).
V = '10.39 2021-10-29'.
It's interesting that the wchar strings are empty, even when they're pure ascii. I tried reproducing the problem by setting the local environment variables; that broke a few other tests (swipl:files
, jpl:java_in_prolog
) and got a build warning which I don't understand because it doesn't seem to have anything multibyte at those lines:
Warning: /home/peter/src/swipl-devel/build/home/library/aggregate.pl:600:
Warning: '/home/peter/src/swipl-devel/build/home/library/aggregate.pl':606:34: Illegal multibyte Sequence
Warning: /home/peter/src/swipl-devel/packages/jpl/test_jpl.pl:1507:
Warning: '/home/peter/src/swipl-devel/packages/jpl/test_jpl.pl':1510:15: Illegal multibyte Sequence
I also found this in https://en.wikipedia.org/wiki/Wide_character#C/C++ : "The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers."
So, it could be useful to find out what this program outputs (it gives 4 on my system):
#include <stdio.h>
#include <stddef.h>
int main(void) {
printf("%lu\n", sizeof (wchar_t));
return 0;
}
Here's the output from the test swipl:files
:
15: Test command: /home/peter/src/swipl-devel/build/src/swipl "-f" "none" "--no-packs" "--on-error=status" "-q" "/home/peter/src/swipl-devel/src/test.pl" "--no-core" "files"
15: Test timeout computed to be: 10000000
15: Running scripts from files ..............................
15: ERROR: /home/peter/src/swipl-devel/src/Tests/files/test_glob.pl:135:
15: test cyrillic is subject to occurs check (STO):
15: ERROR: Finite trees (error checking): received error: file_directory_name/2: Cannot represent due to `encoding' (Cannot represent char U0461 using current locale encoding)
15: ERROR: Rational trees: received error: file_directory_name/2: Cannot represent due to `encoding' (Cannot represent char U045e using current locale encoding)
15: ERROR: Finite trees: received error: file_directory_name/2: Cannot represent due to `encoding' (Cannot represent char U0400 using current locale encoding)
15: ERROR: /home/peter/src/swipl-devel/src/Tests/files/test_glob.pl:138:
15: test cyrillic is subject to occurs check (STO):
15: ERROR: Finite trees (error checking): received error: file_directory_name/2: Cannot represent due to `encoding' (Cannot represent char U0447 using current locale encoding)
15: ERROR: Rational trees: received error: file_directory_name/2: Cannot represent due to `encoding' (Cannot represent char U0459 using current locale encoding)
15: ERROR: Finite trees: received error: file_directory_name/2: Cannot represent due to `encoding' (Cannot represent char U04ac using current locale encoding)
15: ERROR: /home/peter/src/swipl-devel/src/Tests/files/test_glob.pl:142:
15: test cyrillic is subject to occurs check (STO):
15: ERROR: Finite trees (error checking): received error: file_directory_name/2: Cannot represent due to `encoding' (Cannot represent char U04b9 using current locale encoding)
15: ERROR: Rational trees: received error: file_directory_name/2: Cannot represent due to `encoding' (Cannot represent char U04a6 using current locale encoding)
15: ERROR: Finite trees: received error: file_directory_name/2: Cannot represent due to `encoding' (Cannot represent char U04d8 using current locale encoding)
15: Script /home/peter/src/swipl-devel/src/Tests/files/test_glob.pl failed
15: done
15: *** 1 tests failed ***
1/1 Test #15: swipl:files ......................***Failed 0.21 sec
Also, please run this command (requires the the latest source code for packages/pcre):
?- forall(re_config(Z), writeln(Z)).
bsr2(unicode)
compiled_widths(7)
depthlimit(10000000)
heaplimit(20000000)
jit(true)
jittarget(x86 64bit (little endian + unaligned))
linksize(2)
matchlimit(10000000)
never_backslash_c(false)
newline2(lf)
parenslimit(250)
stackrecurse(false)
unicode(true)
unicode_version(13.0.0)
version(10.36 2020-12-04)
Here's the output from the test
swipl:files
:
This is normal. On POSIX systems file names are char strings using the current locale encoding. So, Prolog cannot access any files that cannot be encoded in the locale you give it. These tests require a UTF-8 based locale.
Otherwise I have to think what this means. Seems most of the Unicode stuff works fine on this machine. Might indeed be useful to know about sizeof(wchar_t).
Results from s390x when compiling swi-prolog 8.5.11:
wchar_t has size 4
Output from src/swipl -g 'forall(re_config(Z), writeln(Z)), halt'
:
Running re_config
bsr2(unicode)
compiled_widths(7)
depthlimit(8192)
heaplimit(20000000)
jit(false)
linksize(2)
matchlimit(10000000)
never_backslash_c(false)
newline2(lf)
parenslimit(250)
stackrecurse(false)
unicode(true)
unicode_version(14.0.0)
version(10.39 2021-10-29)
That differs to x86_64 in:
jit(true)
jittarget(x86 64bit (little endian + unaligned))
Hmmm ... I've tested with jit turned off, but perhaps it wasn't thorough enough. I'll look into this in about a week -- hope that's OK for you.
I suspect that the lack of JIT has nothing to do with this bug. I'll put together some additional unit tests for packages-cpp, that should help in isolating the problem. My preferred way of doing this is a PR that contains some additional debug print statements, which I'll remove either before the PR is accepted or can be removed in a future PR (the debug output would only be visible if you run ctest -V
I've created https://github.com/SWI-Prolog/packages-cpp/pull/12 (or https://github.com/kamahen/packages-cpp/commit/2b7ff98ae08d4f92e1c6d4ba2f27d4c83fd3ace7) for a test that doesn't use PCRE2.
@brebs-gh Please rebuild with this and run with ctest -V -R cpp
. This is the expected output:
test 42
Start 42: cpp:ffi
42: Test command: /home/peter/src/swipl-devel/build/src/swipl "-p" "foreign=" "-f" "none" "--no-packs" "--on-error=status" "-s" "/home/peter/src/swipl-devel/packages/cpp/test_ffi.pl" "-g" "test_ffi" "-t" "halt"
42: Test timeout computed to be: 10000000
42: % PL-Unit: ffi ....................***(W_ATOM_FFI/)***/0
42: ***(W_ATOM_FFI/ )***/1
42: ***(W_ATOM_FFI/abC)***/3
42: ***(W_ATOM_FFI/Hello World!)***/12
42: ***(W_ATOM_FFI/хелло)***/5
42: ***(W_ATOM_FFI/хелло 蛇)***/7
42: ***(W_ATOM_FFI/網目錦へび [àmímé níshíkíhéꜜbì])***/26
42: .. done
42: % All 22 tests passed
1/1 Test #42: cpp:ffi .......................... Passed 0.15 sec
The following tests passed:
cpp:ffi
Results for s390x:
42: Test command: /builds/brebs/aports/testing/swi-prolog/src/swipl-8.5.11/src/swipl "-p" "foreign=" "-f" "none" "--no-packs" "--on-error=status" "-s" "/builds/brebs/aports/testing/swi-prolog/src/swipl-8.5.11/packages/cpp/test_ffi.pl" "-g" "test_ffi" "-t" "halt"
42: Test timeout computed to be: 10000000
42: % PL-Unit: ffi .................... done
42: % PL-Unit: wchar *** W_ATOM_FFI// ***/0
42: *** W_ATOM_FFI/ / ***/1
42: *** W_ATOM_FFI/abC/ ***/3
42: *** W_ATOM_FFI/Hello World!/ ***/12
42: *** W_ATOM_FFI/хелло/ ***/5
42: *** W_ATOM_FFI/хелло 世界/ ***/8
42: *** W_ATOM_FFI/網目錦へび [àmímé níshíkíhéꜜbì]/ ***/26
42:
42: ERROR: /builds/brebs/aports/testing/swi-prolog/src/swipl-8.5.11/packages/cpp/test_ffi.pl:107:
42: test wchar_1: wrong "all" answer:
42: ERROR: Expected: ["//0","/ /1","/abC/3","/Hello World!/12","/хелло/5","/хелло 世界/8","/網目錦へび [àmímé níshíkíhéꜜbì]/26"]
42: ERROR: Found: ["//0","//1","//3","//12","//5","//8","//26"]
42: . done
42: % 1 test failed
The test passed for all the other architectures.
That's an interesting result. It confirms my theory that the problem isn't in library(pcre) but in something with string formatting.
I'll put together a PR to isolate more possibilities and then hand this bug to @JanWielemaker (probably within a few days).
Indeed an unexpected result. It seems the problem is with with_output_to(string(S), Goal)
rather than pcre. I've browsed a little through the code, but didn't find anything suspicious.
From an older problem I've still got a fully working QEMU s390x emulator with an older version of Alpine on my machine. I'll try to reproduce the issue there.
Git it. The patch is in SWI-Prolog/swipl-devel@a9a5c7456989614c89c8223a58dc54732a9cdf2c, fixing Sfprintf() handling for %Ws
on big-endian machines (also wrong on little endian, but most strings worked fine).
Also added pcre to the system and ran its tests. They now nicely pass. Closing.
I didn't think about big-endian being the issue. But, of course. And I'm impressed with how well SWI-Prolog deals with big- and little-endian machines. (I have a horror story about one attempt to move between little- and big-endian machines)
Anyway, I added a few more Unicode tests, and they don't work as expected for a few values. They fail on both my ChromeBook (Debiann) and Ubuntu machines; I think that the ChromeBook has a newer version of Unicode.
@jan - please take a look at the new tests in https://github.com/kamahen/packages-cpp/commit/dae35c90ea7b57ec5705b566d10a53dcefde550c
(Yes, the code is a bit too much copy-pasta; I can fix that for the final version) (I haven't made a PR for this yet, because it fails; I also intend to add these tests to test_pcre.pl)
Here is my run log (on ChromeBook): typescript.txt
Seems to work correct. I get
ERROR: /home/janw/src/swipl-devel/packages/cpp/test_ffi.pl:126:
test wchar_2b: wrong "all" answer:
ERROR: Expected: [[9968,9969,9970,9982,9983],[129792],[6320,6321,6399],[12016,12017,12018,12019]]
ERROR: Found: [[47,9968,9969,9970,9982,9983,47,53],[47,129792,47,49],[47,6320,6321,6399,47,51],[47,12016,12017,12018,12019,47,52]]
This is to be expected as w_atomffi/2 writes /%Ws/%zd
, so we expect 47 (/) before and after and a digit. This is exactly what we see.
And I'm impressed with how well SWI-Prolog deals with big- and little-endian machines
Thanks. Well, it doesn't play much of a role in most of the code. Some of the string handling and particular QLF files. It was solved long ago when we happily used SPARC and Intel side by side. As little of the relevant code changes it doesn't break that quickly. This was one an exception. This bug is there for quite a while, but Sfprintf() isn't used much and surely not with Unicode (%Ws
was a late addition anyway). And, there seem to be very few big endian machines left :cry: Only one of the current Alpine targets apparently. That is a pity because it made an issue surface that also exists on little endian machines except that it work "most of the time" there.
w_atomffi/2 writes
/%Ws/%zd
I-am-an-idiot and can't even remember the code I wrote.
What's strange is that (as you can see in the typescript.txt that I attached), the wchar_2 test previously returned one result of "//1"
instead of "/\x1FB00\/1"
(which looked exactly like the bug); but now it works correctly. No idea why I got this wrong result, but I don't see any point in looking into it.
Mysteries happen :smile:
Added a Unicode>0xffffffff test to pcre, mainly to test that PCRE2 handles such Unicode: https://github.com/SWI-Prolog/packages-pcre/commit/620002336bb1c62666da0e566b2c49151a46fd54
I think we can close this issue.
Shown at https://gitlab.alpinelinux.org/brebs/aports/-/jobs/686395#L3785
Is a regression from 8.5.9