Weird behavior for string operations with escaped characters

nusakan commented 2 weeks ago

EYE v10.19.7 (2024-08-24)
SWI-Prolog version 9.0.4

I may be wrong but some eye reasoner behaviors may be incorrect.

Escaped characters issue

Some builtin string: <http://www.w3.org/2000/10/swap/string#> operations have unexpected behaviors when used with strings containing backslashes ("\\", "\t" ...) characters, especially when regular expressions are involved.

All provided examples are using the Notation3 syntax. (for which the single backslash "\" is not a valid syntax).

For example:

Expression	Expected	Obtained
`"\n" string:length ?v`	`?v` is `1`	`?v` is `2`
`("a\\b" "(\\\\)" "-") string:replace ?v`	`?v` is `"a-b"`	`?v` is `"a--b"`
`("a\\b" "(.*)") string:scrape "a\\b"`	Match	No match
Other not tested string manipulation operations	whatever	Not confident...

A quick fix may be implemented for specific builtin core string functions.

However, it seems that the internal representation of literals relies on the N3 strings serialization syntax:\ "\t" is represented as \ and t which produces these weird behaviors (even "segfault"s in some more complex cases).

So the quick fix may not be way to go: it introduces supplementary computational efforts, and only addresses partially the issue.

Thanks for your attention (and all the work you have already done).

Quick & dirty fix (`v10.19.7` - for illustration purpose only).

Not for production use - it may introduce critical issues, and does not solve all the problems
Only for string builtin functions described here: https://w3c-cg.github.io/n3Builtins/ (string:scrape, string:replace, string:length)
Some statements wrongly match (see Sample test cases below )
the escape_atoms rule ensure the conversion

%
% New rule
%
escape_atoms(X, Y) :-
    when(
        (   ground(X)
        ),
        (   atom_codes(X,W),
            escape_codes(V,W),
            atom_codes(Y,V)
        )
    ),
    when(
        (   ground(Y)
        ),
        (   atom_codes(Y,V),
            escape_codes(V,W),
            atom_codes(X,W)
        )
    ).

%
% Modified rules
%
'<http://www.w3.org/2000/10/swap/string#length>'(literal(A, _), B) :-
    when(
        (   ground(A)
        ),
        (   escape_atoms(A,V),
            sub_atom(V, 0, B, 0, _)
        )
    ).

'<http://www.w3.org/2000/10/swap/string#replace>'([literal(X, _), literal(Search, _), literal(Replace, _)], literal(Y, type('<http://www.w3.org/2001/XMLSchema#string>'))) :-
    when(
        (   ground([X, Search, Replace])
        ),
        (   (   atomic_list_concat(['(', Search, ')'], Se),
                regex(Se, X, [St|_]),    % Convert back to unescaped 
                escape_atoms(S,St)   % strings before processing
            ->  (   sub_atom(Search, 0, 1, _, '^')
                ->  atom_concat(S, T, X),
                    atom_concat(Replace, T, Y)
                ;   (   sub_atom(Search, _, 1, 0, '$')
                    ->  atom_concat(T, S, X),
                        atom_concat(T, Replace, Y)
                    ;   atom_codes(X, XC),
                        string_codes(S, SC),
                        atom_codes(Replace, RC),
                        subst([[[0'$, 0'1], SC]], RC, TC),
                        subst([[SC, TC]], XC, YC),
                        atom_codes(Y, YC)
                    )
                )
            ;   Y = X
            )
        )
    ).

'<http://www.w3.org/2000/10/swap/string#scrape>'([literal(X, _), literal(Y, _)], literal(Z, type('<http://www.w3.org/2001/XMLSchema#string>'))) :-
    when(
        (   ground([X, Y])
        ),
        (   regex(Y, X, [W|_]),
            atom_string(V,W),
            escape_atoms(Z,V)
        )
    ).

Sample test cases

@prefix string: <http://www.w3.org/2000/10/swap/string#> .
@prefix : <http://example.org/test#>.

{
  "\t" string:length 1 .
  ("\\" "\t") string:concatenation "\\\t" .
  "\\\t" string:contains "\\\t" .
  "x\\A\tx" string:containsIgnoringCase "\\a\t" .
  "end\\" string:endsWith "\\" .
  "\\A\t" string:equalIgnoringCase "\\a\t" .
  ("[%s]" "\\\t\n") string:format "[\\\t\n]" .
  "\\\t1" string:greaterThan "\\\t0" .
  "\\\t0" string:lessThan "\\\t1" .
  "\\" string:matches "\\\\" .
  "\t" string:matches "\\t" .
  # "\t" string:matches "\t" .      # WARNING: wrongly resolved with no error
  "A\tB" string:notEqualIgnoringCase "a\\tB" .
  "\\\t" string:notGreaterThan "\\\t" .
  "\\\t" string:notLessThan "\\\t" .
  "\\\t" string:notMatches "\\n" .
  # "\\\t" string:notMatches "\t" . # WARNING: wrongly resolved with no error
  ("a\\b" "(\\\\)" "\t[$1]\t") string:replace "a\t[\\]\tb" .
  ("\\" "(.*)") string:scrape "\\" .
  ("\\" "(\\\\)") string:scrape "\\" .
  ("\t" "(.*)") string:scrape "\t" .
  ("\t" "(.*)")!string:scrape string:length 1 .
  "\\start" string:startsWith "\\" .
} => {
  :test :passed true .
} .

josd commented 2 weeks ago

Thank you very much for your fix which is more or less done in the latest eye and lingua.

nusakan commented 2 weeks ago

Thank you for your fast commit.

It seems however that there are still few failures. I've written a small test suite in Notation3 (extended to other string: rules), covering a larger scope (like false positives/negatives or errors).

I've uploaded here: string_test.n3.txt (renamed to .txt as github does not recognize .n3 as text files)

You may run it with:

eye --quiet --nope string_test.n3

This test suite may not be quiet exact (I've not found an explicit documentation for the all the string: rules), you may review it before considering the test should pass or fail.

Here are the failing properties as exposed by the test suite alongside the involved tests (see the test suite for reference) (some core builtin rules are still there) with EYE v10.19.8 (2024-08-30) (master branch):

@prefix string: <http://www.w3.org/2000/10/swap/string#>.
@prefix : <http://example.com/string_test.n3#>.

string:greaterThan :fail (:t090 :t091 :t092 :t093 :t094 :t095).
string:lessThan :fail (:t110 :t111 :t112 :t113 :t114 :t115).
string:matches :fail (:t140).
string:notContainsRoughly :fail (:t150).
string:notGreaterThan :fail (:t173 :t174 :t175 :t170 :t171 :t172).
string:notLessThan :fail (:t183 :t184 :t185 :t180 :t181 :t182).
string:notMatches :fail (:t190).
string:replaceAll :fail (:t210 :t211 :t215).
string:scrapeAll :fail (:t231 :t232).
string:search :fail (:t244).
string:substring :fail (:t261 :t262).
string:upperCase :fail (:t270).
string:contains :fail (:t020).
string:containsIgnoringCase :fail (:t030).
string:containsRoughly :fail (:t040).
string:endsWith :fail (:t050).
string:format :error (:t081).

Best regards

josd commented 2 weeks ago

Thank you for your fast test development👍 Your cases should also perfectly fit in https://github.com/eyereasoner/eye/blob/master/reasoning/bi/biP.n3#L111 Anyway, I first try to fix the issues and some 30 of them are now fixed in https://github.com/eyereasoner/eye/releases/tag/v10.19.9 and https://github.com/eyereasoner/lingua/releases/tag/v1.4.1

The remaining ones are now

string:matches :fail (:t140).
string:notMatches :fail (:t190).
string:replaceAll :fail (:t210 :t215).
string:scrapeAll :fail (:t231 :t232).
string:search :fail (:t244).
string:substring :fail (:t262).
string:format :error (:t081).
string:replaceAll :error (:t211 :t212).

josd commented 1 week ago

We are now at

string:matches :fail (:t140).
string:notMatches :fail (:t190).
string:replaceAll :fail (:t210 :t215).
string:scrapeAll :fail (:t231 :t232).
string:search :fail (:t244).

nusakan commented 1 week ago

It appears that raised errors are independent of tests success or failure.

Errors such as:

** ERROR ** eam ** error(syntax_error(missing closing parenthesis),_624)
[...]

.. are raised with the following passing tests:

:t203 :assertNone { ("a\tb" "\\" "-") string:replace "a-tb" } .
:t213 :assertNone { ("a\tb" ("\\") ("-")) string:replaceAll "a-tb" } .
:t223 :assertNone { ("a\tb" "(\\)" ) string:scrape ?t223v } .
:t234 :assertNone { ("a\tb\nc" "(\\)") string:scrapeAll ("t") } .

While the following failing tests do not raise any error:

:t215 :assertNone { ("\tb" ("\t" "b") ("x" "y")) string:replaceAll "xy" } .
:t232 :assertNone { ("a\tb\nc" "(\t|\n)") string:scrapeAll ("\t" "\n") } .
:t244 :assertNone { ("a\\\t" "(\\+)") string:search ?t244v } .

I see no problem for regular expressions expressed with wrong syntax to have errors turned into warning messages, provided it follows the SNAF rule, as it provides substantial information (even if hard to debug). However for consistency reasons, the same behavior should be adopted in any case.

josd commented 1 week ago

I guess those exceptions might have to do with https://github.com/SWI-Prolog/packages-pcre/issues/24 so will stay tuned. After a few tweaks we are now at

string:replaceAll :fail (:t211 :t215).
string:scrapeAll :fail (:t231 :t232).
string:search :fail (:t244).

eyereasoner / eye