beyondgrep / ack3

ack is a grep-like search tool optimized for source code.
https://beyondgrep.com/

Word boundaries with non-ASCII character #267

Open teoric opened 9 years ago

teoric commented 9 years ago

I am trying to look for the presence of a word containing non-ASCII characters, and this is not possible:

ack '\büber'
ack -w über

The first should find exactly the lines containing a word starting with über, and the second exactly the lines where über appears as a whole word, shouldn't it? The texts are UTF-8, and dropping the boundaries yields thousands of results, as does searching with pcregrep.

The first pattern also returns lines containing words that merely contain über (such as darüberhinaus), and the second also returns lines containing words ending in über (such as darüber). This suggests that the boundary matches before ü, i.e. that ü is not counted as a word character (but it should be).

Locale is set to "de_DE.UTF-8", but unsetting it does not change anything.

(ack 2.12 / perl 5.18.2 with Ubuntu 14.04, and ack 2.14 / perl 5.22 on Mac OS X 10.11)

Matches can be made correct, as far as I can see, by adding

use feature 'unicode_strings'; # optionally: use re '/u';

to the beginning of the ack script (probably also if added to the library). Maybe this could be made an option for non-ASCII-ists? Switching to Unicode processing would probably also help to attack beyondgrep/ack3#262.

n1vux commented 9 years ago

I confirm the behavio[u]?r, in the default US locale. Sad that the DE locale doesn't 'just work'. :-(

The user workaround appears to be to combine a (seemingly redundant) \b with -w and (?u) (the local equivalent of /u): ack -w '(?u)\büber'. (A trailing \b might be needed in addition, or instead, if the word ends in an accented character.) [See below.]

While supporting natural language is a "nice to have", 'ack' is defined as a source-code search tool, not a natural-language corpus search tool. (If it were for natural language, we'd have long since added the much-requested paragraph mode. Unix motto: do one thing well.) When and where 'perl' is built with Unicode as the default and handles /\büber\b/ correctly, 'ack' will too. Until then, the broken unicode+locale \b handling is really a Perl 5 bug in perlre, and hacking in a local fix here would just paper over it.

Possible solutions

$ ack '(?u)über' .perlbrew/libs/perl-5.16.2@full/lib/perl5/POD2/DE/local/lib.pod
du es über diesen Weg zu deinem Shell Startup Skript hinzufügen:
Die "~" wird übersetzt zu dem Benutzer Profil Verzeichnis (das Verzeichnis was
Versucht den angegebenen Pfad anzulegen, mit allen nötigen drüberliegenden
Der daraus resultierende Pfad wird zu L</resolve_empty_path> übergeben, dessen
an L</resolve_path> übergeben, welches dann den Rückgabewert stellt.
man vorsichtig sein über die Tatsache das der Prozess über die Neuinstallation
benutzt wenn man sehr sicher darüber ist welche Konsequenzen einem

$ ack -w '(?u)über' .perlbrew/libs/perl-5.16.2@full/lib/perl5/POD2/DE/local/lib.pod
du es über diesen Weg zu deinem Shell Startup Skript hinzufügen:
man vorsichtig sein über die Tatsache das der Prozess über die Neuinstallation
benutzt wenn man sehr sicher darüber ist welche Konsequenzen einem

$ ack -w '(?u)\büber' .perlbrew/libs/perl-5.16.2@full/lib/perl5/POD2/DE/local/lib.pod
du es über diesen Weg zu deinem Shell Startup Skript hinzufügen:
man vorsichtig sein über die Tatsache das der Prozess über die Neuinstallation
teoric commented 9 years ago

Thank you very much for this solution! Maybe just mentioning (?u) in the documentation would be enough, even though a --unicode / -u flag would be nice.

Regarding the statement on "natural language", I disagree. Most source code nowadays is no longer confined to ASCII (even Perl: use utf8 etc., though style guides and precautions against hitting the walls of old-fashioned anglocentric programs are a different matter), so even a code-searching tool should not confine itself to ASCII text.

n1vux commented 9 years ago

I don't disagree that supporting beyond-ASCII code search is useful. (?u) and \b should work; 'ü' should be \w (and lowercase) in your locale at least, and probably in mine too. It's not punctuation or spacing! I was just pointing out that Andy has continually refuted the heretical notion that 'ack' could be excellent at both code search and natural-language search (although I am not entirely in agreement with that), and since that is the project guidance, being no-more-broken-than-Perl with accented characters, until Perl gets better with them, may well be tolerable.
We should verify that the ?u and -w \b bugs are logged against Perl 5 upstream and see if and when they get fixed. If fixing both this and beyondgrep/ack3#262 is simple, in ways that will not break the code for others, nor break when and where Perl's unicode+locale handling is fixed so that REs just work, we should do so. I gave two brainstorms for this symptom, and may eventually look at beyondgrep/ack3#262 and the code to see if the suggested synergy is there.

Since Andy doesn't do Unicode or locales much, it will probably fall to one of us who has some accented text files handy and a second locale enabled ... when I get some of those mythical "round TUITs". I would like to count how many Mercier vs Merçier I have! (I fear I've been wrong for a while.) Alas, most of the French text on my drive is in those natural-language files Andy says ack isn't intended for. :-(

hoelzro commented 9 years ago

It's not just the /u/(?u) flag that's needed; the text being searched also needs to be decoded into a Unicode string (rather than an octet sequence) for Unicode regular expressions to work. For example:

$ perl -E 'say "matches" if "über" =~ /^\w+$/u'
<nothing>
$ perl -E 'use utf8; say "matches" if "über" =~ /^\w+$/u'
matches

Keep in mind as well that "über" can have different representations even when encoded as UTF-8; you can have "\N{LATIN SMALL LETTER U WITH DIAERESIS}ber" or "u\N{COMBINING DIAERESIS}ber":

$ perl -CO -E 'use utf8; say "\N{LATIN SMALL LETTER U WITH DIAERESIS}ber"'
über
$ perl -CO -E 'use utf8; say "u\N{COMBINING DIAERESIS}ber"'
über

A lot of the hurdles that full Unicode support would require are mentioned on https://github.com/petdance/ack2/wiki/Plans-for-2.1:-Unicode-Support.

The problem with Unicode support is that older perls don't have the support (ack is designed to run on 5.8 or better), it incurs a performance hit for a case that (seemingly) seldom occurs, and there are some open questions on how to do it (for example, what encoding are search targets in? Granted, we treat them as ASCII right now, so UTF-8 is probably a good bet). We could really use an encoding wizard like @patch to steer us in the right direction! But, as @n1vux said, Andy's been reluctant to include Unicode support for the reason of keeping ack focused as a source-code searching tool.

hoelzro commented 9 years ago

I'm also wondering if mucking around with PERL_UNICODE (from perlrun) would address the issue.

n1vux commented 9 years ago

The problem with Unicode support is that older perls don't have the support (ack is designed to run on 5.8 or better)

I think we can accept that users needing Unicode will have, or be willing to get, a Unicode-capable perl.

Having any such support enabled by --unicode in .ackrc or on the command line, and/or by PERL_UNICODE in .bashrc etc., would isolate non-users from any penalty.

(Whether we can/want to include error saying "we'd do that if you upgrade your Perl" or have features that are only enabled when Perl is new enough is another topic.)

hoelzro commented 9 years ago

I think we can accept that users needing Unicode will have, or be willing to get, a Unicode-capable perl.

@n1vux Ok, true. =) If we ended up doing this, I also favor --unicode for the same reasons.

n1vux commented 9 years ago

Bingo: @hoelzro's PERL_UNICODE comment on beyondgrep/ack3#262 is a better user workaround here too.

$ PERL_UNICODE=SAD ack -w 'über' .perlbrew/libs/perl-5.16.2@full/lib/perl5/POD2/DE/local/lib.pod
du es über diesen Weg zu deinem Shell Startup Skript hinzufügen:
man vorsichtig sein über die Tatsache das der Prozess über die Neuinstallation
pdl commented 9 years ago

even a code searching tool should not confine it self to ASCII text

I definitely agree. And, because I know Andy likes to have concrete examples, I want to confirm from experience that this is not just a theoretical question: I have on more than a few occasions written code in Perl and XSLT to handle 'special cases', e.g. the Turkish dotless i (which has different casing rules), the Polish Ł (Unicode decomposition doesn't separate the stroke), or curly quotes (upgrading to/downgrading from), and searching my code for those letters to find previous work with them would have been a boon at the time.

That said, it's a big chunk of work to get right and it does need to be done right for it to be worth it.

patch commented 9 years ago

Although I’m a big fan of ack, I use it less and less these days because of its lack of Unicode support. Source code can and does contain non-ASCII characters (utf8 pragma, anyone?).

Here are some real-world examples:

Note that although much Unicode regex functionality has been added to major Perl releases over the years, and both Perl 5.12 and 5.14 added improvements to prevent common bugs, I consider any version starting at 5.8.1 to have adequate Unicode support for basic usage. Most of the new functionality only relates to ack in that you would be allowed to perform searches using new regex features if you have them in your version of Perl.

See also: The importance of language-level abstract Unicode strings

n1vux commented 9 years ago

Hi Nick, good to hear from you.

I am philosophically in full agreement with you but pragmatically lean towards evolving with Perl, not trying to solve Perl's problems from within an app.

Did you note the comment on your linked article?

"Perl is a nightmare to use with unicode. Precisely because it attempts to do what you advocate."

You have a workaround to make your ack work with Unicode full-time, even before we can provide --unicode for .ackrc:

alias ack='/usr/bin/env PERL_UNICODE=SAD ack'

although I would like to run our full 'prove' test suite with it before recommending it widely! (Edited to add: test results attached to beyondgrep/ack3#258; not terrible, but not clean.)

n1vux commented 9 years ago

Further notes re PERL_UNICODE=SAD are in perlrun, which see for more detail.

patch commented 9 years ago

I am philosophically in full agreement with you but pragmatically lean towards evolving with Perl, not trying to solve Perl’s problems from within an app.

What I’m saying is that Perl has already solved these problems, although not everything is enabled by default.

That is, you still need something along these lines for Unicode and UTF-8 to work seamlessly:

use v5.14;
use utf8;
use open qw( :utf8 :std );

“Perl is a nightmare to use with unicode. Precisely because it attempts to do what you advocate.”

Perl has the best core Unicode support of any existing programming language. There are two traditional problems though:

  1. Boilerplate is required to maintain backward compatibility; see the three lines above for scripts/apps or just the first two for modules/classes.
  2. The dreaded Unicode Bug, a problem largely solved by v5.12 and completely by v5.14, but one that requires explicit opt-in via the version declaration (first line above).

Perl/Unicode problems have been solved for years. The problem now is educating everyone how to fix their code. With the above three pragmas in your toolbox, you don’t need to worry about Unicode, can work exclusively with logical characters, and have the advantage of an extremely powerful regex engine with the best Unicode support.

Thank you for pointing out PERL_UNICODE, which I didn’t realize would fix ack. I personally think the most beneficial short-term change (with the least amount of work!) would be to clearly document this in the ack docs. I’d be happy to contribute this after I return from vacation.

n1vux commented 9 years ago

Thank you for pointing out PERL_UNICODE, which I didn’t realize would fix ack. I personally think the most beneficial short-term change (with the least amount of work!) would be to clearly document this in the ack docs. I’d be happy to contribute this after I return from vacation.

That would be much appreciated!

As long as Perl 5.8 is the only /bin/perl on legacy IT systems, standalone ack needs to work there too. To make your suggested 5.14 solution work when available yet still work with 5.8 (without unicode), we'd need to test if $] lt '5.014'; in a BEGIN block ? This applies to beyondgrep/ack3#258 ...

How does use utf8; use open qw( :utf8 :std ); deal with non-UNICODE fugly codepoints in legacy input files?

patch commented 9 years ago

To make your suggested 5.14 solution work when available yet still work with 5.8 […]

ack doesn’t actually need use v5.14 or any other version greater than v5.8. The specific feature imported from that statement (also available via use feature 'unicode_strings' in v5.12+) isn’t required for Unicode support. It allows you to implicitly upgrade UTF-8 encoded byte strings to logical character strings seamlessly when performing actions like concatenating a byte string to a character string, instead of treating the byte string as if it were Latin-1. Without the unicode_strings feature you just have to be more explicit about encoding and decoding. Using the open pragma as described will handle the I/O for you. Some caution has to be taken with CPAN modules because you never know which ones return byte strings and which ones return character strings, unless that’s documented and it frequently isn’t. That is Perl’s big “Unicode nightmare”—although it has provided great Unicode support for over a decade, people use it inconsistently and/or incorrectly, including many CPAN authors.

How does use utf8; use open qw( :utf8 :std ); deal with non-UNICODE fugly codepoints in legacy input files?

The utf8 pragma just declares that your source code is in UTF-8, expands the characters that you can use for identifiers, and implicitly decodes the string literals to character strings. It has no effect on I/O and interaction with other modules.

The open pragma declares the character encoding for I/O and implicitly decodes on input and encodes on output. Specifying :utf8 declares the expected encoding for filehandles and :std expands that to STDIN, STDOUT, and STDERR. Although I normally use :encoding(UTF-8) instead of :utf8 because the former provides a proper UTF-8 implementation, I recommend :utf8 for ack because it provides Perl’s loose interpretation of UTF-8, as used internally to represent character strings, and I wouldn’t want ack to croak if it encounters invalid UTF-8 sequences.

epa commented 9 years ago

Guys, I suggest that the Unicode discussion here is a red herring. This bug is another instance of https://github.com/petdance/ack2/issues/445. The patch for that issue also fixes this one.

n1vux commented 9 years ago

Re beyondgrep/ack2#445: the /usr/bin/env PERL_UNICODE=SAD ack workaround discussed above is consistent with the above observation. It would be a bug for a Unicode or accented character to be treated as (or the same as) a metacharacter, so even assuming that the beyondgrep/ack2#445 patch masks this symptom by making -w smarter, Unicode remains the elephant in the room. ü is alphabetic and should be treated as \w, not as /^\w\W$/. If files are read in such a way that ü is split into a shift byte that looks alphabetic (the ubiquitous Ã) and a data byte, it will not be recognized for what it is: a letter.

epa commented 9 years ago

n1vux, thanks for your observations. I was mistaken that beyondgrep/ack2#445 fixes this issue entirely. Although it means that ack -w über no longer matches darüberhinaus, it will still wrongly match darüber. So turning on correct Unicode support is still needed.

n1vux commented 2 years ago

Related, saving thoughts from Slack discussion prompted by discussion of new features.

In Perl 5.22+, \b{wb} is a Unicode word-break boundary tuned for natural language. (I gather from an example that this may keep don't together when splitting on \b{wb} instead of on \b.) This doesn't seem terribly useful for searching code. (And if I want this behavior when off-label searching a text corpus, I can inject it into my RE myself.)

The Unicode line-break variant, \b{lb}, likewise isn't likely to be useful for code either, even if a paragraph mode were added to search multi-line code stanzas (one might hope ^$ would do that under use utf8, but IDK) or top-level matched brackets (plus rest-of-line context).

ambs commented 1 year ago

I just added use locale; in the ack script to have proper boundaries. Isn't that possible to be there by default?

n1vux commented 1 year ago

@ambs , are Unicode word-breaks useful for parsing some programming languages?

(Andy @petdance is adamant this is a programmer's search tool, not a natural-language search tool, despite the rather obvious off-label usage by those of us who search code, structured data, and text, and don't want three tools.)

petdance commented 1 year ago

Is adding use locale; all that's necessary to have the proper boundaries? What other effects would it have?

n1vux commented 1 year ago

What other effects would it have?

Well, it would require adjusting the running of many of the t/ tests to avoid failure when run outside the en_US & en_GB world, as it would make the definitions of \w \W \d \D \b, and maybe even \s \S, dependent on $LC_CTYPE and $LC_NUMERIC instead of flat ASCII, always the same everywhere. It would also change the collating sequence in character lt etc. comparisons.

The damage limitation might be as simple as adding export LC_ALL=C to the makefile for target test?

This is tantalizingly simple, but enabling locales, however reasonable, is quite dramatically invasive. And to do it correctly would require multiplying the number of tests: many tests should have equivalents run explicitly in en_US.UTF-8 and other non-English, non-ASCII locales, as well as under LC_ALL=C.

(The simplest solution is to have the line there by default but commented out, so that changing a # to a space is all that is needed, maybe even with an explanatory comment: "for those needing local non-ASCII characters in \w and thus in \b".)

ref perldoc perllocale

n1vux commented 1 year ago

Since people can use locally accented characters in their Perl variable names, this is not strictly an NLP use case; it applies to source code as well. (OTOH, they can also use non-locale accented characters.)

perl -Mutf8 -E ' my $élan=1; say $élan;'
1
ambs commented 1 year ago

Is adding use locale; all that's necessary to have the proper boundaries? What other effects would it have?

I still had some word-boundary issues, but far fewer than in the beginning. To be prudent, could that be activated by a --locale flag?

petdance commented 1 year ago

I'm open to any solution if we know what the ramifications are of it.

So, for someone who is not at all familiar with character encodings and the like, how do we describe what the --locale option (and the corresponding --nolocale that we would need) does? I'm assuming that in the code it would just be roughly

if ( $opt{locale} ) {
    eval "use locale; 1;"
}

What does that do for the user, in English? What does that enable for the user? Are there are any tradeoffs? Does it slow things down? Can it break otherwise-working regexes? etc etc etc

ambs commented 1 year ago

Regarding behavior change, I would say something like:

enables regular expression classes, like \w, to depend on your current operating system locale settings. This might, for instance, allow \w to match any character that is considered part of a word in your selected language.

But given I am no expert, I am happy if someone fixes my suggestion :-)

n1vux commented 1 year ago
if ( $opt{locale} ) {
    eval "use locale; 1;"
}

I'm more confident of this working in ack than in most programs, since our key REs are dynamically compiled from the command line, so yes, for ack this is a plausible solution?