beyondgrep / ack2

**ack 2 is no longer being maintained. ack 3 is the latest version.**
https://github.com/beyondgrep/ack3/

Feature: searching by 'word' but aware of regexp syntax #559

Closed. epa closed this issue 7 years ago

epa commented 9 years ago

Create a file with contents

/HELLO/;
/\bGOODBYE\b/;

You will recognize that this is Perl regexp syntax. The first line matches HELLO anywhere in the string. The second matches GOODBYE but with word boundaries on either side, that is, the whole word only. This is a common idiom for writing regular expressions to match a whole word.

The problem comes when using ack to search the codebase. ack -w GOODBYE will not find the second line. As far as ack is concerned, the whole word in that line is bGOODBYE, because the b of the \b escape is itself a word character.

Now, I am not saying that ack should solve the halting problem and check all possible cases where some programming language quotes a whole word or pastes it together in some wacky way. But this regexp syntax is common not just to Perl but to many other programming languages that support regular expressions. It would be useful to make ack at least a little bit aware of it, so it can spot whole words even when inside regular expressions. (Another case is /\AGOODBYE\z/ to match the whole string.)

My proposal, then, is to tweak ack's -w flag so that at the front of the word it expects either a word boundary or \x where x is some alphanumeric character. This would make it return strictly more matches than before. Of course, there would be some false positives, particularly for Windows paths (where C:\temp would now match the word emp) and for TeX documents (\box would match ox).
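To make the proposal concrete, here is a minimal sketch in Python rather than Perl. It assumes that `-w` can be approximated as `\bWORD\b`, which is a simplification of what ack really does, and the function names `plain_word_pattern` and `escape_aware_word_pattern` are made up for illustration. The "enhanced" pattern accepts either a word boundary or a preceding backslash plus one alphanumeric character at the front of the word.

```python
import re


def plain_word_pattern(word: str) -> re.Pattern:
    """Approximate a plain whole-word search as \\bWORD\\b (an assumption)."""
    return re.compile(r"\b" + re.escape(word) + r"\b")


def escape_aware_word_pattern(word: str) -> re.Pattern:
    """Whole-word search that also accepts a leading regexp escape.

    The front of the word may be a word boundary, or may be preceded by a
    backslash and a single alphanumeric character (for example \\b or \\A).
    """
    front = r"(?:\b|(?<=\\[A-Za-z0-9]))"
    return re.compile(front + re.escape(word) + r"\b")


# Sample lines taken from the discussion above, paired with the word searched.
samples = [
    (r"/HELLO/;", "HELLO"),
    (r"/\bGOODBYE\b/;", "GOODBYE"),
    (r"/\AGOODBYE\z/", "GOODBYE"),
    (r"C:\temp", "emp"),  # false positive under the proposal
    (r"\box", "ox"),      # false positive under the proposal
]

for text, word in samples:
    plain = bool(plain_word_pattern(word).search(text))
    enhanced = bool(escape_aware_word_pattern(word).search(text))
    print(f"{text!r:22} {word!r:10} plain={plain} enhanced={enhanced}")
```

Because the only change is an extra alternative at the front of the word, every line the plain pattern matches is still matched, which is the "strictly more matches" property described above.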

If that is too much, perhaps ack could become a little more programming-language-aware and turn on this enhanced whole-word check only when it is reading a Perl source file? Or a new -W flag would be used for the fancy whole-word checking, with -w keeping its existing semantics (which have proved troublesome enough already; see https://github.com/petdance/ack2/issues/445).

petdance commented 7 years ago

ack can't be learning about the regex syntax of what it's searching. -w tweaks are coming in ack 3.

epa commented 7 years ago

Without wanting to go too crazy, feature requests like this might be handled by defining regexp transformations in .ackrc. So it might contain

custom-regexp-tweak-Foo: (?:\b|\\[a-z])$RE\b

Then --Foo would be recognized as a new option wrapping the regexp ($RE) as specified.
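As a rough illustration of how such a template might be expanded, here is a sketch in Python. Nothing here exists in ack: the `custom-regexp-tweak-` key, the `$RE` placeholder handling, and the helper names `parse_tweaks` and `apply_tweak` are all hypothetical. The template writes the backslash alternative as `\\[a-z]` (a literal backslash followed by a lowercase letter), which is an assumption about the intent.

```python
import re

PLACEHOLDER = "$RE"                 # hypothetical placeholder for the user's regexp
PREFIX = "custom-regexp-tweak-"     # hypothetical .ackrc key prefix


def parse_tweaks(ackrc_text: str) -> dict[str, str]:
    """Collect 'custom-regexp-tweak-NAME: TEMPLATE' lines into {NAME: TEMPLATE}."""
    tweaks = {}
    for line in ackrc_text.splitlines():
        line = line.strip()
        if not line.startswith(PREFIX):
            continue
        # Split on the first colon only; the template itself may contain colons.
        name, _, template = line[len(PREFIX):].partition(":")
        tweaks[name.strip()] = template.strip()
    return tweaks


def apply_tweak(template: str, user_regex: str) -> re.Pattern:
    """Substitute the user's regexp into the template and compile the result."""
    # Wrap the user's regexp in a non-capturing group so that any alternation
    # inside it cannot leak into the surrounding template.
    return re.compile(template.replace(PLACEHOLDER, "(?:" + user_regex + ")"))


ackrc = r"custom-regexp-tweak-Foo: (?:\b|\\[a-z])$RE\b"
tweaks = parse_tweaks(ackrc)

# Roughly what a hypothetical `ack --Foo GOODBYE` would search for.
pattern = apply_tweak(tweaks["Foo"], "GOODBYE")
print(pattern.pattern)
print(bool(pattern.search(r"/\bGOODBYE\b/;")))
```

Unlike a lookbehind, this template consumes the `\b` escape as part of the match, so the matched text would include it. Whether that matters depends on how matches are highlighted or reported.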

You might even let it be conditionalized on file type -- useful for matching string literals, for example, which have varying syntax in different languages.

This would let people muck around with weird and wonderful matching rules without having to change the ack core. Just a suggestion, I don't know whether you will like it.

petdance commented 7 years ago

Thanks. Don't like it. There's just too much customization going on here.