beyondgrep / ack3

ack is a grep-like search tool optimized for source code.
https://beyondgrep.com/

Paragraph mode (aka paragrep, multiline) #171

Open n1vux opened 5 years ago

n1vux commented 5 years ago

As noted in #99, ag (The Silver Searcher) and Perl both support (respectively) --multiline or -000 paragraph mode, as of course does the eponymous one-trick pony paragrep.

While this is most obviously useful for data and NLP use cases (officially unsupported), -000 is directly applicable to code if coders are using the recommended vertical whitespace for paragraphing their code. And if the paragraph-end sequence is specifiable as an optional arg on the option, coders can specify /^[}]/ or /^\w*$/ or whatever is the end of a block or sub in their corpus of code (possibly /^\t[}]/ if they like that extra tab at the end of sub bodies), which would approximate the suggested --same-subroutine feature.

(An arg specifying end-of-para pattern would also support filtering multiline structured data (like EDI-INT tagged-data multiline hierarchical record streams/files).)

(As a possible enhancement, we might later allow specifying an end-paragraph pattern for each --type in .ackrc?)
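A minimal sketch of the idea above, written in Python as a stand-in rather than as anything from ack's code base. The helper name `split_paragraphs` and its `end_pattern` parameter are hypothetical: with no pattern it splits on blank lines (roughly what -000 does), and with a pattern such as /^[}]/ it ends a paragraph at each matching line.

```python
import re

def split_paragraphs(text, end_pattern=None):
    """Split text into paragraphs (hypothetical helper, not ack's API).

    Without end_pattern, a paragraph is a run of non-blank lines.
    With end_pattern, a paragraph ends at, and includes, any matching line.
    """
    end_re = re.compile(end_pattern) if end_pattern else None
    paragraphs, current = [], []
    for line in text.splitlines():
        if end_re is None:
            if line.strip() == "":
                # Blank line: close the paragraph in progress, if any.
                if current:
                    paragraphs.append("\n".join(current))
                    current = []
            else:
                current.append(line)
        else:
            current.append(line)
            if end_re.search(line):
                paragraphs.append("\n".join(current))
                current = []
    if current:
        paragraphs.append("\n".join(current))
    return paragraphs

SAMPLE = """sub alpha {
    my $x = 1;

    return $x;
}

sub beta {
    return 2;
}
"""

blank_delimited = split_paragraphs(SAMPLE)            # stanzas inside subs
brace_delimited = split_paragraphs(SAMPLE, r"^[}]")   # roughly one per sub
```

With the blank-line default, the body of `alpha` is split into two stanzas; with the `^[}]` end pattern, each sub stays in one chunk, which is the approximation of --same-subroutine described above.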

Paragraph mode would interact with --and by providing an additional, medial mode of sameness between the obvious smallest and largest extents (same-line and same-file), without requiring an additional option flag: --paragraph-mode pattern --and pattern would naturally switch same-line to same-paragraph by switching the primary object from a line buffer to a paragraph buffer.
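To illustrate the same-line versus same-paragraph distinction, here is a small Python sketch. The function `units_matching_all` is a hypothetical name, and the logic only models what "--and" would mean when the unit being tested is a paragraph rather than a line; it is not how ack implements --and.

```python
import re

def units_matching_all(units, patterns):
    """Keep the units (lines or paragraphs) in which every pattern matches."""
    regexes = [re.compile(p) for p in patterns]
    return [u for u in units if all(r.search(u) for r in regexes)]

TEXT = "open the file\nread a line\n\nopen the socket\nsend a packet\n\nclose the file\n"

lines = TEXT.splitlines()
paragraphs = [p for p in re.split(r"\n\s*\n", TEXT.strip()) if p]

# Same patterns, different unit of "sameness".
line_hits = units_matching_all(lines, [r"open", r"line"])
para_hits = units_matching_all(paragraphs, [r"open", r"line"])
```

No single line contains both words, so the line-buffer search finds nothing, while the paragraph-buffer search returns the first stanza.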

Paragraph mode would require that match patterns have the (?sm) or //sm flags activated, so that . matches internal newlines, ^ and $ also match next to internal newlines, and \A and \z match the beginning and end of the whole paragraph/chunk. (Just as PBP says to always do.) (Which means foo$.*^bar is not meaningless in --paragraph mode, just as it becomes meaningful in Perl with qr{}sm and likely -000.)
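The effect of the two flags can be checked with a short Python sketch. Python's re module is used here only as a stand-in for Perl's engine: re.S and re.M correspond to /s and /m, and Python spells Perl's \z as \Z.

```python
import re

PARA = "foo\nmiddle\nbar"   # one paragraph held in a single buffer

# No flags: '.' stops at newlines and $ cannot match after "foo".
plain = re.search(r"foo$.*^bar", PARA)

# Equivalent of (?sm): '.' crosses newlines, ^ and $ match beside them.
multi = re.search(r"foo$.*^bar", PARA, re.S | re.M)

# \A still anchors to the start of the whole paragraph, unlike ^.
whole_start = re.search(r"\Amiddle", PARA, re.S | re.M)
line_start = re.search(r"^middle", PARA, re.S | re.M)
```

Here `plain` and `whole_start` are None, while `multi` spans the whole buffer and `line_start` matches at the second line.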

n1vux commented 5 years ago

(Slack discussion recorded in #171 and #172 is 2018-11-29 https://beyondgrep.slack.com/archives/C4J886HT2/p1543503390000600 )

elfring commented 3 years ago

:crystal_ball: I am curious how such an issue will evolve further.

n1vux commented 3 years ago

This would require significant internal changes to Ack's underpinnings so I'm not terribly hopeful, but at least Andy put the (feature) flag on instead of closing it as out-of-scope NLP (possibly convinced because of nice whitespace praxis wrapping code on multiple lines).

(There are optimization techniques to avoid doing an if...else... in the innermost loop so that is not an argument against doing this. Adding the branch in the innermost read loop would be the naïve prototype. Which would then expose the downstream issues ... e.g. what to use for line numbers when reporting; do we count paras, or do we need to sum the NLs in each para for compatibility (with what? grep or paragrep?). Or does -000 require #311 and suppress line numbers?)
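One possible answer to the line-number question, sketched in Python under the assumption that both a paragraph count and grep-compatible line numbers are wanted. The names `paragraphs_with_line_numbers` and `report` are hypothetical; the point is only that the physical line of a match can be recovered by adding the newlines that precede it inside the paragraph to the paragraph's starting line.

```python
import re

def paragraphs_with_line_numbers(text):
    """Return (start_line, paragraph) pairs for blank-line separated text."""
    result, current, start = [], [], None
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line.strip():
            if not current:
                start = lineno          # first line of a new paragraph
            current.append(line)
        elif current:
            result.append((start, "\n".join(current)))
            current = []
    if current:
        result.append((start, "\n".join(current)))
    return result

def report(text, pattern):
    """Return (paragraph_number, paragraph_start_line, match_line) per hit."""
    regex = re.compile(pattern, re.S | re.M)
    hits = []
    numbered = paragraphs_with_line_numbers(text)
    for para_no, (start, para) in enumerate(numbered, start=1):
        m = regex.search(para)
        if m:
            # Sum the newlines before the match to get the physical line.
            match_line = start + para.count("\n", 0, m.start())
            hits.append((para_no, start, match_line))
    return hits

TEXT = "alpha\nbeta\n\n\ngamma\ndelta needle\nepsilon\n"
hits = report(TEXT, r"needle")
```

For this input the match is in the second paragraph, which starts on line 5, and the match itself is on line 6, so either numbering scheme can be reported from the same pass.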

(I'm wishing I'd implemented this using the never-used, removed input-filters feature of Ack2, where it would have been more natural.)

FWIW I commented on #333 that I'm now using swish-e as a lightweight indexed search Text Retrieval to select (NLP) documents with words near each other irrespective of sentence, paragraph, newline for further mashing with ack -x or just less; it's more text than code but its demo is on its own source code so it's usable on variables and keywords as tokens.

n1vux commented 2 years ago

Note from renewed Slack monologue.

(1) If Ack had a Paragraph (pgrep) mode, it would be useful for on-label-use Code search (not just off-label-use Text repo search), as it would naturally look at a blank-line separated stanza of the code file (blank line /^\s*$/ delimited). It could be extended to a more semantically useful look at the range of lines that contain a top-level or outermost pair of bracket characters (by default (...) {...}, but should be overridable for e.g. Text::Template files with [...] at top level?) and/or the longest sequence of lines that do NOT contain such.

E.g., find any sub that contains the magic word/pattern, and report sub name and line number via capture and --output.

(2) Note I am not suggesting encoding override of brackets by filetype as DWIM. A user using this needs to know what brackets they want, and it might not be the language's obvious choice. A top-level span of lines with matching (...) {...} seems an adequate default when requesting --paragraph-bracket-mode, and default --paragraph is likely pgrep-compatible white-line delimited stanzas, which goes inside a sub if it's in stanzas, or includes leading comments not set apart with white lines: whatever the programmer marked with blank lines as a visual unit.

(3) Top-level bracket matching is of course fallible with quoted brackets in strings, but que sera: cost-effective best effort is good enough. Anyone needing perfection needs a parser for their language of interest, not a general mixed-language-project search tool.

(And alas the obvious workaround of accepting only brackets with only optional whitespace on one side or the other would be mostly better, but would miss open or close brackets with an inline comment, such as while (@args) # Check Arguments ... } # END Check Arguments in a script, or method foo($argsHash) { # Foo Method ... } # END Foo Method in a class file, which without a full parser for $LANGUAGE is likewise fraught.)
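Points (1) to (3) can be sketched as a best-effort depth counter in Python. Everything here is an assumption for illustration: `top_level_bracket_spans` and `spans_containing` are hypothetical names, the default pairs follow the (...) {...} default suggested above, and the code counts depth without checking that bracket types pair up or that a bracket sits inside a string or comment, which is exactly the fallibility noted in (3).

```python
def top_level_bracket_spans(text, pairs=(("(", ")"), ("{", "}"))):
    """Return (first_line, last_line) ranges covered by outermost brackets.

    Best effort only: brackets in strings and comments are counted too.
    Spans that share a line are merged into one range of lines.
    """
    openers = {o for o, _ in pairs}
    closers = {c for _, c in pairs}
    spans, depth, start = [], 0, None
    for lineno, line in enumerate(text.splitlines(), start=1):
        for ch in line:
            if ch in openers:
                if depth == 0:
                    start = lineno
                depth += 1
            elif ch in closers and depth > 0:
                depth -= 1
                if depth == 0:
                    if spans and start <= spans[-1][1]:
                        spans[-1] = (spans[-1][0], lineno)  # merge overlap
                    else:
                        spans.append((start, lineno))
    return spans

def spans_containing(text, word, pairs=(("(", ")"), ("{", "}"))):
    """Return the spans in which some line contains the literal word."""
    lines = text.splitlines()
    return [(a, b) for a, b in top_level_bracket_spans(text, pairs)
            if any(word in line for line in lines[a - 1:b])]

CODE = """use strict;

sub foo {
    my ($x) = @_;
    if ($x) { return 1 }
    return 0;
}

my $y = 3;

sub bar {
    return foo(2);
}
"""

all_spans = top_level_bracket_spans(CODE)
hit_spans = spans_containing(CODE, "foo(2)")
template_spans = top_level_bracket_spans("a\n[ x\ny ]\nb", pairs=(("[", "]"),))
```

The `pairs` argument is where a user would override the brackets, for instance to [...] for template files, rather than having the tool guess from the file type.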

(4) In Paragraph mode, the difference between ^$ and \A\z, and the mutation of . under the /ms modifiers, becomes useful. But I'm unconvinced any of the Perl 5.22 Unicode variant word breaks are useful for code search even with Paragraph mode (which is necessary but not sufficient for several of them). (Sentence break mode might be useful in code file comments?) (Not finding a \b{wb} within q(don't) doesn't seem generally helpful in code search. A UTF8 source file theoretically could have Unicode whitespace instead of standard %20 \n\l\r, but as long as it's recognized as \s when IO is UTF, I don't see value in supporting funky breaks.)

n1vux commented 1 year ago

Today in Slack, Andy daydreamed

Today’s pipe-dream of ack switches: The --inside-of-a-loop option, as in

ack expensive_function --inside-of-a-loop

I sometimes simulate this by doing something like

ack expensive_function -B5 | ack 'for|foreach' -A5

I guess I could look into this using the --range-start and --range-end but it would still be pretty fuzzy.

to which I replied (after some meandering)

anything in the generalized paragrep category is probably good-enough for end-user-driven heuristic within/near -

it's up to the end-user to know their house-style of indenting and commenting.

Pattern may be ^(?:sub|method|function)\s+ or ^[}].*\K as appropriate or even ^# if non-indented comment denotes beginning of significant block of code.
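A rough Python sketch of this start-pattern heuristic, with hypothetical helper names (`chunks_by_start`, `chunks_containing`) and made-up sample code. Only the ^(?:sub|method|function)\s+ pattern is exercised, since ^[}].*\K relies on Perl's \K, which Python's re module does not support.

```python
import re

def chunks_by_start(text, start_pattern):
    """Split lines into chunks, starting a new chunk at each matching line."""
    start_re = re.compile(start_pattern)
    chunks, current, start_line = [], [], 1
    for lineno, line in enumerate(text.splitlines(), start=1):
        if start_re.search(line) and current:
            chunks.append((start_line, current))
            current, start_line = [], lineno
        current.append(line)
    if current:
        chunks.append((start_line, current))
    return chunks

def chunks_containing(text, start_pattern, needle):
    """Return (start_line, first_line_text) of chunks where needle matches."""
    needle_re = re.compile(needle)
    return [(start_line, lines[0])
            for start_line, lines in chunks_by_start(text, start_pattern)
            if any(needle_re.search(line) for line in lines)]

CODE = """# helpers
sub cheap {
    return 1;
}

sub costly {
    for my $i (1..10) {
        expensive_function($i);
    }
}
"""

hits = chunks_containing(CODE, r"^(?:sub|method|function)\s+", r"expensive_function")
```

The result names the enclosing sub and its starting line, which is the "within/near" answer; whether the call is really inside a loop is still left to the user's knowledge of their house style.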