Open n1vux opened 5 years ago
(Slack discussion recorded in #171 and #172 is 2018-11-29 https://beyondgrep.slack.com/archives/C4J886HT2/p1543503390000600 )
:crystal_ball: I am curious how such an issue will evolve further.
This would require significant internal changes to Ack's underpinnings so I'm not terribly hopeful, but at least Andy put the (feature) flag on instead of closing it as out-of-scope NLP (possibly convinced because of nice whitespace praxis wrapping code on multi lines).
(There are optimization techniques to avoid doing an `if...else...` in the innermost loop, so that is not an argument against doing this. Adding the branch in the innermost read loop would be the naïve prototype, which would then expose the downstream issues, e.g. what to use for line numbers when reporting: do we count paras, or do we need to sum the NLs in each para for compatibility (with what? grep or paragrep?). Or does `-000` require #311 and suppress line numbers?)
(I'm wishing I'd implemented this using the never-used, removed input-filters feature of Ack2, where it would have been more natural.)
FWIW I commented on #333 that I'm now using `swish-e` as a lightweight indexed Text Retrieval search to select (NLP) documents with words near each other irrespective of sentence, paragraph, or newline, for further mashing with `ack -x` or just `less`; it's more text than code, but its demo is on its own source code, so it's usable on variables and keywords as tokens.
Note from renewed Slack monologue.
(1) If Ack had a Paragraph (pgrep) mode, it would be useful for on-label-use Code search (not just off-label-use Text repo search), as it would naturally look at a blank-line separated stanza of the code file (blank line `/^\s*$/` delimited), and could be extended to a more semantically useful look at the range of lines that either contain a top level or outermost pair of bracket characters (by default `(...)` `{...}` but should be overridable for e.g. `Text::Template` files with `[...]` at top level?) and/or the longest sequence of lines that do NOT contain such.
(E.g., find any `sub` that contains the magic word/pattern, and report sub name and line number via capture and `--output`.)
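The blank-line stanza idea in (1), together with the earlier line-number question, can be sketched roughly as follows. This is an illustration in Python, not ack's Perl internals; the names `paragraphs` and `paragraph_grep` are hypothetical. It assumes stanzas are delimited by lines matching `/^\s*$/` and that a hit is reported by summing the newlines before the match within its paragraph.

```python
import re

def paragraphs(lines):
    """Yield (start_line, text) for each run of non-blank lines (1-based line numbers)."""
    blank = re.compile(r"^\s*$")
    start, buf = None, []
    for lineno, line in enumerate(lines, start=1):
        if blank.match(line):
            if buf:
                yield start, "".join(buf)
                start, buf = None, []
        else:
            if start is None:
                start = lineno
            buf.append(line)
    if buf:
        yield start, "".join(buf)

def paragraph_grep(lines, pattern):
    """Return (match_line, paragraph_start_line, paragraph_text) per matching paragraph."""
    rx = re.compile(pattern, re.M | re.S)
    hits = []
    for start, text in paragraphs(lines):
        m = rx.search(text)
        if m:
            # Count newlines before the match to recover the file line number.
            match_line = start + text.count("\n", 0, m.start())
            hits.append((match_line, start, text))
    return hits

SAMPLE = """sub alpha {
    return 1;
}

# helper for beta
sub beta {
    my $x = expensive_function();
    return $x;
}
""".splitlines(keepends=True)

for match_line, para_start, _ in paragraph_grep(SAMPLE, r"expensive_function"):
    print(f"match at line {match_line}, paragraph starts at line {para_start}")
```

Keeping the paragraph's starting line alongside the text lets either convention (count paragraphs, or report real line numbers) be chosen at output time.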
(2) Note I am not suggesting encoding override of brackets by filetype as DWIM.
A user using this needs to know what brackets they want, and it might not be the language's obvious choice.
Top level span of lines with matching `(...)` `{...}` seems an adequate default when requesting `--paragraph-bracket-mode` doing brackets, and default `--paragraph` likely is pgrep-compatible white-line delimited stanzas, which goes inside a sub if it's in stanzas, or includes leading comments not set apart with whitelines, whatever the programmer marked with blank lines as a visual unit.
(3) Top level Bracket matching is of course fallible with quoted brackets in strings, but que sera: cost-effective best effort is good enough; anyone needing perfection needs a parser for their language of interest, not a general mixed-language-project search tool.
(And alas the obvious workaround of accepting only brackets with only optional whitespace on one side or the other would be mostly better, but would miss open or close brackets with an inline comment, `while (@args) # Check Arguments` ... `} # END Check Arguments` in a script or `method foo($argsHash) { # Foo Method` ... `} # END Foo Method` in a class file, which without a full parser for $LANGUAGE is likewise fraught.)
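A best-effort version of the top-level bracket span from (1) through (3) might look like the sketch below. It is Python for illustration only, and `top_level_spans` and its `pairs` parameter are hypothetical names. It assumes a plain depth counter: it does not check that an opener and closer are of the same kind, and it knows nothing about strings or comments, so it has exactly the fallibility described above.

```python
def top_level_spans(lines, pairs=(("(", ")"), ("{", "}"))):
    """Return (first_line, last_line) for each outermost bracketed span, 1-based inclusive."""
    openers = {o for o, _ in pairs}
    closers = {c for _, c in pairs}
    depth, start, spans = 0, None, []
    for lineno, line in enumerate(lines, start=1):
        for ch in line:
            if ch in openers:
                if depth == 0:
                    start = lineno
                depth += 1
            elif ch in closers and depth > 0:
                depth -= 1
                if depth == 0:
                    # Merge with the previous span when both touch the same line,
                    # e.g. "sub foo($x) {" closes "(...)" and opens "{...}" on one line.
                    if spans and spans[-1][1] >= start:
                        spans[-1] = (spans[-1][0], lineno)
                    else:
                        spans.append((start, lineno))
    return spans

CODE = [
    "use strict;",
    "",
    "sub foo {",
    "    my ($x) = @_;",
    "    return { a => $x };",
    "} # END foo",
    "",
    "print foo(1);",
]

print(top_level_spans(CODE))
# The user chooses the brackets; nothing is inferred from the filetype.
print(top_level_spans(["[% a", "b %]", "plain"], pairs=(("[", "]"),)))
```

Because it counts characters rather than whole lines, a trailing comment after a bracket (`} # END foo`) does not confuse it, but a line such as `my $s = "{";` would leave the depth permanently off by one.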
(4) In Paragraph mode, the difference between `^` `$` and `\A` `\z`, and the mutation of `.` under the `/ms` modifiers, becomes useful. But I'm unconvinced any of the perl 5.22 Unicode variant wordbreaks are useful for code search even with Paragraph mode (which is necessary but not sufficient for several of them).
(Sentence break mode might be useful in code file comments?)
(Not finding a `\b{wb}` within `q(don't)` doesn't seem generally helpful in code search. A UTF8 source file theoretically could have Unicode whitespace instead of standard %20 \n\l\r but as long as it's recognized as `\s` when io is utf, I don't see value in supporting funky breaks.)
Today in Slack, Andy daydreamed
Today’s pipe-dream of ack switches: The `--inside-of-a-loop` option, as in `ack expensive_function --inside-of-a-loop`
I sometimes simulate this by doing something like `ack expensive_function -B5 | ack 'for|foreach' -A5`
I guess I could look into this using the `--range-start` and `--range-end` but it would still be pretty fuzzy.
to which I replied (after some meandering):
anything in the generalized paragrep category is probably good-enough for end-user-driven heuristic within/near:
- `pattern1 --within=n pattern2`
- existing `--range-start PATTERN --range-end PATTERN`
- `-000` equiv (is that `--range-start '^$' --range-end '^$'`? If so add that to Cookbook! Or will that skip every other para?)
- `--paragraph pattern` (is this just syntactic sugar to `--range-start/end` set same? ditto.)
it's up to the end-user to know their house-style of indenting and commenting.
Pattern may be `^(?:sub|method|function)\s+` or `^[}].*\K` or even `^#`, as appropriate, if a non-indented comment denotes the beginning of a significant block of code.
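The range-based heuristic discussed above can be sketched as a small filter. This is Python for illustration, `matches_in_ranges` is a hypothetical name, and it is not a description of how ack implements `--range-start` / `--range-end`. It assumes that a line which ends a range is not also tried as the start of the next one; under that assumption, using `^$` for both patterns does skip every other paragraph, which is the concern raised in the list. Whether ack itself behaves this way is not established here.

```python
import re

def matches_in_ranges(lines, pattern, range_start, range_end):
    """Return (lineno, line) for lines matching pattern inside a start..end range."""
    want = re.compile(pattern)
    start_rx = re.compile(range_start)
    end_rx = re.compile(range_end)
    inside, hits = False, []
    for lineno, line in enumerate(lines, start=1):
        if not inside:
            if start_rx.search(line):
                inside = True
                if want.search(line):
                    hits.append((lineno, line.rstrip("\n")))
            # The starting line is never tested against the end pattern.
            continue
        if want.search(line):
            hits.append((lineno, line.rstrip("\n")))
        # Assumption: the end line closes the range and is not reused as a start.
        if end_rx.search(line):
            inside = False
    return hits

CODE = [
    "setup();",
    "foreach my $item (@items) {",
    "    expensive_function($item);",
    "}",
    "expensive_function($last);",
]

# Rough stand-in for the --inside-of-a-loop daydream: only the call inside the loop is reported.
print(matches_in_ranges(CODE, r"expensive_function", r"^\s*(?:for|foreach|while)\b", r"^\}"))

# Same pattern for start and end: under the stated assumption, paragraph "b1" is skipped.
PARAS = ["", "a1", "", "b1", "", "c1", ""]
print(matches_in_ranges(PARAS, r"\d", r"^$", r"^$"))
```

As in the discussion, the start and end patterns are the user's responsibility and depend on the house style of indenting and commenting.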
As noted in #99, `ag` (The Silver Searcher) and Perl both support (respectively) `--multiline` or `-000` paragraph mode, as of course does the eponymous one-trick-pony `paragrep`.
While this is most obviously useful for data and NLP use cases (officially unsupported), if coders are using the recommended vertical whitespace for paragraphing their code, `-000` is directly applicable to code, and if the paragraph end sequence is specifiable as an optional arg on the option, coders can specify `/^[}]/` or `/^\w*$/` or whatever is end of a block or sub in their corpus of code (possibly `/^\t[}]/` if they like that extra tab at end of sub bodies), which would approximate the suggested `--same-subroutine` feature.
(An arg specifying end-of-para pattern would also support filtering multiline structured data (like EDI-INT tagged-data multiline hierarchical record streams/files).)
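A user-specified end-of-paragraph pattern could be sketched like this. It is Python for illustration, and `chunks_by_end` is a hypothetical name. It assumes a chunk ends with the first line matching the end pattern and that the terminating line belongs to the chunk it closes.

```python
import re

def chunks_by_end(lines, end_pattern):
    """Yield (start_line, text); each chunk ends with a line matching end_pattern."""
    end_rx = re.compile(end_pattern)
    start, buf = None, []
    for lineno, line in enumerate(lines, start=1):
        if start is None:
            start = lineno
        buf.append(line)
        if end_rx.search(line):
            yield start, "".join(buf)
            start, buf = None, []
    if buf:
        # Trailing lines with no terminator still form a final chunk.
        yield start, "".join(buf)

SRC = """sub one {
    log_it();
}
sub two {
    risky();
    log_it();
}
""".splitlines(keepends=True)

# Approximating "same subroutine": both words must occur in one chunk ending at /^[}]/.
for start, text in chunks_by_end(SRC, r"^[}]"):
    if "risky" in text and "log_it" in text:
        print(f"both found in chunk starting at line {start}")
```

The same splitter would apply to record-oriented data if the end pattern is set to whatever line closes a record in that corpus.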
(As a possible enhancement, we might later allow specifying an end-paragraph pattern for each `--type` in `.ackrc`?)
Paragraph mode would interact with `--and` by providing an additional, medial mode of sameness beyond the obvious smallest and largest extents, same-line and same-file; without requiring an additional option flag, as `--paragraph-mode pattern --and pattern` would naturally switch same-line to same-paragraph by switching the primary object from a line-buffer to a paragraph-buffer.
Paragraph mode would require that match patterns have the `(?sm)` or `//sm` flags activated, so that `.` matches internal newlines, `^` and `$` match next to internal newlines too, and `\A` and `\z` match the beginning and end of the whole paragraph/chunk. (Just as PBP says to always do.) (Which means `foo$.*^bar` is not meaningless in `--paragraph` mode, same as it becomes meaningful in Perl with `qr{}sm` and likely `-000`.)
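The effect of the `sm` flags on a paragraph buffer, and the same-paragraph reading of `--and`, can be checked with a small experiment. This uses Python's `re` for illustration rather than Perl: `re.M` and `re.S` correspond to Perl's `/m` and `/s`, and `\Z` in Python plays the role of Perl's `\z`. The helper `paragraph_and` is a hypothetical name.

```python
import re

PARA = "my $x = foo\nbar($x);\n"

# Without flags, $ cannot match at the internal newline and . cannot cross it.
plain = re.search(r"foo$.*^bar", PARA)
# With /m and /s, $ and ^ match around the internal newline and . matches it.
flagged = re.search(r"foo$.*^bar", PARA, re.M | re.S)
print(plain is None, flagged is not None)

# ^ anchors at every line start under /m, while \A only anchors at the start of the chunk.
print(re.search(r"^bar", PARA, re.M) is not None)
print(re.search(r"\Abar", PARA, re.M) is None)

def paragraph_and(text, *patterns):
    """True when every pattern matches somewhere in the same paragraph buffer."""
    return all(re.search(p, text, re.M | re.S) for p in patterns)

# Same-line AND fails here, same-paragraph AND succeeds.
same_line = any(
    all(re.search(p, line) for p in (r"foo", r"^bar"))
    for line in PARA.splitlines()
)
print(same_line, paragraph_and(PARA, r"foo", r"^bar"))
```

The only change between the two `--and` readings is the unit handed to the matcher, a line or a paragraph, which is the switch of primary object described above.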