beyondgrep / ack2

**ack 2 is no longer being maintained. ack 3 is the latest version.**
https://github.com/beyondgrep/ack3/

Trim or ignore long lines #596

Status: Closed (wheany closed this issue 5 years ago)

wheany commented 8 years ago

If ack finds a match in e.g. minified js file, or other files with just one or few very long lines, it will flood the output with the contents of the file and likely push all useful matches off-screen.

I would like an option to either ignore such matches altogether (maybe list the file name), or possibly to only show some number of characters worth of context around the match.

petdance commented 8 years ago

Ack should already be ignoring minified JS files. What JS files is your ack finding?

wheany commented 8 years ago

Minified js was maybe a bad example because they usually have .min.js extension, but there are minified js files that don't have the extension. I think the Vaadin/GWT UI toolkit for one.

Also some tools produce minified html and xml files, which also cause problems with ack.

petdance commented 8 years ago

What happens when you grep these files?

wheany commented 8 years ago

I don't understand the question.

If I grep (or 'ack' or 'ack-grep', doesn't matter) these files and they have a match, I get several screenfuls of text with the matches highlighted somewhere in the mess.

petdance commented 8 years ago

> I don't understand the question.

Not a trick question. Just wanted to know what grep did. For the most part, I try to keep ack and grep behaving the same.

wheany commented 8 years ago

Well, in that case, both grep and ack fill the terminal with useless amounts of text. Depending on where I run the command, that can mean thousands of rows of scrollback, if I'm unlucky enough to have multiple minified files that match.

One possibility would be to let users define characters that act as line breaks depending on the file format. E.g. if you find a match in a .js file with long lines, treat semicolons like linefeeds (for the purposes of the -A, -B and -C switches).

petdance commented 8 years ago

> E.g. if you find a match in a .js file with long lines, treat semicolons like linefeeds (for the purposes of -A -B and -C switches)

That's a level of source awareness that we don't want to get into.

wheany commented 8 years ago

It wouldn't have to be language aware code, it could be an option, just like --type-set or --ignore-dir

This is probably only a problem with languages that can be minified in the first place, and those have to have some other statement separator, so it could be like --statement-separator or --record-separator or something similar.
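As a rough illustration of what such a separator option would do, the effect can be approximated today with a preprocessing step outside ack. This is only a sketch: `grep_statements` is a hypothetical helper name, it relies on `tr` and `grep` rather than on any ack option, and it naively splits on every semicolon, including ones inside string literals.

```shell
# grep_statements PATTERN: read text on stdin, treat each ';' as a line
# break, and print the matching "statements" prefixed with their statement
# number (a stand-in for a line number, not the real line in the file).
grep_statements() {
    tr ';' '\n' | grep -n -E -e "$1"
}

# One long "minified" line goes in; only the matching statement comes out.
printf 'var a=1;callBackend(x);var b=2;\n' | grep_statements 'callBackend'
```

With the sample input, the only line printed is `2:callBackend(x)`, i.e. the second statement on the original line.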

n1vux commented 7 years ago

A co-worker just IM'd me with same complaint about *.js. Since not all minifiers follow .min.js quasi-convention, there's much noise. grep doesn't do any DWIM magic, we do, so expectations are higher of ack.

added: Another coworker suggests that

precontext='(?m:^|;)[^;]{0,20}'; postcontext='[^;]{0,20}(?m:;|$)';

will emulate KWIC (except for the indentation when there are fewer than 20 characters of context), and that

bold='^[[7m'; unbold='^[[0m'

(where ^[ is a literal Escape character) with

ack --output "\$1$bold\$2$unbold\$3" "($precontext)($pattern)($postcontext)"

will even add highlighting to the workaround. Ugly but possible!
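The same keyword-in-context idea can be sketched with `grep -o` standing in for `ack --output`, so that it runs without ack installed. The `kwic` helper name and the 20-character default are assumptions for illustration. Unlike the pattern above, this version does not stop the context at semicolons.

```shell
# kwic PATTERN [WIDTH]: read lines on stdin and print each match with at
# most WIDTH characters of context on either side, instead of the whole
# (possibly huge) line. PATTERN is an extended regular expression.
kwic() {
    pattern=$1
    width=${2:-20}
    grep -oE -e ".{0,${width}}(${pattern}).{0,${width}}"
}

# A single long line, searched with 8 characters of context per side.
printf 'var a=1;var b=2;callBackend(x);var c=3;var d=4;\n' | kwic 'callBackend' 8
```

Because `-o` prints only the matched text, the file name and line number are lost unless `grep` is also given `-H` and `-n`; whether that trade-off is acceptable depends on how the output is used.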

petdance commented 7 years ago

Does grep not have an option to trim lines? I'm not seeing one.

n1vux commented 7 years ago

which grep ? IDK. If gnu grep has one it would be good to be compatible, but we can be better. (added: i don't see one https://www.gnu.org/software/grep/manual/grep.html#Output-Line-Prefix-Control nor in FreeBSD's FreeGrep)

petdance commented 7 years ago

Adding this kind of option gets things pretty ugly what with highlighting the matches etc.

It sounds to me that the solution should lean more towards people excluding the files they already know they want to ignore. Truncating result lines so they don't explode your screen is just saying "Let's work hard to make things more palatable that we don't even want to see anyway."

n1vux commented 7 years ago

> Truncating result lines so they don't explode your screen is just saying "Let's work hard to make things more palatable that we don't even want to see anyway."

I agree they should say --perl or --type=clojure if that's what they mean, which ignores all JS whether we can tell it's .min.js or not.

And if both the minified and full JS are in the tree, they should arrange .ackrc to ignore the minified directories. We detect .min.js if that is in use. Adding ignore .js in .ackrc in dir containing minified only may help sometimes. If we consulted .gitignore it might help DWIM otherwise. Classifying files with average line length() > 1024 as binary might help. But allowing users to say ignore lines > 1024 or 256 or whatever is good too.

If only the minified is available -- e.g. not shipped, or compiled from Clojure -- and they want to see where the JS calls the back end, maybe specifically having asked for --type=js or looking for all mentions of domain-specific word, seeing statements instead of lines would help them with minified files.

( Maybe that's setting $/ aka $INPUT_RECORD_SEPARATOR $RS ? I don't think we support that ... nor can we ? Might require a preprocess filter co-routine to expand and give statement numbers as faux line numbers? That's less ugly and modular but still invasive. )
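One of the ideas in this comment, treating files with a very high average line length as something to skip, can be tried outside ack. A minimal sketch, assuming POSIX `sh` and `awk`; the helper names `avg_line_length` and `looks_minified` are made up for illustration, and 1024 is only the figure mentioned above, not a tested threshold.

```shell
# avg_line_length FILE: print the mean line length of FILE, rounded down.
avg_line_length() {
    awk '{ total += length($0) }
         END { if (NR > 0) printf "%d\n", total / NR; else print 0 }' "$1"
}

# looks_minified FILE [LIMIT]: succeed (exit status 0) when the mean line
# length of FILE is greater than LIMIT. LIMIT defaults to 1024.
looks_minified() {
    [ "$(avg_line_length "$1")" -gt "${2:-1024}" ]
}
```

A check like `looks_minified some/file.js && echo "probably minified"` could then feed a list of candidates for an ignore rule. An average is a blunt instrument: a file with one enormous line and many short ones can still slip under the limit.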

adifinem commented 7 years ago

In some cases it's actually desirable to see results within minified js (assuming it's all that's available) to put back-end code into context of a front-end call, for instance. Having the option to truncate excessively long lines at a specified limit or otherwise provide limited contextual results would be flexible and useful in a number of common use cases, rather than just excluding the files outright.

n1vux commented 7 years ago

I was (am?) a fan of the old KWOC/KWIC formats. (I say 'was' because who really needs a lineprinter corpus index (concordance) in the 21st century? But a context index is still plausibly useful for text searching online.) That --output lets me generate KWOC and nearly-KWIC thrills me. I don't think we need --kwic-.... options. Maybe I can write up a KWOC/KWIC idiom or wrapper for the documentation ...

petdance commented 7 years ago

What are KWOC/KWIC?

n1vux commented 7 years ago

On Tue, Mar 14, 2017 at 12:46 AM, Andy Lester notifications@github.com wrote:

> What are KWOC/KWIC?

https://en.wikipedia.org/wiki/Key_Word_in_Context


petdance commented 7 years ago

Closed and moved to wiki. https://github.com/petdance/ack2/wiki/Feature-requests

kevinlawler commented 7 years ago

Could you reconsider this? It's been an issue in ack for 10+ years. I can't imagine anyone considers scrolling through pages of the following to be desired functionality, and it's a common occurrence in codebases these days.

[Screenshot: screen shot 2017-04-24 at 3 25 41 pm]

The simple way to do it is to add an .ackrc compatible option that consists of a boolean flag and/or a width max limit. You don't have to get fancy with the trimming: put the matches in the center of the buffer when the line exceeds the width. This gives context on both sides and it's OK if the default buffer width results in a few lines of visual output (instead of thousands).

petdance commented 7 years ago

> It's been an issue in ack for 10+ years.

It's been an issue with grep since the beginning of time.

I don't understand what you mean by "put the matches in the center of the buffer when the line exceeds the width."

kevinlawler commented 7 years ago

line: the long matched line in a file
match: the text that is highlighted (substring of line)
buffer: the truncated line storage

The issue with truncating lines is that it isn't clear how to display a partial line as opposed to a full line. By making the buffer at least as large as the match, then you can find the middle of the buffer by dividing the buffer length by two (and the middle of the match by dividing its length by two), and then you can put the middle of the match in the middle of the buffer. This gives equal context on either side of the match. This is a simple and good enough way to do it, though it is not the only way.

(When the line beginning or line end would be present in the buffer, then left or right align instead, in turn.)
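The centering rule described above can be sketched as a small awk filter. This is an illustration under stated assumptions, not how ack does or would do it: `center_match` is a hypothetical helper, only the first match on each line is considered, and the `...` markers are added outside the buffer, so the printed text can be up to six characters wider than the requested width.

```shell
# center_match WIDTH PATTERN: read lines on stdin. Lines no longer than
# WIDTH, or without a match, pass through unchanged. Longer lines are cut
# to a WIDTH-character buffer with the first match centered in it.
# PATTERN is an awk regular expression; because it is passed with -v,
# awk processes backslash escapes in it.
center_match() {
    awk -v width="$1" -v pat="$2" '
    {
        len = length($0)
        if (len <= width || !match($0, pat)) { print; next }
        # The buffer must be at least as large as the match itself.
        w = (RLENGTH > width) ? RLENGTH : width
        mid = RSTART + int(RLENGTH / 2)     # middle of the match (1-based)
        start = mid - int(w / 2)            # align it with the buffer middle
        if (start < 1) start = 1            # line start is visible: left-align
        if (start + w - 1 > len) start = len - w + 1   # line end: right-align
        out = substr($0, start, w)
        if (start > 1) out = "..." out              # text elided on the left
        if (start + w - 1 < len) out = out "..."    # text elided on the right
        print out
    }'
}
```

For example, `printf '0123456789MATCH0123456789\n' | center_match 11 MATCH` prints `...789MATCH012...`: three characters of context on each side of the five-character match, with dots marking both elided ends.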

petdance commented 7 years ago

Let's not get bogged down in the details of how it would be implemented internally, and keep it to the user interface.

It sounds like you're suggesting that in the case of overrunning --maxwidth that ack print out some portion of the line that has the match on it, right? Something like this?

47: ...  stuff that is from the middle of the line **MATCHED TEXT** more but not to the end...

How do we handle multiple matches per line? What if you acked on a comma and there are 1000 matches on the line in your minified JavaScript?

How do we handle lines that are longer than --maxwidth that show up in the context lines when using -A, -B and -C?

I have ideas for output that I don't want to put out here yet, but I don't see a way to handle the two scenarios above and still display matches.

kevinlawler commented 7 years ago

> It sounds like you're suggesting that in the case of overrunning --maxwidth that ack print out some portion of the line that has the match on it, right? Something like this?

Yes

I think the way to think about an option like --maxwidth is as a UI nicety instead of as something that plays nicely with rigorous output parsing scripts. The way I (and I presume from the Google hits, a lot of other people) use ack is as a nicer grep: I want to know what files trigger, primarily, and then secondarily it's nice to see what line numbers and what context, but ultimately, if it looks good, I'm going to open that file in my editor and jump to the matched keyword.

So for

> How do we handle multiple matches per line? What if acked on a comma and there are 1000 matches on the line in your minified javascript?

my answer would be: we could highlight/include only those matches that fit in the buffer starting with the first match. Now the objection to this is that you drop some valid matches, but for the stated use case, this is OK---it's only not OK if you're doing some kind of piped scripting or something.

One way to signal this to the user is to make the name obviously not script-friendly, e.g. --pretty-maxwidth or somesuch. Another way to make it play nicely with scripts is, possibly, to detect when you're outputting to a terminal and only do it then, say --terminal-output-maxwidth, and this functionality is, I think, already built-in for the coloring.

For -A, -B, -C I think the answer is to truncate the other lines as well, and that it would be fine to left-align and right-truncate, though I'm not as experienced with these options.
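The "only when writing to a terminal" behaviour can be approximated in a wrapper with the shell's `-t` test. A minimal sketch, assuming POSIX `sh`; `truncate_if_tty` is a hypothetical name, and the plain left-aligned `cut` matches the right-truncation suggested for context lines rather than any match-centering logic.

```shell
# truncate_if_tty WIDTH: filter stdin to stdout. When stdout is a terminal,
# cut every line to its first WIDTH characters; when stdout is a pipe or a
# file, pass the lines through untouched so that scripts see full output.
truncate_if_tty() {
    if [ -t 1 ]; then
        cut -c "1-$1"
    else
        cat
    fi
}
```

Run interactively, `some-search-command | truncate_if_tty 200` would shorten long lines, while the same pipeline redirected to a file or piped into another command would leave them intact.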

kevinlawler commented 7 years ago

> For -A, -B, -C I think the answer is to truncate the other lines as well, and that it would be fine to left-align and right-truncate, though I'm not as experienced with these options.

Realistically, the default buffer widths that people are going to use will be orders of magnitude smaller than the largest untruncated line, meaning that if you're displaying untruncated lines there, it's not like the output is going to be readable anyway.

petdance commented 7 years ago

Is it really meaningful to show the match on the line? Vs. just saying "There are 14 matches in this 47,320 character line", for example?
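To make the summary alternative concrete, here is a sketch of a filter that prints short matching lines in full and replaces long ones with a match count. It is not ack behaviour. `summarize_long` is a hypothetical helper, the message wording is invented, and the count is of non-overlapping matches as awk's `gsub` sees them.

```shell
# summarize_long MAX PATTERN: read lines on stdin. Matching lines of at
# most MAX characters are printed as "N:line". Longer matching lines are
# replaced by a one-line summary giving the match count and line length.
summarize_long() {
    awk -v max="$1" -v pat="$2" '
    {
        len = length($0)
        if (len <= max) {
            if ($0 ~ pat) print NR ":" $0
            next
        }
        line = $0
        n = gsub(pat, "&", line)    # gsub returns the number of matches
        if (n > 0)
            printf "%d:[%d matches in this %d character line]\n", NR, n, len
    }'
}
```

The summary answers "which file and line", but not "is this the kind of match I care about", which is the gap the replies below point at.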

kevinlawler commented 7 years ago

For me, yes. I use the context to determine whether it was a "desirable" match or not. The group of people I've known over the years that use ack is fairly large and all developers, and the typical use case is "someone mentioned such and such string, or it came up somehow in my work, and now I need to know where all instances of 'isInitialBlankNavigation' occur in the codebase." I'm not going to be interested in docs (say), or matches where it's a substring of a longer word I'm not interested in, etc., and that filtering happens visually in the terminal.

petdance commented 7 years ago

So what does the sample output look like? How do we denote it's a partial line?

petdance commented 7 years ago

If this is our normal output:

t/illegal-regex.t
33-
34:    return subtest "test_ack_with( $testcase: @args )" => sub {
35-        my ( $stdout, $stderr ) = run_ack_with_stderr( @args );

maybe the partials look like

t/illegal-regex.t
33-
34*  ... whatever this that other subtest this other thing that goes very ....
35-        my ( $stdout, $stderr ) = run_ack_with_stderr( @args );

With actual ... at the front and the end, and a * instead of : as the divider against the numbers.

n1vux commented 7 years ago

Kevin,

I'm planning to include in Ack3's cookbook section hints on displaying selective context and may even get Andy to include features to do better at it too.

Workarounds for context in Ack 2 for KWIC/KWOC keyword indexes (with short input lines):

- ack2 --output can sorta do KWOC/KWIC with evil before/after vars
  - --output '$&^I$'"'"'^I|| $`' # *KWOC*
  - --output '$`^I$&^I$'"'" # *pseudo KWIC*
  - but they’re nasty from Shell since they mix quote and dollar
  - and tabs don’t truly line up if width variation exceeds a tab width

For your purpose, monsterline uglified JS/html/etc, I'll make a long line version of ack-standalone and ack for perl 'use' statements.

perl -pE 's/\n$//' ack-standalone | ack2 --output '$1 $2 $3' '(.{0,20})(\buse \w+(?:::\w+)*[^;]{0,40};?)(.{0,20})' | less

( Those are ^V^I tabs )

Note that it steps over 'use warnings;' when it immediately follows 'use strict', which may be OK because it still finds each cluster. But you can get each one this way:

perl -pE 's/\n$//' ack-standalone | ack2 --output '$1 $2 $3' '(.{0,20})(\buse \w+(?:::\w+)*[^;]{0,40};?)((?=\buse)|.{0,20})' | less

Bill

On Mon, Apr 24, 2017 at 9:24 PM, Kevin Lawler notifications@github.com wrote:

> For me, yes. I use the context to determine whether it was a "desirable" match or not. The group of people I've known over the years that use ack is fairly large and all developers, and the typical use case is "someone mentioned such and such string, or it came up somehow in my work, and now I need to know where all instances of 'isInitialBlankNavigation' occur in the codebase."


kevinlawler commented 7 years ago

Use three dots on any side that's elided. Potentially color the dots. (Putting these "outside" is fine or you can put them inside and do the more complicated string math.) If you really want to get fancy you can put the number of dropped chars in brackets outside any elided side.

On Apr 24, 2017, at 6:46 PM, Andy Lester notifications@github.com wrote:

> So what does the sample output look like? How do we denote it's a partial line?


digeomel commented 5 years ago

More than 2 years later, what's the status on this? I'm also having the same issue.

petdance commented 5 years ago

No more work is being done on ack2, but a related request came in the other day on ack3, and I think it might be helpful here if it were implemented.

I welcome input on that ticket: https://github.com/beyondgrep/ack3/issues/234

stephenostermiller commented 1 year ago

I too would like some sort of feature to deal with suppressing or truncating matches from long lines. My workaround is to filter the output of ack using grep to remove results that are longer than 300 characters.

ack my-search-string | grep -vE '.{300,}'

but because using ack with a pipe turns off color by default, I usually turn that back on with a flag:

ack --color my-search-string | grep -vE '.{300,}'

It would be nice to be able to put something in my .ackrc to ignore or truncate long lines by default that I could then override on the command line if I needed to.
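A variant of this workaround that truncates long results instead of dropping them could look like the sketch below. `cap_lines` is a hypothetical helper, not an ack feature, and 300 is simply the limit used above.

```shell
# cap_lines [MAX]: read lines on stdin and cut any line longer than MAX
# characters (default 300) down to its first MAX characters, followed by
# "..." to show that text was removed. Shorter lines pass through as-is.
cap_lines() {
    awk -v max="${1:-300}" '
        length($0) > max { print substr($0, 1, max) "..."; next }
        { print }'
}
```

It would be used as `ack my-search-string | cap_lines 300`. Since ack prints the line number (and file name, depending on options) at the start of each result line, those survive the cut. With `--color`, the escape sequences count toward the length and a cut can land before the code that resets the color, so this sketch is safer on uncolored output.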

petdance commented 1 year ago

@stephenostermiller Please go comment on the current ticket at https://github.com/beyondgrep/ack3/issues/325