> Currently ack searches files in the order it finds them.
Unless it searches them in alphabetical order because you have `--sort` specified.
Right. I would not suggest changing the behaviour if `--sort` is given. But without that flag, ack is at liberty to search the files in any order it wishes.
I was thinking about this the other day in a repo with a mixture of small, interesting files of up to 20 lines and a sprinkling of huge files (thousands of lines) which I have no interest in searching with ack.
The solution that sprang to mind was a `--max-lines` option, but I guess a file's line count is impossible to know until you actually get into the file and have already returned results.
A `--max-size` option which accepts values like `100k`, `4M`, `1.5GB` would also be useful (and maybe a message at the end along the lines of 'skipped N files which were too large').
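As a sketch, accepting those values just means converting a human-readable suffix into a byte count. The `parse_size` helper below is hypothetical, not anything in ack:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn a human-readable size such as "100k",
# "4M" or "1.5GB" into a byte count.
sub parse_size {
    my ($spec) = @_;
    my %mult = ( '' => 1, k => 1024, m => 1024**2, g => 1024**3 );
    if ( $spec =~ /^(\d+(?:\.\d+)?)\s*([kmg]?)b?$/i ) {
        return int( $1 * $mult{ lc $2 } );
    }
    die "Unrecognised size spec: $spec\n";
}

print parse_size($_), "\n" for qw( 100k 4M 1.5GB );
# 102400, 4194304, 1610612736
```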
The advantage of doing it as an option is that it could go into ackrc files and therefore be contextual to the user's work, and also not rely on.
The downside of scanning huge files last is that if you do find results there, they might be the only ones you see when you come back to your terminal a minute later, unless you piped ack to less or something.
Seems to me that if there are huge files, they should be excluded via a project-specific .ackrc.
In general I'm going to be wary of adding functionality like this if it means slowing down the normal non-huge use cases that are ack's bread and butter.
That's not always practical. I have a repo of SQL patches which gets added to weekly; some are big (data changes), some are small (schema changes), the only difference is the file size - 95% of the time, I'm only interested in the small ones (and very occasionally I do want to go through the big ones too). Keeping an up-to-date ackrc in that case is no more practical than always having an ackrc of files ack doesn't recognise. It's much easier for ack to reliably do this at run time than it is for me to maintain a list of large files.
All a `--max-size` option means is doing `-s` on a file (if at all), so hopefully very little impact - I've an initial implementation at https://github.com/pdl/ack2/tree/max-size - though I need to optimise and write some tests that actually verify the code works...
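For illustration, the whole check could look something like the sketch below; `$max_size` and the reporting at the end are assumptions, not code from the branch above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# -s returns a file's size in bytes, so enforcing a maximum size
# is a single file test before searching. $max_size is an assumed
# setting, e.g. from --max-size=1M.
my $max_size = 1024 * 1024;
my $skipped  = 0;

for my $file (@ARGV) {
    my $size = -s $file;
    if ( defined $size && $size > $max_size ) {
        $skipped++;
        next;
    }
    # ... search $file as normal ...
}

warn "Skipped $skipped files which were too large\n" if $skipped;
```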
Remember that whatever it is has to work on Windows as well and needs to not slow things down.
I echo what pdl said about large files. Typically I will have a working directory containing source code and some ad-hoc log files or data I'm working with. These will have filenames like `log` or `out` or `ggggg`, so are not something that can be set to ignore in ackrc. Sometimes they are CSV files and end in `.csv` - but then, most of the time I do want CSV files to be included in searching.
My vision for this feature is that it is unobtrusive, cross-platform, and doesn't get in the way of searching non-huge files, which as Andy L. says is the normal ack usage. It is orthogonal to a `--max-size` option or any ackrc settings. The speed impact would be an extra stat() on each file, or perhaps even that could be avoided with some cleverness. Let me see if I can work up a patch.
@epa We already perform at least one `stat()` per file via `get_file_id`; if you can figure out how to use that (perhaps via `stat _`, which is a special form for retrieving the values from the previous `stat()` call), you can probably make your patch without impacting performance much!
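A standalone illustration of the `stat _` mechanism (not ack's actual `get_file_id` code): after any `stat()` or file test, the special filehandle `_` holds the cached results, so further checks cost no extra system call.

```perl
#!/usr/bin/perl
use strict;
use warnings;

for my $file (@ARGV) {
    stat $file or next;          # the one real stat() call

    my $size  = -s _;            # reuses the cached stat buffer
    my $mtime = ( stat _ )[9];   # likewise, no second syscall

    printf "%s: %d bytes, mtime %d\n", $file, $size, $mtime;
}
```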
Thanks, I hoped that would be the case ;-)
Perhaps a different topic... I was thinking about an option like `--max-lines` which would cause ack to stop searching a file after x lines. This is handy for searching a long list of large HTML files when all you want to see is the metadata for each file.
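A minimal sketch of that idea, assuming a hypothetical `$max_lines` setting and a plain regex match rather than ack's real matching:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $max_lines = 50;              # assumed --max-lines value
my $pattern   = qr/<meta\b/i;    # e.g. hunting for HTML metadata

for my $file (@ARGV) {
    open my $fh, '<', $file or next;
    while ( my $line = <$fh> ) {
        last if $. > $max_lines;    # $. is the current line number
        print "$file:$.:$line" if $line =~ $pattern;
    }
    close $fh;
}
```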
I'm closing this because it is getting into an area where ack is getting too smart for its own good. The max/min filesize is already in another ticket. The `--max-lines` idea has been saved to the wiki.
Currently ack searches files in the order it finds them. But you may have a working directory containing your code, plus some huge log file or input file or whatever. Searching could spend a long time looking through that file before it gets to the smaller and quicker files.
It would be better to tweak the order so that big files (more than a megabyte, say) are searched last. So ack would first of all look in all the small files, and then go on to check the big ones at the end, doing the very biggest last of all. This would increase the chance of being able to see the match you were looking for and hit Ctrl-C. Of course, all the same matches would still be found as before.
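As a purely illustrative sketch (not a patch against ack's internals), a smallest-first ordering could be built with one size lookup per file:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sort the candidate list by size so the biggest files come last.
# A Schwartzian transform keeps it to one -s (stat) per file.
my @by_size =
    map  { $_->[0] }
    sort { $a->[1] <=> $b->[1] }
    map  { [ $_, -s $_ // 0 ] } @ARGV;

print "$_\n" for @by_size;    # smallest first, very biggest last
```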
A further enhancement would be to print a message 'scanning large file X' to the console if X is bigger than ten megabytes and ack has spent more than ten seconds reading it so far. Then the user has an opportunity to interrupt and perhaps delete or move that file before rerunning ack.