Genivia / ugrep

NEW ugrep 7.0: a more powerful, ultra fast, user-friendly, compatible grep. Includes a TUI, Google-like Boolean search with AND/OR/NOT, fuzzy search, hexdumps, searches (nested) archives (zip, 7z, tar, pax, cpio), compressed files (gz, Z, bz2, lzma, xz, lz4, zstd, brotli), pdfs, docs, and more
https://ugrep.com
BSD 3-Clause "New" or "Revised" License

[FR] TUI split screen: suggestions for a possible new feature? #254

Closed genivia-inc closed 1 year ago

genivia-inc commented 1 year ago

The ability to split the screen in the query TUI to produce two views is interesting and has been on my mind for some time. This would add a second view to the TUI in a split screen to show the "current file", with pattern matches highlighted. The "current file" is the file listed in query search results. Perhaps what I am saying here makes more sense when navigating a list of query search results, like the new directory tree view:

image

When moving down to a file in the list, the split screen should show the results for this file specifically. For example, showing reflex/convert.h with pattern matches, line numbers and some context of 3 lines:

image

Perhaps the context should be the entire file, not just three lines before and after? Three lines seems somewhat arbitrary and limiting.

Should there be a way to scroll the split screen up and down to view the file? If so, what keys to press?

The split screen should be enabled/disabled with CTRL-T in the TUI (CTRL-T currently enables/disables colors). Note that CTRL-Y shows the current file with the more utility. CTRL-Y is programmable with option --view, so we will keep this TUI control.

I am not sure how much of the file the split screen should show, besides the line(s) with the pattern match. Perhaps it should show the entire file, or only the top of it if the file is too long? If the file is long, the pattern matches may not show up at all when they are further down in the file. And would the split screen be better on top, bottom or to the right of the screen? Perhaps toggle this with CTRL-T? What else comes to mind to implement or to avoid?

Suggestions are very welcome!

GwynethLlewelyn commented 1 year ago

@genivia-inc you should activate Discussions and discuss it there as opposed to doing the discussion inside an issue :)

For myself, I can only comment that this looks awesome in all regards (kind of imagining Emacs doing a regexp search!).

While for my own purposes, one line before and one line after would be more than enough, I can imagine some scenarios where more might be useful, for instance, when one knows in advance that whatever they're searching for will be in the context of, say, two or three different functions, where there is a common part — the one you're searching for — but the context itself is quite different, and only obvious if you 'scroll back', so to speak, some 5 or 7 lines (or more).

But because these would be edge cases (at least, that's what I think), I wonder if you couldn't simply leave that decision to the user. Start with the 'three-line default' but give the user command-line options to add more lines before and after, e.g. --tui-lines-before=5 and --tui-lines-after=1.

As for the choice of keys to press... well, in that regard, I'm biased, I like familiar keys :-) and that means somehow using GNU readline(). Granted, readline() is meant to deal with inputting a single line, but it can deal with sequences as well, including arrow keys.

Being GNU, of course, the keys all have an Emacs flavour :) I have no idea if there are popular vi-like libraries available (there probably are), but the idea is to address the following points:

On the other hand, it's been a long, long, long time since I last used readline() (or written anything from scratch in C/C++, for that matter), so I hardly know any details about speed, performance, or current trends/programming styles for 2023 😹

BTW, if you go for the vi-style, back/up should be the lowercase b, and next/down would be SPACE. It's really just a question of taste, really, but it's much better to stick to a style that is already being used everywhere than to come up with a new set of keys...

(Oh, how much I'd love that the nano/pico developers had followed that design...)

genivia-inc commented 1 year ago

> @genivia-inc you should activate Discussions and discuss it there as opposed to doing the discussion inside an issue :)

Yes. I agree. Discord perhaps for discussions? Or just the GitHub discussions.

> While for my own purposes, one line before and one line after would be more than enough, [...] But because these would be edge cases (at least, that's what I think), I wonder if you couldn't simply leave that decision to the user. Start with the 'three-line default' but give the user command-line options to add more lines before and after, e.g. --tui-lines-before=5 and --tui-lines-after=1.

Yeah, that is a great idea. By default the number of lines of context could be 3.

(There is actually an ALT key binding already to increase and decrease context with ALT-] and ALT-[. If a split window is used, this could be one line of context to start and then the user can increase/decrease context to view.)
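To make the context discussion concrete, here is a minimal sketch of how a preview could pick which lines to show for a given number of context lines before and after each match. It is not ugrep's code; the function name `context_ranges` and its parameters are made up for illustration. Overlapping or adjacent windows are merged so that no line is shown twice.

```python
import re

def context_ranges(lines, pattern, before=3, after=3):
    """Return merged (start, end) index ranges, end exclusive,
    covering every matching line plus its context lines."""
    rx = re.compile(pattern)
    ranges = []
    for i, line in enumerate(lines):
        if rx.search(line):
            start = max(0, i - before)
            end = min(len(lines), i + after + 1)
            if ranges and start <= ranges[-1][1]:
                # Window overlaps or touches the previous one: extend it.
                ranges[-1] = (ranges[-1][0], max(ranges[-1][1], end))
            else:
                ranges.append((start, end))
    return ranges

lines = [f"line {n}" for n in range(1, 21)]
lines[4] = "needle here"      # 5th line
lines[6] = "another needle"   # 7th line
lines[17] = "last needle"     # 18th line
print(context_ranges(lines, r"needle", before=1, after=1))
```

With one line of context the first two matches collapse into a single window, which is roughly what increasing or decreasing the context interactively would recompute on each key press.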

> As for the choice of keys to press... well, in that regard, I'm biased, I like familiar keys :-) and that means somehow using GNU readline(). Granted, readline() is meant to deal with inputting a single line, but it can deal with sequences as well, including arrow keys.

> Being GNU, of course, the keys all have an Emacs flavour :) I have no idea if there are popular vi-like libraries available (there probably are), but the idea is to address the following points: [...]

Good points! Thank you for sharing suggestions and food for thought to move forward!

jftuga commented 1 year ago

I think GitHub Discussions would be best. Having everything centrally located would provide a better user experience and it would also be more discoverable via Google. Most everyone who would want to comment more than likely already has a GH account.

genivia-inc commented 1 year ago

Just no.

I mean, sure, discussing technical details and issues is great when it helps everyone. But no thanks to a discussion forum.

Doesn't anyone else code "in the zone" these days? Without any distractions such as discussion forums and social media?

On this topic: I love the suggestions given. I already have ideas to improve the TUI. It's just taking longer than intended, because people keep asking for other new features and for a faster ugrep. With improved internals, ugrep is now (mostly) faster than other grep tools. This takes some guts to do and time to test everything to avoid possible bugs.

Another idea related to the TUI is to be able to navigate to the next match and previous match quickly, not just files. So CTRL-S/CTRL-W can be extended to do so. Or another key binding perhaps, but I'm running out of keys to assign.

nkh commented 1 year ago

I created #281 which is directly related to this issue.

I'm a long-time FZF user, and I've not only been following similar issues there but have also wanted more UI features multiple times. But I think I was wrong: although the FZF author has made great concessions through the years, it's understandably getting difficult to get UI changes.

The problem is not with FZF or the UI you'd put on ug but with the variety of demands the end users have, and maybe they should be given a tool rather than a solution.

#281 goes a long way to separate search and UI concerns; ug can implement whatever UI it pleases but its functionality is accessible to people who want a more specialized UI.

                                 .--------------------------------------------.
                                 v                                            |
                          .------------.                                      |
                          | user input |                                      |
                          '------------'                                      |
                                 |                                            |
                                 v                                            |
                        .----------------.                                    |
           .------------| frontend logic |-----------.                        |
           v            |                |           v                        |
   .---------------.    |                |   .---------------.   .---------.  |
   | other         |    |                |   | search engine |-->| results |  |
   | functionality |    |                |   '---------------'   '---------'  |
   '---------------'    |                |           |  ^             |       |
          |             |                |<----------'  '-------------'       |
          '------------>|                |                                    |
                        |                |                                    |
                        '----------------'                                    |
                                 |                                            |
                                 v                                            |
                        .----------------.                                    |
                        | result display |                                    |
                        '----------------'                                    |
                            .-----------------.                               |
                            | pr.-----------------.                           |
                            '---| pr.-----------------.                       |
                                '---| p.-----------------.                    |
                                    '--| preview display |                    |
                                       '-----------------'                    |
                                 |                                            |
                                 '--------------------------------------------'

GwynethLlewelyn commented 1 year ago

One thing that I always worry about, as more and more features are piled upon tools such as ugrep: when will all those features have an impact on raw performance?

Consider the following very informal statistics: I've got access to a few Unix-like machines, both at home and hosted remotely, where I can compare one very simple metric — how big is the (stripped) binary?

For ugrep 4.02, here is a comparative table showing what I found out:

| Machine | Operating System | CPU & speed | RAM | ugrep (bytes) | grep (bytes) |
| --- | --- | --- | --- | --- | --- |
| MacBook Pro (Retina, 15-inch, Mid 2014) | Big Sur (11.7.9 (20G1426)) (Darwin/BSD with Mach µkernel) | 2.80 GHz Quad-Core Intel Core i7 (amd64) | 16 GB | 1031376 | 140320 |
| Synology NAS DS218play | DSM 7.1 (Linux kernel 4.4.302+) | 1.40 GHz Quad-Core Realtek RTD1296 (SoC) (aarch64) | ~0.864 GB | 792392 | 167184 |
| OEM bare metal server | Ubuntu 22.04.3 LTS (jammy) (Linux kernel 5.15.0-79) | 3.60 GHz 8-Core Intel(R) Xeon(R) CPU E3-1275 v5 (amd64) | 64 GB | 1036520 | 182728 |
| Raspberry Pi Zero 2 W | Raspberry OS based on Debian GNU/Linux 11 (bullseye) (Linux kernel 6.1.21-v8+) | 1.00 GHz Quad-Core Cortex-A53 (aarch64) | 0.5 GB | 935880 | 182496 |
| Bluehost box (jailed environment) | Unknown distribution (Linux kernel 4.19.150-76) | 2.10 GHz 16-Core Intel(R) Xeon(R) Gold 5318Y (amd64) | ~58 GB | ~987136 | ~159744 |

A small note: the only 'packaged' version of ugrep I get is the one on macOS, because Homebrew releases the latest versions of ugrep as quickly as they're posted on GitHub. On all the others, ugrep is compiled directly from the GitHub sources and targeted for the specific platform (no cross-compilations; each system gets its 'natively' compiled copy), so there may have been crucial optimisations that I might have missed.

grep is, by contrast, whatever comes as default with each system. I believe that all of them are compiled from Stallman's GNU grep.

Also note that it is to be expected that RISC architectures (the two ARM64 examples) may have slightly larger binaries than CISC architectures (such as the examples using Intel 64-bit). This is by design, since, in general, a compiler targeting a RISC architecture will produce more machine language instructions for a piece of code than an equivalent CISC compiler...

Interestingly, though, the smallest file was actually produced on the Synology NAS. I believe I can explain that: Synology does not actually include a C/C++ compiler, so I had to use one from the OpenWRT project, which may be heavily optimised to produce code as small as possible.

Still, it's worth taking into account that ugrep's size on disk is roughly 5 to 7 times the size of grep, going by the table above. Nevertheless, just because it's bigger, it doesn't automatically mean that it's slower, but rather that it packs a lot of features in it! The big question I'm asking, of course, is how much of an impact the "extra goodies" will have on the overall performance.
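For anyone who wants to check the size ratio, the following small sketch computes it per machine from the byte counts in the table above. The dictionary labels are shortened by me, and the Bluehost numbers are approximate in the source, so treat that row as a rough figure.

```python
# (ugrep bytes, grep bytes) per machine, copied from the table above.
sizes = {
    "MacBook Pro (amd64)": (1031376, 140320),
    "Synology DS218play (aarch64)": (792392, 167184),
    "OEM server (amd64)": (1036520, 182728),
    "Raspberry Pi Zero 2 W (aarch64)": (935880, 182496),
    "Bluehost box (amd64)": (987136, 159744),  # approximate values in the source
}

def size_ratios(table):
    """Return how many times larger the ugrep binary is than grep, per machine."""
    return {name: ugrep / grep for name, (ugrep, grep) in table.items()}

for name, ratio in size_ratios(sizes).items():
    print(f"{name:34s} {ratio:4.1f}x")
```

On these numbers the ratio ranges from a bit under 5x (Synology) to a bit over 7x (MacBook Pro).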

I guess that at some point in time there has to be a split between a 'fast CLI-only ugrep' and an 'extended ugrep' which includes a TUI and some sort of plugin system and what not... where extra features will be traded off for 'slightly worse performance but still much faster than grep!'

That's just the way I see things. If, after benchmarking, no matter how 'big' the actual binary becomes with the nifty extra features, its performance is unchanged (or even made faster!), then, of course, you should disregard all of the above 😆

genivia-inc commented 1 year ago

Thanks for the info. This helps.

I don't expect that adding a new TUI feature will make the binary a lot larger.

Let me explain why. Here is a rough breakdown:

-rw-r--r--  1 engelen  staff    29848 Aug 24 10:57 ugrep-cnf.o
-rw-r--r--  1 engelen  staff     2312 Aug 24 10:57 ugrep-glob.o
-rw-r--r--  1 engelen  staff   107944 Aug 24 10:57 ugrep-output.o
-rw-r--r--  1 engelen  staff   107528 Aug 24 10:57 ugrep-query.o
-rw-r--r--  1 engelen  staff    13736 Aug 24 10:57 ugrep-screen.o
-rw-r--r--  1 engelen  staff    11864 Aug 24 10:57 ugrep-stats.o
-rw-r--r--  1 engelen  staff   619336 Aug 24 10:57 ugrep-ugrep.o
-rw-r--r--  1 engelen  staff     8568 Aug 24 10:57 ugrep-vkey.o
-rw-r--r--  1 engelen  staff     5128 Aug 24 10:57 ugrep-zopen.o

Then there are the RE/flex regex library objects, which include all the Unicode-related logic and also provide the necessary glue to put PCRE2 or Boost.Regex into action:

-rw-r--r--  1 engelen  staff   80432 Aug 24 16:57 lib/libreflex_a-block_scripts.o
-rw-r--r--  1 engelen  staff   88952 Aug 24 16:57 lib/libreflex_a-convert.o
-rw-r--r--  1 engelen  staff    1424 Aug 24 16:57 lib/libreflex_a-debug.o
-rw-r--r--  1 engelen  staff    5528 Aug 24 16:57 lib/libreflex_a-error.o
-rw-r--r--  1 engelen  staff   28672 Aug 24 16:57 lib/libreflex_a-input.o
-rw-r--r--  1 engelen  staff  136056 Aug 24 16:57 lib/libreflex_a-language_scripts.o
-rw-r--r--  1 engelen  staff   13248 Aug 24 16:57 lib/libreflex_a-letter_scripts.o
-rw-r--r--  1 engelen  staff   44752 Aug 24 16:57 lib/libreflex_a-matcher.o
-rw-r--r--  1 engelen  staff     544 Aug 24 16:57 lib/libreflex_a-matcher_avx2.o
-rw-r--r--  1 engelen  staff     544 Aug 24 16:57 lib/libreflex_a-matcher_avx512bw.o
-rw-r--r--  1 engelen  staff  125000 Aug 24 16:57 lib/libreflex_a-pattern.o
-rw-r--r--  1 engelen  staff    8368 Aug 24 16:57 lib/libreflex_a-posix.o
-rw-r--r--  1 engelen  staff     648 Aug 24 16:57 lib/libreflex_a-simd_avx2.o
-rw-r--r--  1 engelen  staff     544 Aug 24 16:57 lib/libreflex_a-simd_avx512bw.o
-rw-r--r--  1 engelen  staff   15520 Aug 24 16:57 lib/libreflex_a-unicode.o
-rw-r--r--  1 engelen  staff   12752 Aug 24 16:57 lib/libreflex_a-utf8.o

The core ugrep functionality without bells-n-whistles is 600K on this ARM64 machine (Pro M1). This includes decompression, regex matching (optimized), PCRE2, file and directory reading, .gitignore rules, Boolean query logic, and so on. None of that will increase in the future, at least not by much.

The SIMD inlined code to speed up matching is not small because of multi-code versioning. On an x64 we do a runtime check for SSE2, AVX2 and AVX512 capabilities to pick the appropriate matcher for speed and to ensure binary portability. Note: the 600K in the list above includes AArch64 vector code and optimizations, not x64/SSE2/AVX2/AVX512.

The entire TUI takes 100K+. Adding a split window will probably only add tens of K.

Async output is 100K, but a bit more than that because a large part of the async output is inlined in ugrep and part of the 600K. The async output is necessary for worker threads to output their results, which will be synchronized into one output stream. This also requires sorting the output when --sort is specified. The way this works is efficient, keeping a bit vector representing a window of completed work, to effectively limit latencies. This way, a worker thread can output immediately if it knows it is its "slot" to output i.e. all other thread workers have output results already in the specified --sort ordering. Threads won't wait by outputting to a private output buffer that is merged into the output stream in the right "slot" (i.e. as per sort).
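To illustrate the general idea of ordered output "slots" (this is only a rough sketch, not ugrep's C++ implementation, which uses a bit vector window as described above), here is a small Python version. Each worker buffers its result privately and hands it over when done; results are flushed to the shared stream strictly in slot order, and no worker blocks waiting for an earlier slot. The names `OrderedOutput` and `run` are hypothetical.

```python
import threading

class OrderedOutput:
    """Merge per-task output buffers into one stream in slot order."""

    def __init__(self, sink):
        self._sink = sink              # callable that receives each flushed buffer
        self._lock = threading.Lock()
        self._pending = {}             # slot -> completed but not yet flushed output
        self._next = 0                 # lowest slot that has not been flushed

    def submit(self, slot, text):
        """Called by a worker once the task for `slot` is complete."""
        with self._lock:
            self._pending[slot] = text
            # Flush every consecutive completed slot, starting at the one due next.
            while self._next in self._pending:
                self._sink(self._pending.pop(self._next))
                self._next += 1

def run(tasks, workers=4):
    """Run callables on worker threads; return their outputs in task order."""
    out = []
    ordered = OrderedOutput(out.append)
    queue = iter(enumerate(tasks))
    queue_lock = threading.Lock()

    def worker():
        while True:
            with queue_lock:
                item = next(queue, None)
            if item is None:
                return
            slot, task = item
            ordered.submit(slot, task())   # the task itself runs outside any lock

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

A worker whose slot is next gets its output written immediately; any other worker just leaves its buffer behind and moves on to the next file.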

This is just one of the many other things happening in ugrep that grep doesn't do. Sure, if we limited ugrep to grep's capabilities and just added a TUI, it would perhaps add 100K to grep. But that also needs logic to control search thread start/stop when a key is pressed. All that adds code space too.

Nevertheless, keeping the binary small(ish) around the current size is a good thing. I don't believe it will be a problem to do so.

GwynethLlewelyn commented 1 year ago

Thanks for the awesomely detailed explanation!

I suspected as much, just because I'm always utterly amazed at how fast ugrep is at recursively going through directories — even without the pre-indexed thingy that you have implemented (which I haven't given a try yet). In fact, ugrep is so fast that I'm always searching for ways to get it to search for files on the entire filesystem (which gets pre-indexed daily via a variant of the locate subsystem), because it's usually faster than locate anyway — not to mention find, of course.

Huh. In fact, I just realised that I could replace locate entirely: just get a list of all files with plain old find every day (or rather, overnight), pipe it to a text file, and then simply use ugrep on that file. Simple! And very likely way faster than locate (and better regexp support, too!).

Extra bonus: heavily compress that master index file every day before generating a new one, and store it for those scenarios when you suddenly realise that you had this file in your system yesterday, but it seems to have been gone today... since ugrep is so fast with large compressed files anyway, it would be worth the trouble :-)
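A minimal sketch of that idea, in Python rather than with find and ugrep themselves, just to show the shape of it: walk the tree once, write every path to a compressed index, then search the index with a regex. The function names and the index location are made up for illustration. In practice the search step would be done by ugrep with its -z option for compressed files, and paths containing newlines would need extra care.

```python
import gzip
import os
import re

def build_index(root, index_path):
    """Write every file path under `root` to a gzip-compressed index, one per line.
    Returns the number of paths written."""
    count = 0
    with gzip.open(index_path, "wt", encoding="utf-8", errors="surrogateescape") as index:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                index.write(os.path.join(dirpath, name) + "\n")
                count += 1
    return count

def search_index(index_path, pattern):
    """Return all indexed paths that match the regular expression `pattern`."""
    rx = re.compile(pattern)
    hits = []
    with gzip.open(index_path, "rt", encoding="utf-8", errors="surrogateescape") as index:
        for line in index:
            path = line.rstrip("\n")
            if rx.search(path):
                hits.append(path)
    return hits
```

Usage would be along the lines of calling `build_index("/", "/var/tmp/files.idx.gz")` from a nightly job (the path is only an example) and `search_index("/var/tmp/files.idx.gz", r"\.conf$")` whenever something needs to be found. Keeping yesterday's index around gives the "where did that file go?" lookup for free.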

There you go — a new side-project for you: create ulocate, a locate-compatible replacement using ugrep instead of whatever they do (there are several variants these days, all claiming to be better and faster than the 'original' locate).

GwynethLlewelyn commented 1 year ago

And now another more serious suggestion: support Brotli compression. I'm aware that it is more popular for on-the-fly compression of HTTP streams (essentially because Google has included it in Chrome, and since everybody uses Chrome these days, both nginx and Apache support it as well, which covers perhaps 95% of all web servers out there), although I use it regularly when rotating logs — the extra compression is very welcome on my systems :)

And then, of course, every time I need to grep through old logs, I have to decompress them first...

Not to mention those cases when I just happen to use Brotli to compress tar archives and then forget that none of my usual tools is able to open those without decompressing them first :)

genivia-inc commented 1 year ago

Thank you for your suggestions. I will take a look at supporting Brotli compression. This is interesting. The license is compatible. Added to my TODO wishlist.

GwynethLlewelyn commented 1 year ago

Great! I sincerely hope it's easy to implement as well; I cannot say, these days I've all but neglected my C/C++ skills to the point I barely can "read" C++ without a dictionary :) All I can say is that Brotli's CLI is deliberately modeled after others (so that it can be used as a direct replacement on scripts) and the same goes for the API wrapping in the Go programming language (with which I've been toying around) — replacing one compression/decompression engine with another is a breeze. But — I have no idea if this is the case with C/C++, maybe it's a nightmare to implement and that's one of the reasons that Brotli has not acquired widespread usage outside the Google ecosystem (which, of course, includes all Chromium-based browsers).

genivia-inc commented 1 year ago

I've made some progress on the new TUI split screen feature.

The progress/status bar will always be visible from now on. The bar splits the TUI into two halves when --split is specified (e.g. in a .ugrep config file) or when CTRL-T is pressed to toggle the split screen.

The bottom half shows a screenful of a file's contents from the first match on. When line numbers are displayed in the top half, the bottom half shows the file starting at the matching line. This is also very nice when searching with context, for example:

image

The bottom half will follow the files and line numbers displayed in the top half. The contents of compressed files and (nested) archives are also displayed in the bottom half. For example, here is a zip file being searched, with the top half showing the list of archived contents (option -l) and the bottom half showing the contents selected with the search for Hello:

image

I hope to get everything sorted out and fully tested in a few days. One thing I still need to do is to reuse the search engine patterns to search a specific file to display in the bottom half. This is important when the search pattern is complex, such as Boolean queries that require CNF normalization and the construction of multiple pattern matchers for conjunctive matching. It makes no sense to do this every time the TUI needs to display the bottom half starting with the matching line.
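As a rough illustration of why reuse matters (not how ugrep does it internally), the sketch below represents a Boolean query in conjunctive normal form as a tuple of clauses, compiles one matcher per clause exactly once, and then reuses the cached matchers every time a preview needs the first matching line. The names `compile_cnf` and `first_matching_line` are hypothetical, and the line-by-line matching is an assumption made to keep the example short.

```python
import re
from functools import lru_cache

@lru_cache(maxsize=32)
def compile_cnf(cnf):
    """Compile a CNF query once.
    `cnf` is a tuple of clauses; each clause is a tuple of alternative regexes.
    Clauses are ANDed together, alternatives within a clause are ORed."""
    return tuple(
        re.compile("|".join(f"(?:{alt})" for alt in clause))
        for clause in cnf
    )

def first_matching_line(lines, cnf):
    """Return the 0-based index of the first line satisfying every clause, or None."""
    matchers = compile_cnf(cnf)   # cached: repeated previews reuse the same matchers
    for i, line in enumerate(lines):
        if all(m.search(line) for m in matchers):
            return i
    return None

# (foo OR bar) AND baz
query = (("foo", "bar"), ("baz",))
text = ["foo only", "baz only", "bar and baz", "foo baz"]
print(first_matching_line(text, query))
```

Moving the selection to another file only costs a scan of that file, since the normalization and pattern construction have already been done.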

That's it for now. In the meantime, if you have any suggestions, comments and questions then let me know!

genivia-inc commented 1 year ago

One of the most amazing things that is possible with the new preview panel of the TUI is that you can preview the contents of files in nested archives at any archived nesting depth. For example at --zmax=3 levels deep in the ugrep/tests directory:

ug -Qlz --zmax=3 tests
image

I've completed the implementation. I've done a lot of testing, but just in case, I will do more testing today and tomorrow.

The effort was mostly to make sure the TUI runs smoothly and to allow all command-line options to work seamlessly with the new preview panel. I've also optimized the preview search as much as possible by reusing the search engine to show the file in the preview pane. This required refactoring the source code to keep the last engine state alive until ugrep quits. This allows reuse of the engine, so that when archives are searched, the TUI can efficiently search the file to display and can also search an archive again (even nested) until the file is found in a (nested) archive to display in the preview pane.

The new x64 and arm64 executables are only about 2K (arm64) to 8K (x64 exe) larger.

acelticsfan commented 1 year ago

Nice work!

genivia-inc commented 1 year ago

ugrep v4.3 is released.

GwynethLlewelyn commented 1 year ago

Fantastic :) The fun bit is to search for something that doesn't exist, then you'll even get a reasonably competent file manager hehe (I've seen so much worse!).

On macOS, using the kitty terminal (aye, written by the same author as Calibre), and connecting to a Synology NAS (arm64) where I have ugrep 4.3 freshly compiled, it runs smoothly and quickly, and simple things like colour support or using the mouse to move up/down all work flawlessly. Much, much better than I would have imagined it :) There are lots of nifty "easter eggs" here and there which are actually fun to discover hehe...

Knowing that I almost never will use the TUI (except for some fun!), I have to say, the tiny increase in the memory size is not really noticeable, even on a device with constrained memory such as a home NAS.

I'm eagerly waiting for the multiple-hour compilation of 4.3 on my poor overworked Raspberry Pi Zero 2 W to finish 🤣 I know, I know, I should do some cross-compiling, but I'm too lazy to keep all libraries in sync across multiple architectures (there will always be a missing one!), so I prefer to wait for a successful compilation to finish. The only statically-typed, compiled language where I do frequently cross-compile is Go, but that's because Go cheats: everything is linked statically, generating an insanely large binary!

That reminds me of another nifty featurette to implement, I should add that to the long list...

Modern terminal software, like the above-mentioned kitty but also many others which support the Sixel specification, is able to directly draw images and even videos inside the terminal itself (i.e. not by calling an external viewer or so). This nifty trick is then used by TUIs to directly preview images/videos inside the terminal. I started out thinking that this was merely a 'cute' enhancement; these days, however, I find it absolutely essential, and I'm constantly adding tools which have kitty support and getting more and more amazed at what they can do.

Quickly opening a freshly converted file using ImageMagick in ultra-sharp, pixel-perfect resolution without leaving the shell has become part of my daily routine — especially when working on remote servers, where it's even more cumbersome to have to mount a remote folder (or use SFTP...), then use the native GUI-based file browser just to preview the image to see if it's correct, and return to the shell on the terminal... it saves a lot of time, and, these days, even obscure functionality in certain tools — such as displaying embedded pictures in PGP keys (aye, you can do that!) when doing gpg --list-keys, at least on a Mac — will use these protocols to directly draw to the terminal (and no, it's not restricted to the X Window System, either!).

While actually implementing this functionality requires some coding (no matter how straightforward the API might be), you could, in theory, use delegates to do the job for you, such as timg or tpix. The former will figure out if the TTY is something that supports the kitty, Sixel, or iTerm2 graphics protocol, and use the appropriate set of functions to render images inside those terminals; if all else fails, it falls back to a very neat approximation using Unicode blocks of either half the size or a quarter of the size of a "full" character — thus always rendering images/movies in whatever terminal you're on (so long as it supports ANSI colours with at least 8 bits). tpix is a much simpler tool that will work only on TTYs supporting the kitty terminal protocol, but it has the advantage of releasing statically built binaries for Intel/ARM Linux or macOS which should work even in Unix-like remote environments where you don't have a compiler and/or the ability to install anything.
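The Unicode-block fallback is simple enough to sketch. The following is not timg's code, just a minimal illustration of the half-block trick under the assumption that the terminal understands 24-bit colour escape sequences: each character cell shows two vertically stacked pixels, with the upper half block (U+2580) drawn in the top pixel's colour as foreground and the bottom pixel's colour as background. The function name `render_half_blocks` is made up.

```python
def render_half_blocks(pixels):
    """Render a grid of (r, g, b) pixels as text, two pixel rows per text row.
    `pixels` is a list of rows; every row is a list of (r, g, b) tuples."""
    rows = []
    for y in range(0, len(pixels), 2):
        top = pixels[y]
        # Pad with black when the image has an odd number of rows.
        bottom = pixels[y + 1] if y + 1 < len(pixels) else [(0, 0, 0)] * len(top)
        cells = []
        for (tr, tg, tb), (br, bg, bb) in zip(top, bottom):
            # Foreground colours the upper half, background the lower half.
            cells.append(f"\x1b[38;2;{tr};{tg};{tb}m\x1b[48;2;{br};{bg};{bb}m\u2580")
        rows.append("".join(cells) + "\x1b[0m")   # reset colours at end of row
    return "\n".join(rows)

# A tiny 4x3 test image: a red row, a green row, a blue row.
image = [[(255, 0, 0)] * 4, [(0, 255, 0)] * 4, [(0, 0, 255)] * 4]
print(render_half_blocks(image))
```

A real tool would first scale the image down to the terminal size and pick a colour mode the terminal actually supports; the graphics protocols mentioned above avoid this approximation altogether.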

Well... it's just a thought... but since you were asking for more features... I'm sure you'll have your hands full until ugrep version 12 🤣

genivia-inc commented 1 year ago

Interesting suggestions and thanks for your comments. Happy to see RPi zero works for you, which I played with a lot over the years (including other RPi and other SBC). I actually tested ugrep on a RPi zero for the initial ugrep releases, which initially didn't compile for various reasons, so I fixed that in ugrep back then. I spent hours figuring that out, longer than I had expected, because of the strange ways -mach arguments work (or don't) to detect neon/AArch64 and SSE/AVX. It's a bit hacky with custom code in configure.ac, but works like a charm to detect HW features.

Like you, I don't always use the TUI either, but there are situations where I know it is going to take some effort to find what I need. So running (u)grep repeatedly on the command line is a bit of a pain. That's when the TUI becomes more useful, to refine the search pattern and options without having to do this on the command line and getting huge dumps of data. But if we already know what we are looking for, then the TUI is not absolutely needed. It is not the primary reason I wrote ugrep, but I thought it couldn't hurt to have one and see how far we can take it.

Because the TUI is a new thing, it is hard to anticipate all possible use cases upfront to make sure the TUI covers them properly. So figuring out later what the TUI should do (or do differently) is not unexpected. Therefore, feedback is very useful to move in the right direction with the TUI.

genivia-inc commented 1 year ago

@GwynethLlewelyn there is more to look for in the TUI features arriving soon, such as regex syntax highlighting #300. I've picked a default syntax highlighting color scheme that should work well with both dark and light themed terminals. Comments/suggestions/rants are welcome!

genivia-inc commented 8 months ago

I've enabled the GitHub Discussions feature for people to ask questions and share suggestions on ugrep.

I was reluctant in the past, because of the potential of continuous distractions that it may pose to me, when I want to work on challenging things most of the time. We'll see.