Genivia / ugrep

NEW ugrep 6.5: a more powerful, ultra fast, user-friendly, compatible grep. Includes a TUI, Google-like Boolean search with AND/OR/NOT, fuzzy search, hexdumps, searches (nested) archives (zip, 7z, tar, pax, cpio), compressed files (gz, Z, bz2, lzma, xz, lz4, zstd, brotli), pdfs, docs, and more
https://ugrep.com
BSD 3-Clause "New" or "Revised" License
2.57k stars 109 forks source link

Recursive --sort=best #341

Closed whatisaphone closed 8 months ago

whatisaphone commented 8 months ago

Here's a test where ugrep sorts a fuzzy match before an exact match. Is this a bug, or did I misunderstand how --sort=best is meant to work?

$ ugrep -V
ugrep 4.4.1 x86_64-pc-linux-gnu +avx2; -P:pcre2jit; -z:zlib,bzip2,lzma,lz4,zstd,brotli
License: BSD-3-Clause; ugrep user manual:  https://ugrep.com
Written by Robert van Engelen and others:  https://github.com/Genivia/ugrep
Ugrep utilizes the RE/flex regex library:  https://github.com/Genivia/RE-flex

$ mkdir /tmp/sort-test
$ cd /tmp/sort-test
$ echo 'mike piazza' > a
$ mkdir dir
$ echo 'mushroom pizza' > dir/b

$ ugrep --sort=best --fuzzy=best pizza
a:mike piazza
dir/b:mushroom pizza
genivia-inc commented 8 months ago

The --sort option sorts per directory and subdirectories are displayed after the sorted list of matching files. So it looks OK to me.

See ug --help sort: "Subdirectories are sorted and displayed after matching files."

genivia-inc commented 8 months ago

I think that we generally prefer to sort each (sub)directory and list subdirectories after sorted files. This is particularly the case for sorting by date and size. Otherwise, when everything is mixed up, it is really hard to see structure in the results.

For fuzzy sorting it might be useful to know which file(s) are overall the best match. I get that. But the sorting method underneath is still the same. I had thought about option -c to display not just counts but also the edit distance or some measure of match accuracy. That will help. But what should the value be? A fraction count perhaps? Or a second value for the edit distance? How to make it clear what it means?

whatisaphone commented 8 months ago

Ah sorry, I missed that bit in the help output. I see now that --sort sorts the files, not the matches. It looks like ugrep is effectively sorting by a tuple of (directory, cost, filename), and I was hoping for (cost, directory, filename).

Personally I would rather see higher scoring matches first, because chances are if there's an exact match, that's what I actually care about. Especially if the query is short, like "user", I'd rather see all the matches for "user" together first, instead of seeing irrelevant off-by-one matches for /usr, used, query, cluster, etc. all mixed together.

I added --fuzzy to my main ug alias so I could catch typos whenever I use it, but maybe that's not the intended use case? I could probably hack this on my own using --format with %Z and piping to sort, but then I'm throwing away ugrep's nice highlighted output and merging matches on the same line, and it would take some plumbing to recreate it all. What are the chances of ugrep supporting this sort order directly?

genivia-inc commented 8 months ago

The --sort=best option sorts the best matching files per (sub)directory. The best matching files are shown first, per directory.

If we want to sort all files searched by best match, then the search hangs until the entire search query completes and all files have been inspected and searched. That would result in an unreasonably long delay when many files are searched, i.e. the TUI may appear to hang until all files have been inspected. If the query is sufficiently large to cover many files, like from the file system root, the search will just hang.

genivia-inc commented 8 months ago

Personally I would rather see higher scoring matches first, because chances are if there's an exact match, that's what I actually care about.

Yes, I get that. A possible way to accomplish this more easily is to produce a list of matching files with the edit distance cost and then sort by edit distance cost. If option -c would include the edit distance cost then sorting the output of ugrep will get you what you want.

Another approach would be to use --format='%p:%Z%~' to produce custom output that can be sorted afterwards. But the current version doesn't support %Z for file lists. I can update this to output the minimum edit distance cost. That allows sorting the list of files by best match afterwards. Something like this:

ug -Z PATTERN -l --format='%Z %p%~' | sort
whatisaphone commented 8 months ago

Yep, if you think my usecase is niche enough I don't mind rolling my own. I thought I'd ask first though. The annoying part is recreating ugrep's fancy match formatting, but it looks possible without any new features in ugrep:

ugrep -Z --format=%Z:%f:%n:%k:%d:%O%~ pizza \
| sort \
| while IFS=: read -r cost path line col length match; do
    printf '%s:%s:%s|%s|%s\n' "${path}" "${line}" "${match:0:col-1}" "${match:col-1:length}" "${match:col-1+length}"
done

Now I just need to add actual colors! The only thing this couldn't easily do is merge multiple matches on the same line, but I can live without that. Thanks for talking this through, you can close the issue if you're not going to implement it.

genivia-inc commented 8 months ago

Very clever and nice!

In the meantime I tinkered a bit on the ugrep formatting with %Z when combined with options -l and -c. To show what is possible with a some tweaks:

$ ugrep -Zbest --sort=best -c test --format='%Z %m %f%~' -1 tests/
0 1 tests/gen.sh
0 5 tests/verify.sh
1 2 tests/Hello.pdf
1 2 tests/archive.cpio
1 2 tests/archive.pax
1 2 tests/archive.tar
1 9 tests/lorem.latin1.txt
1 9 tests/lorem.utf16.txt
1 9 tests/lorem.utf32.txt
1 9 tests/lorem.utf8.txt

where the 1st column is the best (smallest) edit distance computed with -Zbest (which is 0 when there are exact matches), the 2nd column are the number of exact/approximate matches (not a 2nd sort key for --sort=best) and the 3rd column is the pathname.

genivia-inc commented 8 months ago

I will include this minor improvement of %Z in the next update.