BurntSushi / ripgrep

ripgrep recursively searches directories for a regex pattern while respecting your gitignore
The Unlicense

--count --multiline doesn't behave as documented #2852

Open jeanas opened 4 months ago

jeanas commented 4 months ago

What version of ripgrep are you using?

14.1.0

How did you install ripgrep?

Reproduces with distro package and cargo installed version.

What operating system are you using ripgrep on?

Fedora 40

Describe your bug.

From the --help output:

    -c, --count
        This flag suppresses normal output and shows the number of lines that
        match the given patterns for each file searched. Each file containing a
        match has its path and count printed on each line. Note that unless
        -U/--multiline is enabled, this reports the number of lines that match
        and not the total number of matches. In multiline mode, -c/--count is
        equivalent to --count-matches.

However, the behavior I'm seeing is that --count still counts the number of matching lines, not the number of matches, even in multiline mode.

What are the steps to reproduce the behavior?

$ cat file.txt 
match match match match
match match match match

What is the actual behavior?

$ rg --count match file.txt
2

$ rg --count-matches match file.txt
8

$ rg --count --multiline match file.txt
2

What is the expected behavior?

The last command should print 8 (or the documentation should be changed).
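For readers who want to see the two counting semantics side by side, here is a minimal Python sketch. It is only an illustration built on Python's `re` module, not ripgrep's implementation; the function names and the `FILE_TXT` constant are made up for this example.

```python
import re

# Contents of file.txt from the reproduction steps above.
FILE_TXT = "match match match match\nmatch match match match\n"


def count_matching_lines(pattern: str, text: str) -> int:
    """Count lines that contain at least one match (what plain --count reports)."""
    regex = re.compile(pattern)
    return sum(1 for line in text.splitlines() if regex.search(line))


def count_matches(pattern: str, text: str) -> int:
    """Count every non-overlapping match (what --count-matches reports)."""
    return sum(1 for _ in re.finditer(pattern, text, flags=re.MULTILINE))


if __name__ == "__main__":
    print(count_matching_lines("match", FILE_TXT))  # 2
    print(count_matches("match", FILE_TXT))         # 8
```

According to the documentation quoted above, `rg --count --multiline` should correspond to the second function, but the observed output corresponds to the first.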

JOSBEAK commented 4 months ago

I would like to work on this issue. Can anyone help me figure out how to get started?

jeanas commented 4 months ago

I haven't looked much at the code, but I think I get what's happening.

The help says:

           WARNING: Because of how the underlying regex engine works, multiline
           searches may be slower than normal line-oriented searches, and they may
           also use more memory. In particular, when multiline mode is enabled,
           ripgrep requires that each file it searches is laid out contiguously in
           memory (either by reading it onto the heap or by memory-mapping it).
           Things that cannot be memory-mapped (such as stdin) will be consumed
           until EOF before searching can begin. In general, ripgrep will only do
           these things when necessary. Specifically, if the -U/--multiline flag
           is provided but the regex does not contain patterns that would match \n
           characters, then ripgrep will automatically avoid reading each file
           into memory before searching it. Nevertheless, if you only care about
           matches spanning at most one line, then it is always better to disable
           multiline mode.

And sure enough:

$ cat file.txt 
start end
start end

$ rg --count-matches "start|end" file.txt 
4

$ rg --count "start|end" file.txt 
2

$ rg --count --multiline "start|end" file.txt 
2

$ rg --count --multiline "^start|end$" file.txt 
4

In other words: when the regex doesn't contain ^ or $, ripgrep notices that multiline mode is useless and falls back to the normal, non-multiline search, but that fallback also changes the semantics of --count.
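A toy Python model of this hypothesis, for illustration only: `looks_multiline` is a hypothetical and deliberately crude stand-in for whatever check ripgrep actually performs (I haven't read that code), and `count` is an invented helper, not a ripgrep API.

```python
import re

# Contents of the second file.txt above.
FILE_TXT = "start end\nstart end\n"


def looks_multiline(pattern: str) -> bool:
    """Hypothetical heuristic, NOT ripgrep's real check: treat a pattern as
    needing multiline search if it mentions ^, $ or \\n anywhere. It is crude
    on purpose (for example, it misfires on a class like [^a])."""
    return any(token in pattern for token in ("^", "$", r"\n"))


def count(pattern: str, text: str, multiline: bool = False) -> int:
    """Model of --count under the hypothesis described above."""
    if multiline and looks_multiline(pattern):
        # Multiline search really runs: every match is counted.
        return sum(1 for _ in re.finditer(pattern, text, flags=re.MULTILINE))
    # Otherwise the line-oriented search runs: matching lines are counted.
    regex = re.compile(pattern)
    return sum(1 for line in text.splitlines() if regex.search(line))


if __name__ == "__main__":
    print(count("start|end", FILE_TXT))                    # 2
    print(count("start|end", FILE_TXT, multiline=True))    # 2
    print(count("^start|end$", FILE_TXT, multiline=True))  # 4
```

Under this model the count depends on which search path was chosen rather than on whether --multiline was passed, which matches the outputs shown above.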

meedstrom commented 2 months ago

Could it be related to #2779?