dajva / rg.el

Emacs search tool based on ripgrep
https://rgel.readthedocs.io
GNU General Public License v3.0

Fix the issue where rg.el couldn't search Unicode characters. #167

Open chansey97 opened 2 months ago

chansey97 commented 2 months ago

A known issue with rg.el is that Unicode search is not supported on Windows.

Some related issues: https://github.com/dajva/rg.el/issues/101 https://github.com/dajva/rg.el/issues/117

The reason is that at the moment NTEmacs limits non-ASCII file arguments to the current codepage, see https://github.com/emacs-mirror/emacs/blob/58a7b99823c5c42161e9acf2abf6c22afd4da4cd/src/w32.c#L1648.

Running subprocesses in non-ASCII directories and with non-ASCII file arguments is limited to the current codepage (even though Emacs is perfectly capable of finding an executable program file in a directory whose name cannot be encoded in the current codepage). This is because the command-line arguments are encoded before they get to the w32-specific level, and the encoding is not known in advance (it doesn't have to be the current ANSI codepage), so w32proc.c functions cannot re-encode them in UTF-16. This should be fixed, but will also require changes in cmdproxy. The current limitation is not terribly bad anyway, since very few, if any, Windows console programs that are likely to be invoked by Emacs support UTF-16 encoded command lines.

For similar reasons, server.el and emacsclient are also limited to the current ANSI codepage for now.

Emacs itself can only handle command-line arguments encoded in the current codepage.
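To make the limitation concrete, here is a minimal Python sketch (not Emacs code) that checks whether a search pattern can be represented in a given ANSI codepage at all. The helper name `survives_codepage` and the choice of `cp1252` and `cp932` as example codepages are assumptions for illustration only.

```python
def survives_codepage(text: str, codepage: str = "cp1252") -> bool:
    """Return True if `text` round-trips through `codepage` without loss."""
    try:
        return text.encode(codepage).decode(codepage) == text
    except UnicodeEncodeError:
        # At least one character has no representation in this codepage.
        return False


if __name__ == "__main__":
    for pattern in ["hello", "caf\u00e9", "\u65e5\u672c\u8a9e", "\U0001F600"]:
        # ascii() keeps the output printable on any console encoding.
        print(ascii(pattern), survives_codepage(pattern, "cp1252"))
```

A pattern that fails this check for the active codepage is one that cannot be passed intact as a command-line argument under the limitation described above.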

This patch provides a workaround: instead of passing Unicode arguments to ripgrep directly from Emacs, it passes them via a temporary .bat script, which is generated whenever rg-build-command is called. This allows rg.el to search all Unicode planes (including rare scripts and emoji), rather than being restricted to a specific codepage.

P.S. This feature is disabled by default to keep the old behavior; it can be enabled by setting rg-w32-unicode to t.
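The general idea of routing a command through a temporary script can be sketched in Python as follows. This is not the patch's implementation: the function name `write_proxy_script`, the `rg-proxy-` file prefix, and the use of `chcp 65001` to switch the console to UTF-8 are all assumptions about one plausible shape of such a script.

```python
import os
import tempfile


def write_proxy_script(command_line: str) -> str:
    """Write `command_line` into a temporary UTF-8 .bat file and return its path.

    Hypothetical helper: the real patch may build its script differently.
    """
    fd, path = tempfile.mkstemp(suffix=".bat", prefix="rg-proxy-")
    # UTF-8 without BOM, CRLF line endings as batch files conventionally use.
    with os.fdopen(fd, "w", encoding="utf-8", newline="\r\n") as script:
        script.write("@echo off\n")
        # Assumption: switch the console codepage to UTF-8 before running rg.
        script.write("chcp 65001 >nul\n")
        script.write(command_line + "\n")
    return path


if __name__ == "__main__":
    print(write_proxy_script('rg -e "\u65e5\u672c\u8a9e" .'))
```

Because only the ASCII path of the script crosses the process boundary, the Unicode text never has to fit in the current codepage. The sketch does not escape batch metacharacters such as `%`, which a real implementation would need to handle.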

chansey97 commented 2 months ago

I would have to trust you that the solution is valid

Thanks. It works pretty well on Windows😀.


Would it be possible to generate a generic script once instead and, when invoking a search, just provide the search arguments to this script, which can then forward them to ripgrep? My guess here is that the answer would be no

No, and your guess is correct.

Emacs on Windows (NTEmacs) creates subprocesses via CreateProcessA instead of CreateProcessW. The former limits command-line arguments to the current ANSI code page. This might be changed by enabling the Windows 10 "Beta: UTF-8" setting, but that feature is still in beta and has a lot of problems (e.g. it makes old apps display mojibake). So using the W version is always correct, but NTEmacs doesn't support it at the moment.
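As a generic illustration of mojibake (not specific to the Windows "Beta: UTF-8" setting), the following Python sketch shows what happens when bytes written in one encoding are read back in another. The helper name `misdecode` and the example encodings are assumptions chosen for the demonstration.

```python
def misdecode(text: str, written_as: str = "utf-8", read_as: str = "cp1252") -> str:
    """Encode `text` with one encoding and decode the bytes with another."""
    return text.encode(written_as).decode(read_as, errors="replace")


if __name__ == "__main__":
    # ascii() keeps the output printable on any console encoding.
    print(ascii(misdecode("caf\u00e9")))
```

Here the two UTF-8 bytes of "é" are each read as a separate cp1252 character, so "café" comes back as "cafÃ©".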

Are there other alternatives to this that could work, like providing arguments via stdin or similar?

~Yes. I just re-checked ripgrep's arguments and found it supports PATTERNFILE.~

-f, --file=PATTERNFILE          Search for patterns from the given file.

~which could be used as an alternative to the rg-w32-ripgrep-proxy.~

~Perhaps rg.el can provide a switch that allows users to choose -e or -f mode, so that there is no need to distinguish between Windows and Linux (Windows users use -f mode). I will try it and update in the next version if it works.~

I deleted the -f method because it has a drawback.

The bottom line is that on Windows we need to pack the entire command into a temp file (the pattern alone is not enough).

P.S. Providing arguments via a pipe could be another path, but it is not as easy to debug as this one. It is also much harder to implement and might require an additional utility (e.g. xargs) that does not even exist on Windows.

Is there anything similar on Windows that would be needed (possibly only on some systems, depending on settings)?

File permissions are generally not an issue on Windows.

I would not be able to merge this patch for some time since I have a problem with the CI setup, so I need to fix that first. Just so you know that it may take a while to get this in even after it has been accepted.

No problem. I can wait🙂.