Genivia / ugrep

NEW ugrep 6.5: a more powerful, ultra fast, user-friendly, compatible grep. Includes a TUI, Google-like Boolean search with AND/OR/NOT, fuzzy search, hexdumps, searches (nested) archives (zip, 7z, tar, pax, cpio), compressed files (gz, Z, bz2, lzma, xz, lz4, zstd, brotli), pdfs, docs, and more
https://ugrep.com
BSD 3-Clause "New" or "Revised" License

Find text in both encodings #201

Closed Antiever closed 4 weeks ago

Antiever commented 2 years ago

Pretty common case: I need to find a string (in a binary file) regardless of encoding (Unicode/ASCII). I still don't understand how to do it right; by logic, --encoding=binary should do exactly that. But in fact the --encoding flag does nothing: the output is the same regardless of the encoding value. For example, ugrep.exe contains the string .exe in both encodings, but the search only returns matches in the ASCII encoding.

ugrep -i -b --hexdump=A0B0  "\.exe" "ugrep.exe"
ugrep -i -b --hexdump=A0B0 --encoding=UTF-8 "\.exe" "ugrep.exe"

ugrep 3.7.7 WIN64 +sse2 +pcre2jit +zlib +bzip2 +lzma +lz4 +zstd

genivia-inc commented 2 years ago

The --encoding=UTF-8 option specifies the default encoding and does nothing extra. It's in the list to make sure the list is complete (so nobody asks why UTF-8 is not in the list).

The --encoding=binary option reverts to plain raw bytes without assuming any encoding whatsoever. This is effectively the same as --encoding=ASCII, which only uses the raw lower 7 bits of each byte in the input.

In fact, --encoding=binary also supports UTF-8 input, because UTF-8 is just a multibyte encoding with 8 bits! Since we just read bytes for binary and UTF-8 (and ASCII), no decoding is necessary.

To summarize: --encoding=binary does the same as the default --encoding=UTF-8. This supports ASCII, UTF-8 and binary inputs.

The -U option specifies binary patterns instead of Unicode (UTF-8) patterns. So that option is probably more important than --encoding in your case.

If any --encoding is specified, files with a UTF BOM are always decoded according to the BOM in the file. This means that UTF-16 and UTF-32 files are always searched correctly, because they have UTF BOMs, which they really should have to be usable with any tool and text editor.

genivia-inc commented 2 years ago

I agree that this can be confusing. So I propose the following additional explanation to the --encoding option help:

    --encoding=ENCODING
            The encoding format of the input.  The default ENCODING is binary
            and UTF-8 which are the same.  Note that option -U specifies binary
            PATTERN matching (text matching is the default.)  ENCODING can be:
            `binary', `ASCII', `UTF-8', `UTF-16',
            `UTF-16BE', `UTF-16LE', `UTF-32', `UTF-32BE',
            `UTF-32LE', `LATIN1', `ISO-8859-1', `ISO-8859-2',
            `ISO-8859-3', `ISO-8859-4', `ISO-8859-5', `ISO-8859-6',
            `ISO-8859-7', `ISO-8859-8', `ISO-8859-9', `ISO-8859-10',
            `ISO-8859-11', `ISO-8859-13', `ISO-8859-14', `ISO-8859-15',
            `ISO-8859-16', `MAC', `MACROMAN', `EBCDIC',
            `CP437', `CP850', `CP858', `CP1250',
            `CP1251', `CP1252', `CP1253', `CP1254',
            `CP1255', `CP1256', `CP1257', `CP1258',
            `KOI8-R', `KOI8-U', `KOI8-RU'.
Antiever commented 2 years ago

Thank you! It really clears up a lot of things. But still, how do I find the Unicode version of the string as well in the example above?

ugrep -i -b --hexdump=A0B0 -U "\.exe" "ugrep.exe"
1319699:
00142310  63 6d 64 2e 65 78 65 00  2f 63 00 00 00 00 00 00  |cmd.exe@/c@@@@@@|
00142320  63 6d 64 2e 65 78 65 00  20 2f 63 20 00 00 00 00  |cmd.exe@ /c @@@@|
1337765:
001469a0  2e 63 6f 6d 00 2e 65 78  65 00 2e 62 61 74 00 2e  |.com@.exe@.bat@.|
1840189:
001c1430  14 00 15 00 16 00 17 00  75 67 72 65 70 2e 65 78  |T@U@V@W@ugrep.ex|
001c1440  65 00 42 5a 32 5f 62 7a  42 75 66 66 54 6f 42 75  |e@BZ2_bzBuffToBu|

As can be seen, the output contains ASCII strings only. But there is Unicode too:

010Editor_2022-04-03_05-16-15

genivia-inc commented 2 years ago

Ah, I see what you're trying to do here.

The character column in hexdumps has traditionally always been ASCII. Ugrep also highlights control codes, but Unicode (UTF-8/16/32) characters are not shown as a glyph, only as the multibyte sequence. Perhaps there is a way to show UTF-8 decoded glyphs too, but I'm not sure how that works out in practice. Double-wide glyphs have to be detected (which --query already does) to prevent garbled layouts, and they would also run over the right margin.

Also, in exe files there is no way to distinguish UTF-8 from UTF-16 strings, because there is no UTF BOM. Only a guess can be made as to which encoding a multibyte sequence belongs to.

Antiever commented 2 years ago

In general, in reverse engineering, program debugging (and programming in general), this is a common, standard task: a simple search for a (null-terminated) sequence of "readable" characters in normal mode, and the same but with characters separated by null bytes (x2 at the end of the string) for Unicode. Here are a couple of screenshots from GUI programs (CLI utilities are not as visual):

PPEE_2022-04-03_06-35-15

PEAnatomist_2022-04-03_06-39-23

I'm honestly surprised that ugrep doesn't have a way to do such a "simple" thing. Maybe someday, if you have free time, you will add at least some mode for this kind of search.

Antiever commented 2 years ago

BTW YARA seems to be able to do it

>yara64 -s R1 ugrep.exe
WideCharRule ugrep.exe
0x1422c8:$wide_and_ascii_string: .\x00e\x00x\x00e\x00
0x142313:$wide_and_ascii_string: .exe
0x142323:$wide_and_ascii_string: .exe
0x1469a5:$wide_and_ascii_string: .exe
0x1c143d:$wide_and_ascii_string: .exe

R1 content:

rule WideCharRule
{
    strings:
        $wide_and_ascii_string = ".exe" wide ascii

    condition:
        $wide_and_ascii_string
}

and it's open source

genivia-inc commented 2 years ago

I'm pretty sure that YARA applies multiple patterns to search, each matching a different encoding.

That can be easily done with ugrep too, by specifying the patterns to search for. Something like this to match .exe in UTF-8, UTF-16 LE and UTF-16 BE:

ugrep -U -e '\.exe' -e '\.\x00e\x00x\x00e\x00' -e '\x00\.\x00e\x00x\x00e' ...

I'm not convinced this is something ugrep is expected to do automatically out of the box. Such patterns for different encodings can be generated with a custom script or a program like iconv (though the result has to be put into search-pattern form). Note that we only search for strings here, not regex (but ugrep takes a regex, hence the . must be escaped as \.). Strings can always be converted to different UTF (and other) encodings; no problem. A regex, however, is not so easily converted.

For files with mixed encoded fragments such as .exe that have some fragments in UTF-8, some UTF-16 and some raw binary "junk", no file decoder will be able to normalize that input to a single (UTF-8) stream so that a single pattern can be used to search it.

genivia-inc commented 2 years ago

Any reason to keep this issue open? One could perhaps write a "pattern expander" to produce alternate UTF-16 LE/BE string patterns to search for. Something like iconv, but specialized to produce patterns. Such a tool would be a separate program, used like the following to expand patterns inline into the ugrep command (Unix/Linux only):

ugrep -U `pattern-expand .exe` ...

This produces

ugrep -U "\.exe|\.\x00e\x00x\x00e\x00|\x00\.\x00e\x00x\x00e" ...

I'm pretty sure that's what YARA effectively does.
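The pattern-expander idea above can be sketched as a small POSIX shell function. This is only an illustration, not part of ugrep: the function name `pattern_expand` is made up, and it handles literal ASCII strings only, escaping regex metacharacters and emitting the UTF-8, UTF-16LE and UTF-16BE alternatives.

```shell
# Hypothetical "pattern_expand" helper: turn a literal ASCII string into a
# regex alternation matching it in UTF-8, UTF-16LE and UTF-16BE.
pattern_expand() {
  s=$1 utf8= le= be=
  while [ -n "$s" ]; do
    c=${s%"${s#?}"}          # take the first character
    s=${s#?}                 # drop it from the remainder
    case $c in               # escape regex metacharacters
      \\|\.|\^|\$|\*|\+|\?|\(|\)|\[|\]|\{|\}|\|) c="\\$c" ;;
    esac
    utf8=$utf8$c
    le=$le$c'\x00'           # UTF-16LE: each ASCII char followed by a NUL
    be=$be'\x00'$c           # UTF-16BE: each ASCII char preceded by a NUL
  done
  printf '%s|%s|%s\n' "$utf8" "$le" "$be"
}

# prints: \.exe|\.\x00e\x00x\x00e\x00|\x00\.\x00e\x00x\x00e
pattern_expand '.exe'
```

It could then be used inline, as suggested above: ugrep -U -e "$(pattern_expand .exe)" ugrep.exe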

Antiever commented 1 month ago

It's been a long time since this ticket was created, so I just want to clarify: does ugrep still not support multi-encoding search with one command? Something like --encoding=ASCII,UTF-8,UTF-16

Now I have to do two separate searches:

ugrep -i -b -o -W "Pattern" "x:\file.bin"
ugrep -i -b -o -W --encoding=UTF-16 "Pattern" "x:\file.bin"

I'm asking because the variant with pre-prepared hex patterns does not account for case insensitivity, and even if you generate all case variants, it will significantly reduce search speed. So the --encoding method is preferable, and the ability to combine multiple searches into one command would be very handy.

PS: An additional question: how do I display offsets in hex? This is the standard display in most programs, but I have not found such an option.

genivia-inc commented 4 weeks ago

It is not possible to specify multiple encodings. If we added such a feature, it would require searching multiple times, executing ugrep once for each encoding specified. So it would be exactly the same as executing ugrep multiple times (on the command line or in a script). If this is a common use case for you, you may want to create a script that executes ugrep once for each encoding you want, reusing the same options, by writing this in your bash script:

ugrep "$@"
ugrep --encoding=UTF-16 "$@"
ugrep --encoding=UTF-32 "$@"
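For convenience, the three invocations above could be wrapped in a shell function (a sketch; the function name is made up):

```shell
# Hypothetical wrapper: run the same ugrep search once per encoding,
# chaining the three invocations shown above.
ugrep_all_encodings() {
  ugrep "$@"                      # default pass: ASCII/UTF-8/binary
  ugrep --encoding=UTF-16 "$@"    # re-scan as UTF-16 (for BOM-less files)
  ugrep --encoding=UTF-32 "$@"    # re-scan as UTF-32 (for BOM-less files)
}
```

Then, for example: ugrep_all_encodings -i -b -o -W "Pattern" file.bin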

Note that the ASCII, UTF-8 and binary encodings are all handled the same (ASCII is a subset of UTF-8). Also, if UTF-16 and UTF-32 files have a BOM, as they should, then you do not need to specify an encoding, because it is auto-detected.

If you want byte offsets in hex, then you could use --format="%A: %o%~" with the latest release, but this does not obey option -W yet. This will be possible in the upcoming release.

Antiever commented 4 weeks ago

Thank you. I agree, there is logic in dividing the searches.

if UTF-16 and UTF-32...then you do not need to specify an encoding

By default (without specifying encoding arguments), the search returns only ASCII text. To search for Unicode text you need to specify --encoding=UTF-16.

--format="%A: %o%~"...does not obey option -W yet. This will be possible in the upcoming release

Thank you. I will wait for syntax highlighting support with this option in new versions, or for simplicity at least an additional argument for -b to display hex offsets in default mode.

genivia-inc commented 4 weeks ago

if UTF-16 and UTF-32...then you do not need to specify an encoding

By default (without specifying encoding arguments), the search returns only ASCII text. To search for Unicode text you need to specify --encoding=UTF-16.

No, the search always supports Unicode, unless option -U is used. The --encoding=UTF-16 option does nothing if the file has a UTF-16 BOM. If a file is not UTF-16, then --encoding=UTF-16 may return partial matches or even garbage, as it tries to strictly match UTF-16.

--format="%A: %o%~"...does not obey option -W yet. This will be possible in the upcoming release

Thank you. I will wait for syntax highlighting support with this option in new versions, or for simplicity at least an additional argument for -b to display hex offsets in default mode.

Color highlighting in custom formatting is already included with the latest release, but you have to use %[ms]=...%= fields, as in %[ms]=this is in ms color%=, to apply colors. Here, ms is the match color; see the color explanation. So it is not restricted to the default colors and can be made to look any way people want. These colors are subject to the --color= option to enable/disable colors and to --colors= to override color assignments.

Antiever commented 4 weeks ago

No, the search is always supports Unicode, unless option -U is used

Here is a small bin with 1 Unicode "Pattern" string and 3 ASCII ones:

010Editor_LRuAxsUd4H

cmd_V7GmBiHRLc

So it always finds ASCII-only matches unless a 2-byte encoding (UTF-16) is explicitly specified.

but you have to use the %[ms]

Maybe I didn't quite get it, but according to the colors explanation it should look something like this (1 is a test; 2 and 3 are attempts to use it):

cmd_NlmyppXO6M

genivia-inc commented 4 weeks ago

Correct, the third case, where you look with --encoding=UTF-16 in a bin file, matters, because the bin file has no BOM. Normally, text files have a BOM for UTF-16 and UTF-32, so for those text files it does not matter whether you specify --encoding=UTF-16. Your use case with bin files is specific, not typical. For anyone else reading this, it is important to be precise in the wording explaining how the options work.

For the --format colors you need to use the %[ms]=...%= color begin...end fields. See formatting and colors.

genivia-inc commented 4 weeks ago

Closing this as completed.