Closed Antiever closed 4 weeks ago
The --encoding=UTF-8
option specifies the default encoding, so by itself it does nothing. It's in the list only for completeness (so nobody asks why UTF-8 is missing from the list).
The --encoding=binary
option reverts to plain raw bytes without assuming any encoding whatsoever. This is also effectively the same as --encoding=ASCII,
which only uses the lower 7 bits of each input byte.
In fact, --encoding=binary
also supports UTF-8 input, because UTF-8 is just a multibyte 8-bit encoding! Since we just read raw bytes for binary, UTF-8 and ASCII alike, no decoding is necessary.
To summarize: --encoding=binary
does the same as the default --encoding=UTF-8
. This supports ASCII, UTF-8 and binary inputs.
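Why no decoding is needed for these three can be sketched in a few lines of Python (an illustration of the byte-level relationship, not ugrep code):

```python
# UTF-8 is a byte-level superset of ASCII: the bytes of an ASCII pattern
# appear unchanged inside UTF-8 text, so a raw byte search finds the match
# without any decoding step.
text = 'café .exe fichier'.encode('utf-8')  # UTF-8 bytes, no BOM
pattern = b'.exe'                           # ASCII pattern as raw bytes
print(text.find(pattern))                   # → 6 (byte offset of the match)
```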
The -U
option specifies binary patterns instead of Unicode (UTF-8) patterns. So that option is probably more important than --encoding
in your case.
Whichever --encoding
is specified, files that start with a UTF BOM are always decoded according to that BOM. This means that UTF-16 and UTF-32 files are always searched correctly, because these files should carry a UTF BOM to be usable with any tool or text editor.
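The BOM auto-detection described here can be illustrated with a small Python sniffer (the function and its logic are illustrative, not ugrep's actual implementation):

```python
import codecs

def detect_bom(data: bytes):
    # Check UTF-32 before UTF-16: the UTF-32LE BOM (FF FE 00 00) starts
    # with the same bytes as the UTF-16LE BOM (FF FE).
    for name, bom in (('UTF-32BE', codecs.BOM_UTF32_BE),
                      ('UTF-32LE', codecs.BOM_UTF32_LE),
                      ('UTF-16BE', codecs.BOM_UTF16_BE),
                      ('UTF-16LE', codecs.BOM_UTF16_LE),
                      ('UTF-8', codecs.BOM_UTF8)):
        if data.startswith(bom):
            return name
    return None  # no BOM: fall back to the specified --encoding

print(detect_bom(b'\xff\xfeh\x00i\x00'))  # → UTF-16LE
```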
I agree that this can be confusing. So I propose the following additional explanation to the --encoding
option help:
--encoding=ENCODING
The encoding format of the input. The default ENCODING is binary
or UTF-8, which are the same. Note that option -U specifies binary
PATTERN matching (text matching is the default). ENCODING can be:
`binary', `ASCII', `UTF-8', `UTF-16',
`UTF-16BE', `UTF-16LE', `UTF-32', `UTF-32BE',
`UTF-32LE', `LATIN1', `ISO-8859-1', `ISO-8859-2',
`ISO-8859-3', `ISO-8859-4', `ISO-8859-5', `ISO-8859-6',
`ISO-8859-7', `ISO-8859-8', `ISO-8859-9', `ISO-8859-10',
`ISO-8859-11', `ISO-8859-13', `ISO-8859-14', `ISO-8859-15',
`ISO-8859-16', `MAC', `MACROMAN', `EBCDIC',
`CP437', `CP850', `CP858', `CP1250',
`CP1251', `CP1252', `CP1253', `CP1254',
`CP1255', `CP1256', `CP1257', `CP1258',
`KOI8-R', `KOI8-U', `KOI8-RU'.
Thank you! It really clears up a lot of things. But still, how do I find the Unicode version of the string as well in the example above?
ugrep -i -b --hexdump=A0B0 -U "\.exe" "ugrep.exe"
1319699:
00142310 63 6d 64 2e 65 78 65 00 2f 63 00 00 00 00 00 00 |cmd.exe@/c@@@@@@|
00142320 63 6d 64 2e 65 78 65 00 20 2f 63 20 00 00 00 00 |cmd.exe@ /c @@@@|
1337765:
001469a0 2e 63 6f 6d 00 2e 65 78 65 00 2e 62 61 74 00 2e |.com@.exe@.bat@.|
1840189:
001c1430 14 00 15 00 16 00 17 00 75 67 72 65 70 2e 65 78 |T@U@V@W@ugrep.ex|
001c1440 65 00 42 5a 32 5f 62 7a 42 75 66 66 54 6f 42 75 |e@BZ2_bzBuffToBu|
As can be seen, the output shows ASCII strings only. But the file contains Unicode strings, too:
Ah, I see what you're trying to do here.
The character column in hexdumps has traditionally always been ASCII. Ugrep also highlights control codes, but Unicode (UTF-8/16/32) characters are not shown as glyphs, only as multibyte sequences. Perhaps there is a way to show decoded UTF-8 glyphs too, but I'm not sure how that would work out in practice. Double-wide glyphs have to be detected (which --query already does) to prevent garbled layouts, and they would also run over the right margin.
Also, in exe files there is no way to distinguish UTF-8 from UTF-16 strings because there is no UTF BOM. Only a guess can be made to what the multibyte sequence could possibly belong to.
In general, in reverse engineering, program debugging (and programming in general) this is a common, standard task: a simple search for a (null-terminated) sequence of "readable" characters in normal mode, and the same characters separated by null bytes (x2 at the end of the string) for Unicode. Here are a couple of screenshots from GUI programs (CLI utilities are not as visual):
I'm honestly surprised that ugrep doesn't have a way to do such "simple" things. Maybe someday, if you have free time, you will add at least some mode for this kind of search.
BTW YARA seems to be able to do it
>yara64 -s R1 ugrep.exe
WideCharRule ugrep.exe
0x1422c8:$wide_and_ascii_string: .\x00e\x00x\x00e\x00
0x142313:$wide_and_ascii_string: .exe
0x142323:$wide_and_ascii_string: .exe
0x1469a5:$wide_and_ascii_string: .exe
0x1c143d:$wide_and_ascii_string: .exe
R1 content:
rule WideCharRule
{
strings:
$wide_and_ascii_string = ".exe" wide ascii
condition:
$wide_and_ascii_string
}
and it's open source
I'm pretty sure that YARA applies multiple patterns to search, each matching a different encoding.
That can be easily done with ugrep too, by specifying the patterns to search for. Something like this to match .exe
in UTF-8, UTF-16 LE and UTF-16 BE:
ugrep -U -e '\.exe' -e '\.\x00e\x00x\x00e\x00' -e '\x00\.\x00e\x00x\x00e' ...
I'm not convinced this is something that ugrep is expected to do automatically out of the box. Such patterns for different encodings can be created with a custom script or a program like iconv (though the result would have to be put into search-pattern form). Note that we only search for plain strings here, not a regex (but ugrep takes a regex here, hence the .
must be escaped as \.
). Strings can always be converted to different UTF (and other) encodings. You name it. No problem. However, a regex is not so easily converted, for sure.
For files with mixed encoded fragments such as .exe that have some fragments in UTF-8, some UTF-16 and some raw binary "junk", no file decoder will be able to normalize that input to a single (UTF-8) stream so that a single pattern can be used to search it.
Any reason to keep this issue open? One could perhaps write a "pattern expander" to produce alternate UTF-16 LE/BE string patterns to search for. Something like iconv
but specialized to produce patterns. Such a tool would be a separate program, used like the following to expand patterns inline into the ugrep
command (Unix/Linux only):
ugrep -U `pattern-expand .exe` ...
This produces
ugrep -U "\.exe|\.\x00e\x00x\x00e\x00|\x00\.\x00e\x00x\x00e" ...
I'm pretty sure that's what YARA effectively does.
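Such a pattern expander is easy to sketch. Here is a minimal Python version (the escaping rules and function name are illustrative, not an existing tool):

```python
def expand_pattern(s: str) -> str:
    """Turn a literal string into a regex alternation matching its
    UTF-8, UTF-16LE and UTF-16BE byte sequences."""
    def escape(data: bytes) -> str:
        # Keep alphanumeric bytes readable; render everything else
        # (NUL bytes, regex metacharacters like '.') as \xNN escapes.
        return ''.join(chr(b) if chr(b).isalnum() else '\\x%02x' % b
                       for b in data)
    return '|'.join(escape(s.encode(enc))
                    for enc in ('utf-8', 'utf-16-le', 'utf-16-be'))

print(expand_pattern('.exe'))
# → \x2eexe|\x2e\x00e\x00x\x00e\x00|\x00\x2e\x00e\x00x\x00e
```

This emits `\x2e` instead of `\.` for the dot, which is an equivalent escape in the regex.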
It's been a long time since the ticket was created, so I just want to clarify: does ugrep still not support multi-encoding search with one command?
Something like --encoding=ASCII,UTF-8,UTF-16
Now I have to do two separate searches:
ugrep -i -b -o -W "Pattern" "x:\file.bin"
ugrep -i -b -o -W --encoding=UTF-16 "Pattern" "x:\file.bin"
I'm asking because the variant with pre-prepared hex patterns does not account for case insensitivity, and even if you generate all case variants it will at least significantly reduce the efficiency (speed) of the search. So the --encoding
method is preferable, and the ability to combine multiple searches into one command would be very handy.
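That said, case insensitivity can still be handled when expanding patterns by hand: each ASCII letter becomes a [Xx] character class followed by the interleaved NUL byte of UTF-16LE. A sketch under that assumption (an illustrative helper, not a ugrep feature):

```python
def ci_utf16le_pattern(s: str) -> str:
    """Build a case-insensitive byte regex for the UTF-16LE form of an
    ASCII string: each letter becomes a [Xx] class plus a NUL byte."""
    parts = []
    for ch in s:
        if ch.isascii() and ch.isalpha():
            parts.append('[%s%s]\\x00' % (ch.upper(), ch.lower()))
        else:
            parts.append('\\x%02x\\x00' % ord(ch))  # ASCII-only sketch
    return ''.join(parts)

print(ci_utf16le_pattern('Pattern'))
# → [Pp]\x00[Aa]\x00[Tt]\x00[Tt]\x00[Ee]\x00[Rr]\x00[Nn]\x00
```

A pattern built this way can be passed to ugrep -U to match either case in UTF-16LE data without relying on -i.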
PS In addition question - how to display offsets in hex? This is a standard display in most programs, but I have not found such an option.
It is not possible to specify multiple encodings. If we added such a feature, it would require searching multiple times, once for each encoding specified. So it would be exactly the same as executing ugrep multiple times (on the command line or in a script). If this is a common use case for you, you may want to create a script that executes ugrep for each encoding you want, passing along the same options, by writing this in your bash script:
ugrep "$@"
ugrep --encoding=UTF-16 "$@"
ugrep --encoding=UTF-32 "$@"
Note that ASCII and UTF-8 and binary encodings are all the same (ASCII is a subset of UTF-8). Also, if UTF-16 and UTF-32 files have a BOM as these should, then you do not need to specify an encoding because this is auto detected.
If you want byte offsets in hex, then you could use --format="%A: %o%~"
with the latest release, but this does not obey option -W
yet. This will be possible in the upcoming release.
Thank you. I agree, there is logic in dividing the searches.
if UTF-16 and UTF-32...then you do not need to specify an encoding
By default (without specifying encoding arguments), the search returns only ASCII text. To search for Unicode text you need to specify --encoding=UTF-16
.
--format="%A: %o%~"...does not obey option -W yet. This will be possible in the upcoming release
Thank you.
I will wait for syntax highlighting support with this option in new versions, or for simplicity at least an additional argument for -b
to display hex offsets in default mode.
if UTF-16 and UTF-32...then you do not need to specify an encoding
By default (without specifying encoding arguments), the search returns only ASCII text. To search for Unicode text you need to specify
--encoding=UTF-16
.
No, the search always supports Unicode, unless option -U
is used. The --encoding=UTF-16
option does nothing if the file has a UTF-16 BOM. If files are not UTF-16, then --encoding=UTF-16
may return partial matches or even garbage as it tries to strictly match UTF-16.
--format="%A: %o%~"...does not obey option -W yet. This will be possible in the upcoming release
Thank you. I will wait for syntax highlighting support with this option in new versions, or for simplicity at least an additional argument for
-b
to display hex offsets in default mode.
Color highlighting in custom formatting is already included with the latest release, but you have to use the %[ms]= this is in ms color %=
fields to use colors. Here, ms
is the color associated with matching text, see the color explanation. So it is not restricted to the default colors and can be made to look any way people want. These colors are subject to the --color=
option to enable/disable colors and to --colors=
to override color assignments.
No, the search is always supports Unicode, unless option -U is used
Here is a small bin with 1 Unicode "Pattern" string and 3 ASCII ones
So it always finds ASCII-only matches unless the 2-byte encoding (UTF-16) is explicitly specified.
but you have to use the %[ms]
Maybe I didn't quite get it; according to the colors explanation it should look something like this (1 - test, 2 and 3 - attempts to use it)
Correct, the third case where you look with --encoding=UTF-16
in a bin file matters, because the bin file has no BOM. Normally text files have a BOM for UTF-16 and UTF-32, so for those text files it does not matter whether you specify --encoding=UTF-16
. Your use case with bin files is specific, not typical. For anyone else reading this, it is important to be precise in the wording explaining how options work.
For the --format
colors you need to use the %[ms]=...%=
color begin...end fields. See formatting and colors.
Closing this as completed.
Pretty common case - I need to find a string (in a binary file) regardless of encoding (Unicode/ASCII). I still don't understand how to do it right; by logic,
--encoding=binary
should do exactly that. But in fact the --encoding
flag does nothing: the output is the same regardless of the encoding value. For example, ugrep.exe
contains the string .exe
in both encodings, but the search only returns matches in the ASCII encoding.
ugrep 3.7.7 WIN64 +sse2 +pcre2jit +zlib +bzip2 +lzma +lz4 +zstd