VirusTotal / yara

The pattern matching swiss knife
https://virustotal.github.io/yara/
BSD 3-Clause "New" or "Revised" License

wide strings match only UTF16-LE #1891

Open ruppde opened 1 year ago

ruppde commented 1 year ago

Describe the bug

The documentation example at https://yara.readthedocs.io/en/v4.2.3/writingrules.html#wide-character-strings encodes the string "Borland" as B\x00o\x00r\x00l\x00a\x00n\x00d\x00. That is only the little-endian (LE) form of UTF16; the big-endian (BE) form is \x00B\x00o\x00r\x00l\x00a\x00n\x00d, with the \x00 in front of each character. So the example rule from the docs doesn't match UTF16-BE:

rule WideCharTextExample1
{
    strings:
        $wide_string = "Borland" wide

    condition:
        $wide_string
}
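
The difference between the two byte orders can be checked with a few lines of Python. This is only an illustration using Python's standard codecs, not YARA itself, and the variable names are made up for this sketch:

```python
# Illustration only: Python's codecs stand in for the bytes that the
# `wide` modifier searches for (UTF16-LE) and for the bytes that a
# UTF16-BE sample actually contains.
text = "Borland"

wide_le = text.encode("utf-16-le")  # pattern searched for by `wide`
wide_be = text.encode("utf-16-be")  # the same string in a BE sample

print(wide_le)  # b'B\x00o\x00r\x00l\x00a\x00n\x00d\x00'
print(wide_be)  # b'\x00B\x00o\x00r\x00l\x00a\x00n\x00d'

# The LE pattern is not contained in the bare BE encoding, because the
# trailing \x00 it expects is missing.
print(wide_le in wide_be)  # False
```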

UTF16-LE is by far the most common case, but I stumbled upon the string Qi Lijun encoded as UTF16-BE in the sample 2fb7a38e69a88e3da8fece4c6a1a81842c1be6ae9d6ac299afa4aef4eb55fd4b (however that happened ...)


(Actually this is more unexpected behavior than a bug but that fits better than a feature request.)

To Reproduce

rule WideCharTextExample1
{
    strings:
        $wide_string = "Qi Lijun" wide

    condition:
        $wide_string
}

Doesn't match:

$ yara test.yar 2fb7a38e69a88e3da8fece4c6a1a81842c1be6ae9d6ac299afa4aef4eb55fd4b

Expected behavior

There are several options to handle the problem:

  1. Back to the Borland example, the perfect solution would be to search for both UTF16-LE and UTF16-BE:
     UTF16-LE: B\x00o\x00r\x00l\x00a\x00n\x00d\x00
     UTF16-BE: \x00B\x00o\x00r\x00l\x00a\x00n\x00d

  2. The faster and more memory-efficient option would be to strip the trailing \x00 from the pattern the existing implementation uses and search for B\x00o\x00r\x00l\x00a\x00n\x00d, which occurs in both encodings.
     This might produce false matches on very short strings (which shouldn't be that common in rules anyway, because of their performance and false positive problems).

  3. Introduce e.g. widebe as a new string modifier, similar to uint16be.

  4. Explain the issue in the docs and recommend using hex strings for UTF16-BE.
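
Until a dedicated modifier exists, the hex-string workaround can be scripted. The following is a minimal sketch, assuming Python is available in the rule-writing workflow; `utf16be_hex`, `render_rule` and the rule name are hypothetical helpers invented for this example and are not part of YARA or yara-python:

```python
def utf16be_hex(text: str) -> str:
    """Return text encoded as UTF16-BE, formatted as a YARA hex string."""
    encoded = text.encode("utf-16-be")
    return "{ " + " ".join(f"{byte:02x}" for byte in encoded) + " }"


def render_rule(rule_name: str, text: str) -> str:
    """Build a rule matching text as UTF16-LE (wide) or UTF16-BE (hex).

    Assumption: text contains no quotes or backslashes, since this
    sketch does not escape the text string.
    """
    return (
        f"rule {rule_name}\n"
        "{\n"
        "    strings:\n"
        f'        $wide_le = "{text}" wide\n'
        f"        $wide_be = {utf16be_hex(text)}\n"
        "\n"
        "    condition:\n"
        "        any of them\n"
        "}\n"
    )


print(render_rule("WideBothByteOrders", "Qi Lijun"))

# The stripped pattern from the second option occurs in both encodings:
stripped = "Borland".encode("utf-16-le")[:-1]
print(stripped in "Borland".encode("utf-16-le"))  # True
print(stripped in "Borland".encode("utf-16-be"))  # True
```

The generated rule keeps the readable text string next to its hex form, so a reviewer can still see which string the bytes stand for.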


Additional context

This also affects string search on VT. This search doesn't show any results:
content:"Qi Lijun" tag:peexe
This one shows 10 hits (the same string hex-encoded as UTF16-BE):
content:{00 51 00 69 00 20 00 4c 00 69 00 6a 00 75 00 6e}

ruppde commented 1 year ago

To be more precise: this isn't a problem if the string to be matched sits in the middle of a UTF16-BE file, because there are null bytes all around it, so the LE pattern still matches one byte later. It's only a problem where the text switches from a multi-byte to a single-byte charset (like in the example above) or at the end of the file, because there the trailing \x00 is missing.

For example this string

$endtag = "%>" ascii wide 

... wouldn't match the closing %> of the UTF-16BE-encoded webshell below, because the wide variant only searches for 25 00 3e 00 (and the ascii variant for 25 3e), while the file ends with 00 25 00 3e.

$ hexdump UTF-16BE.jsp 
00000000  3c 25 40 20 70 61 67 65  20 63 6f 6e 74 65 6e 74  |<%@ page content|
00000010  54 79 70 65 3d 22 63 68  61 72 73 65 74 3d 55 54  |Type="charset=UT|
00000020  46 2d 31 36 42 45 22 20  25 3e 00 3c 00 25 00 52  |F-16BE" %>.<.%.R|
00000030  00 75 00 6e 00 74 00 69  00 6d 00 65 00 2e 00 67  |.u.n.t.i.m.e...g|
00000040  00 65 00 74 00 52 00 75  00 6e 00 74 00 69 00 6d  |.e.t.R.u.n.t.i.m|
00000050  00 65 00 28 00 29 00 2e  00 65 00 78 00 65 00 63  |.e.(.)...e.x.e.c|
00000060  00 28 00 72 00 65 00 71  00 75 00 65 00 73 00 74  |.(.r.e.q.u.e.s.t|
00000070  00 2e 00 67 00 65 00 74  00 50 00 61 00 72 00 61  |...g.e.t.P.a.r.a|
00000080  00 6d 00 65 00 74 00 65  00 72 00 28 00 22 00 69  |.m.e.t.e.r.(.".i|
00000090  00 22 00 29 00 29 00 3b  00 25 00 3e              |.".).).;.%.>|

So the problem is rather low priority.
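
The behavior described in this comment can be reproduced without YARA. The sketch below rebuilds the file from the hexdump and uses plain byte searches as a stand-in for the matcher; that substitution, the `find_all` helper and all variable names are assumptions made for illustration:

```python
# Rebuild the file from the hexdump: an ASCII page directive followed by
# the UTF-16BE-encoded script body.
header = b'<%@ page contentType="charset=UTF-16BE" %>'
body = '<%Runtime.getRuntime().exec(request.getParameter("i"));%>'
data = header + body.encode("utf-16-be")


def find_all(haystack: bytes, needle: bytes) -> list[int]:
    """Return every offset at which needle occurs in haystack."""
    offsets = []
    start = haystack.find(needle)
    while start != -1:
        offsets.append(start)
        start = haystack.find(needle, start + 1)
    return offsets


end_ascii = b"%>"                  # 25 3e
end_le = "%>".encode("utf-16-le")  # 25 00 3e 00
end_be = "%>".encode("utf-16-be")  # 00 25 00 3e

print(find_all(data, end_ascii))  # [40]  only the ASCII directive
print(find_all(data, end_le))     # []    trailing 00 is missing at EOF
print(find_all(data, end_be))     # [152] the real closing tag

# In the middle of the BE text the LE pattern still matches, one byte
# after the start of each BE-encoded occurrence.
print(find_all(data, "Runtime".encode("utf-16-le")))  # [47, 69]
```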

jaredscottwilson commented 4 months ago

I'd like to +1 this issue.

What you're running into here is the programName value within the SpcSpOpusInfo details.

We have blogged about this in "I Solemnly Swear My Driver Is Up to No Good". Right now we have to encode the string as UTF-16BE and then hex-encode it before adding it to YARA rules. A utf16be modifier would be very helpful, both for readability of the rules and for rule creation.