VirusTotal / yara-python

The Python interface for YARA
http://virustotal.github.io/yara/
Apache License 2.0
646 stars 179 forks source link

Unexpected 'fullword' results on OSX #184

Closed jgrunzweig closed 3 years ago

jgrunzweig commented 3 years ago

I've discovered unexpected results while using the 'fullword' keyword modifier in yara-python on OSX. Note that this issue likely only effects OSX, as I was unable to reproduce on Linux and Windows environments.

Example rule:

rule fullword_issue
{
    strings:
        $str = "Load" fullword

    condition:
        $str
}

Example file (Also attached): yara_issue.bin.zip

00000000  00 e8 56 a7 00 00 b5 cf  00 00 ee 8b 47 08 e8 49  |..V.........G..I|
00000010  a7 00 00 98 49 00 00 23  d7 89 7d e4 e8 3b a7 00  |....I..#..}..;..|
00000020  00 5a 91 00 00 c0 89 7d  e8 e8 2e a7 00 00 9a c3  |.Z.....}........|
00000030  00 00 e7 8d 45 b8 e8 21  a7 00 00 14 63 00 00 57  |....E..!....c..W|
00000040  28 73 c7 45 f0 4c 6f 61  64 e8 0e a7 00 00 ec 77  |(s.E.Load......w|
00000050  00 00 8a 83 8b 45 f8 e8  00 a7 00 00 10 08 00 00  |.....E..........|
00000060  f0 0f 85 03 28 00 00 e8  f0 a6 00 00 d9 7c 00 00  |....(........|..|
00000070  f0 c3 e8 b3 2c 00 00 e8  e0 a6 00 00 d3 6c 00 00  |....,........l..|
00000080

Steps to reproduce:

>>> import yara
>>> yara.__version__
'4.1.0'
>>> rules = yara.compile(source=open("[file].yar", 'r').read())
>>> rules.match(data=open("yara_issue.bin", 'rb').read())
[]

Removing the fullword modifier results in a match. It appears as though yara-python is incorrectly identifying the surrounding characters around 'Load' and inappropriately believing them to be alphanumeric.

Appreciate any help in resolving this issue.

wxsBSD commented 3 years ago

I think this boils down to the isalnum() checks in https://github.com/VirusTotal/yara/blob/master/libyara/scan.c#L672-L677 which are locale aware. I haven't figured out what locale python is using that is triggering this, but if you try setting LC_CTYPE=POSIX in your environment before running the python script it will work.

One possible solution is to not use isalnum() here and to roll our own version of it. Curious if @plusvic has thoughts on it, as I'm happy to do the implementation.

plusvic commented 3 years ago

That's interesting, I couldn't imagine that isalnum() is locale-aware. I agree on implementing our own version of isalnum(), go head with it if you want.

wxsBSD commented 3 years ago

I wrote a short script that checks to see what values isalnum() cares about on my system and the results are fascinating to me: https://gist.github.com/wxsBSD/5cc0aab5fc99eccaf20eb95b94ed79e8. All the lines with matches are because isalnum() did not consider that byte alphanumeric. Things behave as I would expect for parts of it. That is, 0x30-0x39, 0x41-0x5a and 0x61-0x7a are all considered alphanumeric, but things get weird around 0xaa and later when it starts NOT matching (because isalnum() is considering those bytes alphanumeric.

An interesting point is that https://gist.github.com/wxsBSD/e817cef3444b16a0669e52cbd9a7a958 shows a simple C program behaves as I would expect. I'm sure this is some unicode thing that I am terrible at but I'll go ahead and implement a check for 0x30-39, 0x41-0x5a and 0x61-0x7a only.

I did just notice this also lives in https://github.com/VirusTotal/yara/blob/master/libyara/re.c#L105 so I'll also investigate that. I suspect it is also misbehaving but only when running inside python.

metthal commented 3 years ago

I think I have dealt with something similar (or maybe the same) some time ago (2 years I think). This is a comment I left in the code after the fix.

// Enforce C locale because if we are called from Python bindings
// UTF-8 locale seems to be used. Mixing default <cctype> functions
// with non-default locales seems to be not portable across different systems.
//
// Linux: worked as usual
// macOS: expected 'char' instead of 'unsigned char'
// Windows: asserted on signed types
static std::locale cLocale("C");

Then, when calling std::isalnum() we do it as std::isalnum(x, cLocale);. So calling this function instead - https://en.cppreference.com/w/cpp/locale/isalnum. I'm not exactly sure how to enforce locale in C because I'm mostly versed in C++ but it might help.

wxsBSD commented 3 years ago

@metthal - I think you're right but rather than deal with it and any edge cases that may come up from my very limited understanding of locales I figure it is probably best to just be very explicit on what we consider alphanumeric (anything in 0-9a-zA-Z range). I realize this is very English centric but I think that was the original intention of the code.