Yara-Rules / rules

Repository of yara rules
GNU General Public License v2.0

Crypto/base64.yar too many false positives #239

Closed xambroz closed 7 years ago

xambroz commented 7 years ago

The rule contentis_base64 triggers a false positive on every single EXE file. That would be acceptable if the rule were named might_contain_base64_strings, but the EXE file as a whole is not Base64, so the current name is misleading.

Matching without any context also slows down scanning. A formatting regular expression with no atoms can slow scanning significantly, especially when it is run on a folder with larger files.

The rule does not technically match all Base64-encoded data, only data of length 8 bytes or longer. I would recommend documenting this in the rule description:

description = "This rule matches base64 strings of length of decoded data >= 8 bytes"
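The length threshold can be checked outside YARA. The following is a minimal sketch, assuming the rule's pattern is the one used with grep further down in this issue; the variable names are illustrative. The pattern needs at least three groups of four characters, so 12 encoded characters (9 decoded bytes) is the shortest possible match.

```shell
# Pattern from this issue in grep -E syntax (assumed to mirror the YARA regex).
re='([A-Za-z0-9+/]{4}){3,}([A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?'

# "Hello" encodes to 8 characters: below the 12-character minimum, so no match.
short_hits=$(printf 'SGVsbG8=\n' | grep -c -E "$re" || true)

# "Hello world\n" encodes to 16 characters: long enough to match.
long_hits=$(printf 'SGVsbG8gd29ybGQK\n' | grep -c -E "$re" || true)

echo "short=$short_hits long=$long_hits"   # prints: short=0 long=1
```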

The regular expression does not allow for a newline character at an arbitrary position, such as:

$ echo -e "S\nG\nVsb\nG8gd2\n9ybGQK" | base64 -d
Hello world
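As a rough illustration of the miss, the same input can be run through the pattern with grep. This is only a sketch: grep works line by line, which is not how YARA scans, but in both cases no run of 12 alphabet characters survives the line breaks. The variable names are illustrative.

```shell
# Pattern from this issue in grep -E syntax (assumed to mirror the YARA regex).
re='([A-Za-z0-9+/]{4}){3,}([A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?'

# Split across lines: no single line has 12 or more Base64 characters.
split_hits=$(printf 'S\nG\nVsb\nG8gd2\n9ybGQK\n' | grep -c -E "$re" || true)

# Same data with the newlines removed: one 16-character run, which matches.
joined_hits=$(printf 'S\nG\nVsb\nG8gd2\n9ybGQK\n' | tr -d '\n' | grep -c -E "$re" || true)

echo "split=$split_hits joined=$joined_hits"   # prints: split=0 joined=1
```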

To match the rule name "contentis_base64", I would recommend that it really match the whole content, from the beginning of the input ("^") to the end ("$"):

 $a = /^([A-Za-z0-9+\/\n]{4}){3,}([A-Za-z0-9+\/\n]{2}==|[A-Za-z0-9+\/\n]{3}=)?$/

If matching from beginning to end, it might make sense to match Base64 of any length:

$a = /^([A-Za-z0-9+\/\n]{4}){0,}([A-Za-z0-9+\/\n]{2}==|[A-Za-z0-9+\/\n]{3}=)?$/
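A whole-content check of this kind can be approximated in the shell. The sketch below uses a different technique from the regex above: it drops CR/LF first and then requires the remaining text to match in full with grep -x, instead of allowing \n inside the character class. The helper name is_base64 is hypothetical and not part of Yara-Rules.

```shell
# Hypothetical helper: succeeds only if all of stdin, ignoring CR/LF, is Base64.
is_base64() {
    # Remove line breaks, then require a full-line match (-x) of the
    # "any length" form of the pattern: zero or more quads plus optional padding.
    tr -d '\r\n' |
        grep -q -x -E '([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?'
}

# Base64 with arbitrary line breaks is accepted.
printf 'S\nG\nVsb\nG8gd2\n9ybGQK\n' | is_base64 && echo "base64"

# Text with characters outside the alphabet is rejected.
printf 'GetProcAddress!\n' | is_base64 || echo "not base64"
```

Note that an API name such as GetLastError (12 alphabet characters) would still pass this check, so anchoring addresses the naming concern but not every false positive.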

Another option would be to leave the rule as it is but rename it to something like might_contain_base64.

Michal Ambroz

========= cut here ======================================================

For example, here is the list of strings matched in one EXE (mostly false positives):

$ grep -a -o -E "([A-Za-z0-9+/]{4}){3,}([A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?" example.exe
InitializeCriticalSectio
displacement
CorExitProce
CompareStrin
GetCurrentPackag
LCMapStringE
LocaleNameToLCID
abcdefghijklmnopqrstuvwx
abcdefghijklmnopqrstuvwx
ABCDEFGHIJKLMNOPQRSTUVWX
ABCDEFGHIJKLMNOPQRSTUVWX
NtUnmapViewOfSection
GetProcessHe
GetLastError
GetProcAddre
GetModuleHandleA
CryptAcquireContextW
UnhandledExceptionFilter
SetUnhandledExceptionFil
GetCurrentProces
TerminateProcess
IsProcessorFeaturePresen
QueryPerformanceCoun
GetCurrentProces
GetCurrentThread
GetSystemTimeAsFileT
InitializeSListH
IsDebuggerPresen
GetStartupIn
GetModuleHandleW
RaiseExcepti
SetLastError
EnterCriticalSection
LeaveCriticalSection
DeleteCriticalSectio
InitializeCriticalSectionAndSpinCoun
LoadLibraryE
GetStdHandle
GetModuleFileNam
MultiByteToWideC
WideCharToMultiB
GetModuleHandleE
GetCommandLi
GetCommandLi
FindFirstFileExW
FindNextFile
IsValidCodeP
GetEnvironmentString
FreeEnvironmentStrin
SetEnvironmentVariab
CompareStrin
LCMapStringW
SetStdHandle
GetStringTyp
FlushFileBuffers
GetConsoleCP
GetConsoleMo
SetFilePointerEx
WriteConsole
DecodePointe
abcdefghijklmnopqrstuvwx
ABCDEFGHIJKLMNOPQRSTUVWX
abcdefghijklmnopqrstuvwx
ABCDEFGHIJKLMNOPQRSTUVWX
manifestVersion=
requestedPrivile
requestedExecutionLe
/requestedPrivileges
272H2M2R2s2x
1D1H1L1P1T1X
1d1h1l1p1t1x
4D4H4L4P4T4X
4d4h4l4p4t4x
5D5H5L5P5T5X
60646H6L6P6T
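The effect can be reproduced without a sample binary. As a small sketch, three of the API names from the list above are fed to the same pattern; each consists only of Base64 alphabet characters and is at least 12 characters long, so each counts as a hit. The variable names are illustrative.

```shell
# Pattern from the grep command above.
re='([A-Za-z0-9+/]{4}){3,}([A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?'

# Ordinary Windows API names taken from the output above.
fp_hits=$(printf 'GetLastError\nTerminateProcess\nEnterCriticalSection\n' |
    grep -c -E "$re" || true)

echo "false positives: $fp_hits"   # prints: false positives: 3
```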

Xumeiquer commented 7 years ago

Hi xambroz,

Let's start with the way YARA searches for matches. I recommend reading issue 504.

It is true that the rule is slowing down the performance of scanning, but there is no way to know the context where the Base64 code will be found.

It is true that the rule matches >= 8 bytes of Base64 code. That was done to reduce the false positives reported in issue 153.

The valid characters for Base64 encoding are: A-Z, a-z, 0-9, +, / and = as padding. Those values are defined in section 3 of RFC 3548 (https://tools.ietf.org/html/rfc3548#section-3). The newline character \n is not in that set, so I do not see the point of putting it in the regex.

The purpose of the rule is to match portions of binary files that are valid Base64, so there is no point in searching for Base64 across the whole file. Moreover, if there is a file whose entire content is Base64, the rule will match it as well.

Finally, I do not see the point of renaming the rule: if the regex matches a Base64 sequence, it is because there is Base64 content, not because there might be.

As for your final thoughts, there is no easy solution, because all of those strings match the Base64 regex. You know that they are not used as Base64 (although they could be); instead they are used as normal strings.

xambroz commented 7 years ago

Hello Xumeiquer,

> It is true that the rule is slowing down the performance of scanning, but there is no way to know the context where the Base64 code will be found.

Compared with the usability of rules like utils/suspicious_strings.yar:Base64d_PE or malware/MALW_Miscelanea.yar:Base64_encoded_Executable, I do not see the use of having Crypto/base64.yar in the default index.yar of Yara-Rules. The rule is slow and triggers a false positive on any alphanumeric text string of 12 characters or longer, so it will hit pretty much any file you ever throw at it.

> I do not get the point on renaming the rule, if the regex matches a Base64 sequence, it is because there is a Base64 content and not there might be Base64.

This is probably a matter of personal taste. I believe the name of a rule should say what the rule matches. My understanding of "Content is something" (like "Content is PE Executable") is that the whole file is "something". In this case the whole file is not Base64 encoded; it just might or might not contain some possibly Base64-encoded strings.

In Yara-Rules we do not have any other rule named "contentis_something" to compare the semantics with. We do have some "Embedded_something" rules (like PDF_Embedded_Exe, Embedded_EXECloaking) and "Contains" rules (like Contains_VBA_macro_code, Contains_VBE_File, Contains_UserForm_Object, Contains_hidden_PE_File_inside_a_sequence_of_numbers).

> The character new line \n is not in that set so I do not see the point to put them in the regex. Those values are defined at the section 3 of RFC 3548 (https://tools.ietf.org/html/rfc3548#section-3).

RFC 3548 says implementations should not produce or rely on line breaks unless a referencing RFC specifies otherwise. RFC 2045 (MIME) then explicitly states that the encoded output must be broken into lines of no more than 76 characters, with the breaks placed at will. https://tools.ietf.org/html/rfc3548#section-2.1 https://tools.ietf.org/html/rfc2045#section-6.8

Base64 is most often used as part of another encoding such as MIME or PEM. The base64 tool from GNU coreutils, which is the de facto reference implementation, also accepts and produces line breaks (it wraps lines at 76 characters by default, as in MIME).

For malware analysis, allowing '\n' (and possibly '\r') in a Base64 match is essential, for example for malware where cipher material or certificates are commonly encoded as PEM, i.e. Base64 with arbitrary line breaks. See for example:
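The default wrapping is easy to observe. This is a small sketch that assumes GNU coreutils base64 (other implementations may not wrap or may lack the -w option); the variable names are illustrative. 100 input bytes encode to 136 characters, which GNU base64 prints as two lines.

```shell
# Length of the first output line with default settings (GNU coreutils assumed).
first_len=$(head -c 100 /dev/zero | base64 | awk 'NR==1 {print length($0)}')

# Number of output lines with default wrapping.
line_count=$(head -c 100 /dev/zero | base64 | awk 'END {print NR}')

# With -w 0, wrapping is disabled and everything stays on one line.
unwrapped_len=$(head -c 100 /dev/zero | base64 -w 0 | awk '{print length($0)}')

echo "first line: $first_len chars, lines: $line_count, unwrapped: $unwrapped_len chars"
```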

$ echo -e "aHR0cDov\nL21hbHdh\ncmUuZG9\n3bmxvYWQ\nuaXQva2l\nsbGVyMDc\nK" > /tmp/test-case1.b64
# This text is valid Base64 longer than 8 characters
$ cat /tmp/test-case1.b64 | base64 -d
http://malware.download.it/killer07
# But this YARA rule will miss it
$ yara Crypto/base64.yar /tmp/test-case1.b64
Crypto/base64.yar(14): warning: $a is slowing down scanning (critical!)

The same goes for:

$ echo -e "U3V\nwZX\nJTZ\nWNy\nZXR\nTdH\nJpb\nmcK" > /tmp/test-case2.b64
$ echo -e "U\n3\nV\nw\nZ\nX\nJ\nT\nZ\nW\nN\ny\nZ\nX\nR\nTd\nH\nJ\np\nb\nm\nc\nK" > /tmp/test-case3.b64

> The meaning of the rule is to match portions of binary files that are Base64 valid so there is no point to search for a Base64 in the whole file

If that is the purpose, to match substrings in a binary file, then I really believe the rule should be renamed to something like Base64_string. I would also ask that it be removed from the default Yara-Rules/index.yar. Matching such strings might give some insight when reversing one particular binary, but a rule that triggers on every single file is of no use when, for example, running Yara-Rules over a whole directory, because it produces a huge number of false positives.

Xumeiquer commented 7 years ago

Crypto/base64.yar has been moved to utils/base64.yar