using yara regex rule to scan chinese character, error

VirusTotal / yara

https://virustotal.github.io/yara/

using yara regex rule to scan chinese character, error #1952

Open hanggao481 opened 1 year ago

hanggao481 commented 1 year ago

How can I use a YARA regex rule to scan for Chinese characters? What is the reason for the incorrect match below?

Describe the bug

My YARA rule:

rule AsianCharacter : general
{
  strings:
    $chinese = /[\u8fd9]/
  condition:
    $chinese
}

Match result:

0x1cd:$chinese: u
0x1d2:$chinese: f
0x1dd:$chinese: 8

Expected behavior

Expected match result:

0x1cd:$chinese: 这

Note: the Unicode code point of "这" is U+8FD9 (\u8fd9).

hanggao481 commented 1 year ago

Another example: I want to scan for Chinese characters with a regex YARA rule as below:

rule AsianCharacter : general
{
  strings:
    $chinese = /[\u4e00-\u9fa5]/
  condition:
    $chinese
}

Problem: it does not match Chinese characters.

vthib commented 1 year ago

YARA does not have Unicode handling in strings, and the \u escape syntax does not exist. What you wrote is actually equivalent to [u8fd9], a character class that matches any one of those five ASCII bytes.
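To make the effect of the missing \u escape concrete, here is a minimal sketch in Python rather than YARA. It assumes that a byte-oriented regex over the class [u8fd9] behaves like the pattern YARA ends up with; the function name `matched_bytes` and the sample text are hypothetical and not part of YARA.

```python
import re

# The class that is effectively being matched: the literal ASCII bytes
# 'u', '8', 'f', 'd' and '9' (assumption: this mirrors how YARA reads [\u8fd9]).
EFFECTIVE_CLASS = re.compile(rb"[u8fd9]")


def matched_bytes(data: bytes) -> list:
    """Return every single byte in `data` that the class matches."""
    return EFFECTIVE_CLASS.findall(data)


# Hypothetical sample: UTF-8 text containing the character U+8FD9.
sample = "unicode of \u8fd9 is 8fd9".encode("utf-8")
print(matched_bytes(sample))
```

Only ASCII letters and digits from the surrounding text are reported; the three UTF-8 bytes of the character itself never match, which is consistent with the `u`, `f` and `8` hits in the report above.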

If you want to search for a non-ASCII character, you need to search for the bytes of its encoding as used in the files you scan. For UTF-8 files, that would mean something like this:

rule AsianCharacter : general
{
  strings:
    $chinese = /\xe8\xbf\x99/
  condition:
    $chinese
}

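For UTF-16 encoding, it would be something like /\x8f\xd9/ for big-endian data, or /\xd9\x8f/ for little-endian data (the byte order typically used on Windows).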
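The byte sequences can be derived instead of guessed. The following Python sketch encodes a character in a chosen encoding and prints it as a hex-escaped regex literal that could be pasted into a rule. The helper name `yara_hex_regex` is hypothetical; the codec names are Python's standard ones.

```python
def yara_hex_regex(text: str, encoding: str) -> str:
    """Return a YARA-style regex literal matching `text` in `encoding`.

    Every byte is written as a \\xNN escape, so no byte can be
    misread as a regex metacharacter.
    """
    encoded = text.encode(encoding)
    return "/" + "".join(f"\\x{byte:02x}" for byte in encoded) + "/"


# U+8FD9 in the three encodings most likely to appear in scanned files.
for enc in ("utf-8", "utf-16-be", "utf-16-le"):
    print(enc, yara_hex_regex("\u8fd9", enc))
```

Which variant to use depends on how the scanned files store text; when that is unknown, a rule can declare one string per encoding and combine them with `or` in the condition.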

Note that because you have to match the bytes of a specific encoding, you cannot directly use code-point ranges like the one in your second example.
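A code-point range cannot be written directly, but for UTF-8 data it can be spelled out at the byte level, because consecutive code points map to ordered byte sequences. The sketch below does this for U+4E00 to U+9FA5 using Python's byte regexes. The names `CJK_UTF8` and `is_in_range` are hypothetical, and it is an assumption that YARA's regex engine accepts the same alternation and \xNN ranges inside character classes.

```python
import re

# U+4E00..U+9FA5 expressed as UTF-8 byte ranges:
#   U+4E00..U+4FFF -> e4 [b8-bf] [80-bf]
#   U+5000..U+8FFF -> [e5-e8] [80-bf] [80-bf]
#   U+9000..U+9F7F -> e9 [80-bd] [80-bf]
#   U+9F80..U+9FA5 -> e9 be [80-a5]
CJK_UTF8 = re.compile(
    rb"\xe4[\xb8-\xbf][\x80-\xbf]"
    rb"|[\xe5-\xe8][\x80-\xbf][\x80-\xbf]"
    rb"|\xe9[\x80-\xbd][\x80-\xbf]"
    rb"|\xe9\xbe[\x80-\xa5]"
)


def is_in_range(ch: str) -> bool:
    """True if the UTF-8 encoding of `ch` is matched in full by the pattern."""
    return CJK_UTF8.fullmatch(ch.encode("utf-8")) is not None


print(is_in_range("\u8fd9"))  # the character from the original report
print(is_in_range("A"))
```

If YARA accepts it, the equivalent rule string would be the same four alternatives between slashes. Such a pattern matches any three bytes that fall in these ranges, so on binary files it may also hit data that is not text at all.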

gaohang commented 11 months ago

Thanks. Is there any way to use YARA to match Chinese characters in general? That is, can a range of Unicode code points be written in a YARA regex as in other regex engines, e.g. [\u4e00-\u9fa5]?