Handle invalid unicode in metadata values.

wxsBSD commented 4 years ago

In #135 it was brought up that you can crash the python interpreter if you have invalid unicode in a metadata value. This is my attempt to fix that by attempting to create a string, and if that fails falling back to a bytes object. On the weird chance that the bytes object fails to create I added a safety check so that we don't add a NULL ptr to the dictionary (this is how the crash was manifesting).

It's debatable if we want to ONLY add strings as metadata, and NOT fallback to bytes. If we don't fall back to bytes the only other option I see is to silently drop that metadata on the floor. The tradeoff here is that now you may end up with a string or a bytes object in your metadata dictionary, which is less than ideal IMO.

I'm open to suggestions on this one.

Fixes #135

wxsBSD commented 4 years ago

As I said in the commit, I'm not happy with either option. I don't like silently dropping metadata that is not valid unicode, and having it be either bytes or strings is less than ideal also.

wxs@wxs-mbp yara-python % cat test.py
import yara

rules = yara.compile(source=r'rule test { meta: a = "\x80" b = "foo" condition: true }')
print(list(rules)[0].meta["a"])

matches = rules.match("/bin/ls")
print(list(matches)[0].meta)
wxs@wxs-mbp yara-python % PYTHONPATH=build/lib.macosx-10.14-x86_64-3.7 python3 test.py
b'\x80'
{'a': b'\x80', 'b': 'foo'}
wxs@wxs-mbp yara-python %

malvidin commented 4 years ago

The PyUnicode_DecodeUTF8 and other PyUnicode_Decode* APIs can handle the errors with backslashdecode, but those require a length.

I don't know the length is available, could be made available in the future, or if the result of the PyBytes_FromString has a length that could then be passed to PyUnicode_FromEncodedObject with backslashdecode error handling.

At this point, the way that YARA processes metadata strings makes "\x20" and " " identical, yara-python doesn't see the strings until after the escaping happens. In the long term, I would hope that YARA won't convert metadata strings like "\x00" and "\x80" into bytes, and accepts all other bytes except null. But that only decreases the likelihood of the issue, it doesn't not remove it.

malvidin commented 4 years ago

What about replacing this: object = PyBytes_FromString(meta->string); with this: object = PyUnicode_DecodeUTF8(meta->string, strlen(meta->string), "replace");

Using replace loses the invalid characters entirely, but the output is simple to handle. If surrogateescape is used instead of replace, the original bytes are available if desired, but printing requires setting the the IO encoding error handling. Another option is ignore, which removes all invalid characters.

With replace: These commands will return the original data, with invalid bytes replaced with �. print(list(rules)[0].meta['a']) # '�' print(list(rules)[0].meta['a'].encode('utf8')) # b'\xef\xbf\xbd'

With surrogateescape: The second command will return the original UTF8, with invalid bytes intact. print(repr(list(rules)[0].meta['a'])) # '\udc80' print(list(rules)[0].meta['a'].encode('utf8', 'surrogateescape')) # b'\x80'

This raises a UnicodeEncodeError, unless the error handling was changed with something like PYTHONIOENCODING=utf-8:surrogateescape. print(list(rules)[0].meta['a']) UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 0: surrogates not allowed

With any of these, Python doesn't crash.

plusvic commented 4 years ago

I would say that metadata should be only text, at least that was the original intention.

malvidin commented 4 years ago

Based on that intention, I recommend dropping the invalid bytes using ignore.

malvidin commented 4 years ago

For the metadata decoding, is there a drawback to defining PY_STRING as PyUnicode_DecodeUTF8, rather than PyUnicode_FromString? Or is it better to use PyUnicode_FromString and handle any errors with PyUnicode_DecodeUTF8?

For the tests, I recommend accepting an empty string '' or the original string r'\x80'. The first accepts current behavior when YARA unescapes the original string and decoding issue is handled. The second accepts the original string if YARA provides that in the future, which matches the original intention of metadata.

https://github.com/wxsBSD/yara-python/pull/1/commits/c9f4260007a00182d7461c410d0f9406beb1e035

googlebot commented 4 years ago

All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter.

We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only @googlebot I consent. in this pull request.

Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the cla label to yes (if enabled on your project).

ℹ️ Googlers: Go here for more info.

wxsBSD commented 4 years ago

OK, I've updated this with your commit @malvidin - but I modified the tests in your commit to not include the check for returning the raw string. It seems wrong to me to include a test which is not testing the current expected behavior, but is testing something YARA may do in the future (as unlikely as it is to do that).

I'm still not happy with ignoring bytes from strings which do not decode as UTF8, but as you've pointed out there is no good way to handle this. I'm going to include another commit which updates the documentation to say strings in metadata must be valid UTF8 codepoints.

wxsBSD commented 4 years ago

I put up https://github.com/VirusTotal/yara/pull/1260 as a documentation update to go along with this PR if it is merged.

VirusTotal / yara-python

Handle invalid unicode in metadata values. #136