PUNCH-Cyber / stoq-plugins-public

stoQ Public Plugins
https://stoq.punchcyber.com
Apache License 2.0

IOC Extract - Add error handling to decoding #91

Closed (malvidin closed this 4 years ago)

malvidin commented 4 years ago

If an invalid character is found, the decoding fails. See issue https://github.com/PUNCH-Cyber/stoq-plugins-public/issues/90
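For context, the failure mode can be reproduced with the standard library alone. This is a minimal sketch, not the plugin's code: the sample bytes are made up, and it assumes the content is decoded with Python's default strict UTF-8 codec.

```python
# Hypothetical payload: ASCII text plus one byte (0xE9) that is not valid UTF-8 here.
payload = b"caf\xe9 example.com"

try:
    payload.decode()  # strict UTF-8 decoding is the default
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" substitutes U+FFFD for the undecodable byte.
replaced = payload.decode(errors="replace")

# errors="ignore" silently drops the undecodable byte.
ignored = payload.decode(errors="ignore")

print(strict_failed, repr(replaced), repr(ignored))
```

Both error handlers keep the rest of the text (and any IOCs in it) intact, but the byte that could not be decoded is lost or replaced.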

mlaferrera commented 4 years ago

I'm not a fan of handling decode errors this way, because it can remove characters that would otherwise be valid. Can you update the PR to use beautifulsoup4's UnicodeDammit class instead? The decode call would simply be replaced with UnicodeDammit(payload.content).unicode_markup.
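A rough sketch of what that replacement could look like in isolation. The helper name to_text and the fallback to bytes.decode are assumptions made for illustration; they are not taken from the plugin or from the stoQ framework.

```python
def to_text(content: bytes) -> str:
    """Decode bytes to str, guessing the encoding when beautifulsoup4 is installed."""
    try:
        from bs4 import UnicodeDammit  # third-party package: beautifulsoup4
    except ImportError:
        # Assumption of this sketch: fall back to lossy UTF-8 decoding without bs4.
        return content.decode(errors="replace")
    markup = UnicodeDammit(content).unicode_markup
    # unicode_markup can be None when no candidate encoding could be applied.
    return markup if markup is not None else content.decode(errors="replace")


text = to_text(b"plain ascii example.com")
print(text)
```

Unlike a fixed codec with an error handler, UnicodeDammit tries several candidate encodings before giving up, which is why it can preserve characters that a strict UTF-8 decode would reject.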

malvidin commented 4 years ago

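If so, I recommend adding cchardet to the requirements of the stoq framework, since encoding detection with chardet alone is much slower. The timings below are total seconds for 10 runs, each decoding 1,000,000 random bytes.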

# With cchardet
>>> from timeit import timeit
>>> timeit(stmt='UnicodeDammit(os.urandom(1000000)).unicode_markup', setup='import os; from bs4 import UnicodeDammit', number=10)
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
9.779683475004276

# With chardet
>>> timeit(stmt='UnicodeDammit(os.urandom(1000000)).unicode_markup', setup='import os; from bs4 import UnicodeDammit', number=10)
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
223.1361071279971

# Plain bytes.decode with errors="replace" (no encoding detection)
>>> timeit(stmt='os.urandom(1000000).decode(errors="replace")', setup='import os', number=10)
0.17595233899191953

mlaferrera commented 4 years ago

Thanks, @malvidin! Can you also bump the version number to 3.0.2? We can also make a note in the README that recommends installing cchardet for performance improvements.
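A small sketch of how a user could check which detection package is importable before relying on that recommendation. The function name installed_detectors is hypothetical, and the sketch only reports what is installed; which package actually gets used is decided inside beautifulsoup4.

```python
from importlib.util import find_spec


def installed_detectors(candidates=("cchardet", "chardet")):
    """Return the candidate package names that can be imported here."""
    return [name for name in candidates if find_spec(name) is not None]


found = installed_detectors()
if "cchardet" not in found:
    print("cchardet not found; UnicodeDammit may be noticeably slower on large payloads")
```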

malvidin commented 4 years ago

Bumped the version, and added the performance notes for the plugins that use UnicodeDammit.