PaloAltoNetworks / minemeld-core

Engine of MineMeld
Apache License 2.0

Proposal: Support feeds using other encodings than UTF-8 #256

Status: Closed (closed by kidmose 6 years ago)

kidmose commented 6 years ago

I've encountered a feed that triggers the following error:

/opt/minemeld/engine/0.9.44/local/lib/python2.7/site-packages/minemeld/ft/basepoller.py:510: UnicodeWarning: Unicode unequal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if oa.get(k, None) != na[k]:
2018-01-17T14:37:00 (305)basepoller._poll ERROR: Exception in polling loop for malwaredomainlist_com: Overlong 2 byte UTF-8 sequence detected when encoding string
Traceback (most recent call last):
  File "/opt/minemeld/engine/0.9.44/local/lib/python2.7/site-packages/minemeld/ft/basepoller.py", line 721, in _poll
    performed = self._polling_loop()
  File "/opt/minemeld/engine/0.9.44/local/lib/python2.7/site-packages/minemeld/ft/basepoller.py", line 648, in _polling_loop
    self.table.put(indicator, v)
  File "/opt/minemeld/engine/0.9.44/local/lib/python2.7/site-packages/minemeld/ft/basepoller.py", line 113, in put
    return self.table.put(indicator, value)
  File "/opt/minemeld/engine/0.9.44/local/lib/python2.7/site-packages/minemeld/ft/table.py", line 318, in put
    batch.put(ikey, struct.pack(">Q", cversion)+ujson.dumps(value))
OverflowError: Overlong 2 byte UTF-8 sequence detected when encoding string

A simple, approximate reproduction is as follows:

#!/usr/bin/python
# Python 2.7: '\xc1' is a one-byte str that is not valid UTF-8.
import ujson as json
l = '\xc1'
print(json.dumps(l))

This is solved by re-encoding the string to UTF-8, the encoding that appears to be hardcoded in MineMeld:

#!/usr/bin/python
import ujson as json
l = '\xc1'
# Decode the byte as Latin-1, then re-encode it as UTF-8 before serializing.
print(json.dumps(l.decode('latin_1').encode('utf_8')))
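
To make the byte-level picture explicit, here is a small sketch that uses only the standard library (no ujson), so it runs on both Python 2.7 and Python 3. It only illustrates why the byte is rejected and what the recoding produces; the variable names are mine and are not taken from MineMeld.

```python
raw = b'\xc1'  # the single byte from the reproduction above

# 0xC1 can never appear in well-formed UTF-8, so a strict decode fails.
try:
    raw.decode('utf-8')
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Latin-1 maps every byte to the code point with the same value,
# so decoding always succeeds; here it yields U+00C1.
text = raw.decode('latin_1')

# Re-encoding as UTF-8 turns U+00C1 into the two bytes 0xC3 0x81.
recoded = text.encode('utf_8')

print(valid_utf8)
print(repr(recoded))
```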

I suggest modifying the basepoller to support feeds in other encodings by re-encoding their content to UTF-8, which appears to be the encoding used within MineMeld.
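
A minimal sketch of what this could look like, assuming the raw feed content is available as bytes before it reaches the table. Both the helper `recode_to_utf8` and the per-feed `encoding` option are hypothetical names for illustration; neither is an existing MineMeld API or configuration key.

```python
import codecs


def recode_to_utf8(raw, encoding='utf-8', errors='strict'):
    """Hypothetical helper: decode raw feed bytes using the feed's
    declared encoding and return the same text as UTF-8 bytes."""
    # Fail early with LookupError if the encoding name is unknown.
    codecs.lookup(encoding)
    return raw.decode(encoding, errors).encode('utf-8')


# Hypothetical per-feed configuration; 'encoding' is an assumed option,
# defaulting to UTF-8 so existing feeds would behave as before.
feed_config = {'encoding': 'latin_1'}

# Example feed line containing the Latin-1 byte 0xE9.
line = b'caf\xe9.example.com'
utf8_line = recode_to_utf8(line, feed_config.get('encoding', 'utf-8'))

print(repr(utf8_line))
```

Defaulting to UTF-8 with strict error handling keeps the current behaviour for feeds that do not set the option, while a feed known to use another encoding could declare it explicitly.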

Does this make sense?