Status: closed (monoidic closed this 1 year ago)
I have already run into encoding issues in AS names in the past (https://github.com/certtools/intelmq/commit/fff0b6e5f3998ee6621350f270c95e5c3bd15111#diff-016f562585e3b1dd9bbc46a808574681f44f9f1f522890f9a8740182ea61e81f https://github.com/certtools/intelmq/commit/3f0983dcdca9a87f784195e9cbc71bd053caf764#diff-016f562585e3b1dd9bbc46a808574681f44f9f1f522890f9a8740182ea61e81f https://github.com/certtools/intelmq/commit/cb4948bab7735fa81a7f76c1982a38cd35228403#diff-016f562585e3b1dd9bbc46a808574681f44f9f1f522890f9a8740182ea61e81f #307), and I would now be in favour of using `.decode(errors='ignore')` to decode everything we can, rather than failing on bogus data like this example.
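A minimal sketch of what lenient decoding could look like, not the bot's actual code. The helper names `lenient_decode` and `lenient_roundtrip` are hypothetical, and the second helper assumes the name arrives as an already-decoded `str`, as in the `latin1` round trip quoted in the issue.

```python
def lenient_decode(raw: bytes) -> str:
    """Hypothetical helper: decode UTF-8 and silently drop invalid byte sequences."""
    return raw.decode('utf8', errors='ignore')


def lenient_roundtrip(name: str) -> str:
    """Hypothetical helper: latin1 round trip that drops whatever cannot be converted."""
    # errors='ignore' on encode drops characters outside latin1 (such as U+200F);
    # errors='ignore' on decode drops bytes that are not valid UTF-8.
    return name.encode('latin1', errors='ignore').decode('utf8', errors='ignore')
```

Note that `errors='ignore'` on `decode` alone only helps with invalid byte sequences. The right-to-left mark is valid UTF-8, so `lenient_decode` keeps it, and the `UnicodeEncodeError` from the `encode('latin1')` step only goes away if that step is made lenient as well.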
I discovered a slight issue with the Cymru whois expert bot with some strange AS names. For instance, AS266522.
`\226\128\143`, in Unicode, decodes to U+200F, the right-to-left mark. When the IMQ bot attempts to decode it with `result['as_name'] = items[4].encode('latin1').decode('utf8')`, it runs into this issue: `UnicodeEncodeError: 'latin-1' codec can't encode character '\u200f' in position 42: ordinal not in range(256)`
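The failure can be reproduced in isolation. This is only a sketch: the function name `latin1_roundtrip` and the sample name `'EXAMPLE'` are made up for illustration, and it assumes the AS name reaches the bot as a `str` that already contains U+200F.

```python
# The decimal escapes \226\128\143 are the bytes 0xE2 0x80 0x8F,
# which is the UTF-8 encoding of U+200F (right-to-left mark).
RTL_MARK = bytes([226, 128, 143]).decode('utf8')


def latin1_roundtrip(name: str) -> str:
    """Same conversion as the quoted line: re-encode as latin1, decode as UTF-8."""
    return name.encode('latin1').decode('utf8')


def roundtrip_fails(name: str) -> bool:
    """Return True if the round trip raises UnicodeEncodeError for this name."""
    try:
        latin1_roundtrip(name)
    except UnicodeEncodeError:
        return True
    return False


if __name__ == '__main__':
    # 'EXAMPLE' is a placeholder, not the real AS name.
    print(roundtrip_fails('EXAMPLE' + RTL_MARK))
```

The round trip only works when every character in the string is below U+0100, since `latin1` can encode nothing else.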
A solution which just permits decoding special symbols like RTL marks might not be the best, though just discarding these events into the dump file is also not a good option.
Various online tools for looking up public data on ASes, found via a quick search, give varying results.
â€
andaâ
)‏
8207
ÂÂ
I suppose there's no existing expectation of invalid UTF-8 sequences or special symbols like RTL marks in AS names, nor any "standard" way of handling them, and I feel iffy about passing arbitrary symbols like that on to other systems.
Are there any ideas for how to handle cases like this? The simplest option would be to pass the undecoded string on if the decoding step fails.
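The fallback described above could be sketched roughly as follows. `decode_as_name` is a hypothetical helper, not an existing IntelMQ function, and it assumes the input is the `str` taken from `items[4]`.

```python
def decode_as_name(name: str) -> str:
    """Hypothetical helper: try the latin1/UTF-8 round trip, else keep the input."""
    try:
        return name.encode('latin1').decode('utf8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # UnicodeEncodeError: the name has characters outside latin1 (e.g. U+200F).
        # UnicodeDecodeError: the latin1 bytes are not valid UTF-8.
        # In both cases, pass the original string on unchanged.
        return name
```

With this approach the event is not dumped, but the RTL mark is still passed on to downstream systems, so it does not address the concern about forwarding arbitrary symbols.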