InQuest / iocextract

Defanged Indicator of Compromise (IOC) Extractor.
https://inquest.readthedocs.io/projects/iocextract/
GNU General Public License v2.0
505 stars, 91 forks

Failed to parse URL correctly #38

Closed: ninoseki closed this issue 4 years ago

ninoseki commented 4 years ago

A URL which is surrounded by Japanese characters is not parsed correctly.

import iocextract

print(list(iocextract.extract_urls('『http://example.com』あああああ')))
# => ['http://example.com』あああああ']

# Expected: ['http://example.com']

I'm not sure how to fix it, but I think checking the TLD might work well.
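To illustrate the TLD-checking idea, here is a minimal standalone sketch. It is not iocextract's implementation and does not reflect how the library's regular expressions are written. The names `KNOWN_TLDS`, `URL_WITH_TLD`, and `extract_urls_checking_tld` are hypothetical, and the TLD set is a tiny placeholder; a real check would presumably use the full IANA list. The assumption is that the hostname is ASCII and that anything after the TLD that is not an ASCII path, query, or fragment should be cut off.

```python
import re

# Hypothetical, deliberately tiny allow-list (assumption for illustration only).
KNOWN_TLDS = {"com", "net", "org", "jp"}

# Scheme, ASCII hostname ending in an alphabetic TLD, then an optional
# path/query/fragment made of printable ASCII characters only.
URL_WITH_TLD = re.compile(
    r"https?://"
    r"(?:[A-Za-z0-9-]+\.)+([A-Za-z]{2,})(?![A-Za-z0-9-])"
    r"(?:[/?#][!-~]*)?"
)


def extract_urls_checking_tld(text):
    """Yield URLs whose hostname ends in a known TLD, without trailing non-ASCII text."""
    for match in URL_WITH_TLD.finditer(text):
        if match.group(1).lower() in KNOWN_TLDS:
            yield match.group(0)


print(list(extract_urls_checking_tld('『http://example.com』あああああ')))
# => ['http://example.com']
```

Because the match stops at the first character that cannot belong to an ASCII hostname or path, the closing bracket 』 and the text after it are left out. The trade-off is that internationalized domain names and non-ASCII paths would be truncated or missed by this sketch.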

cmmorrow commented 4 years ago

Hello @ninoseki, I'll take a look at this and see if I can adjust the regular expression to get this to work.

cmmorrow commented 4 years ago

I think I have a solution. This works:

echo "『http://example.com』インコ\u1f99c" | python iocextract.py
http://example.com