InQuest / iocextract

Defanged Indicator of Compromise (IOC) Extractor.
https://inquest.readthedocs.io/projects/iocextract/
GNU General Public License v2.0

Subdomains and IPs in URLs are not always parsed correctly #29

Closed by JayFields 5 years ago

JayFields commented 5 years ago

Given defanged URLs with an IP address or a subdomain such as:

hXXps://192.168.149[.]100/api/info
hXXps://subdomain.example[.]com/some/path

The GENERIC_URL_RE regex returns the correct results. However, because the same text is also parsed with the BRACKET_URL_RE regex, additional invalid results are returned:

http://149.100/api/info
http://example.com/some/path

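A simple change seems to fix the problem, assuming I'm not missing some false-positive scenario: allow a literal dot in the leading character class of BRACKET_URL_RE, so the match can extend back across the dotted labels that precede the bracketed dot.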

diff --git a/iocextract.py b/iocextract.py
index 8fdb374..dcd25dd 100644
--- a/iocextract.py
+++ b/iocextract.py
@@ -66,7 +66,7 @@ GENERIC_URL_RE = re.compile(r"""
 BRACKET_URL_RE = re.compile(r"""
         \b
         (
-            [\:\/\\\w\[\]\(\)-]+
+            [\.\:\/\\\w\[\]\(\)-]+
             (?:
                 \x20?
                 [\(\[]
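
The effect of the one-character change can be reproduced with a cut-down pattern. This is only a sketch: build_bracket_re is a hypothetical, simplified stand-in for BRACKET_URL_RE (leading character class, a literal "[.]", then the rest of the token), not the library's full regex, and OLD_CLASS, NEW_CLASS, and TEXT are names made up for this example.

```python
import re

# Leading character class before and after the proposed change.
OLD_CLASS = r"[\:\/\\\w\[\]\(\)-]+"
NEW_CLASS = r"[\.\:\/\\\w\[\]\(\)-]+"


def build_bracket_re(leading_class):
    # Simplified stand-in (an assumption, not the real BRACKET_URL_RE):
    # leading characters, a bracketed dot, then everything up to whitespace.
    return re.compile(r"\b(" + leading_class + r"\[\.\]\S*)")


TEXT = (
    "hXXps://192.168.149[.]100/api/info "
    "hXXps://subdomain.example[.]com/some/path"
)

# Without "\." in the class, the match cannot start before the last plain dot,
# so only the final label before "[.]" is captured.
old_matches = build_bracket_re(OLD_CLASS).findall(TEXT)
print(old_matches)
# ['149[.]100/api/info', 'example[.]com/some/path']

# With "\." in the class, the match reaches back to the start of the URL.
new_matches = build_bracket_re(NEW_CLASS).findall(TEXT)
print(new_matches)
# ['hXXps://192.168.149[.]100/api/info', 'hXXps://subdomain.example[.]com/some/path']
```

Once refanged, the truncated matches from the old class correspond to the invalid http://149.100/api/info and http://example.com/some/path results reported above. The new class yields the same full URLs that GENERIC_URL_RE already finds.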

rshipp commented 5 years ago

Interesting, thanks for the issue and suggested fix! I'll dig into this a little and see if I can get a new version pushed out to fix this and #27 on Friday.

rshipp commented 5 years ago

Alright, I wrote some tests for this and your change looks perfect. I'll get it shipped tomorrow. :+1:

JayFields commented 5 years ago

Great, thanks!

rshipp commented 5 years ago

Pushed to PyPI as v1.13.0. Thanks again for the fix.