URL path defang and Email extraction

krispimk commented 6 years ago

I noticed that if the URL was something like this: hxxps://momorfheinz[.]usa[.]cc/login[.]microsoftonline[.]com, refanging only fixed the netloc portion of the URL and left the [.] in the path untouched. I also made a change to the email regex.

What do you think?

diff --git a/iocextract.py b/iocextract.py
index 814ad8a..fc2d80b 100644
--- a/iocextract.py
+++ b/iocextract.py
@@ -124,7 +124,7 @@ IPV6_RE = re.compile(r"""
         \b(?:[a-f0-9]{1,4}:|:){2,7}(?:[a-f0-9]{1,4}|:)\b
     """, re.IGNORECASE | re.VERBOSE)

-EMAIL_RE = re.compile(r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)")
+EMAIL_RE = re.compile(r"([a-zA-Z0-9_.+-]+[\s]*@[\s]*[a-zA-Z0-9-]+[[]*\.[]]*[a-zA-Z0-9-.]+)")
 MD5_RE = re.compile(r"(?:[^a-fA-F\d]|\b)([a-fA-F\d]{32})(?:[^a-fA-F\d]|\b)")
 SHA1_RE = re.compile(r"(?:[^a-fA-F\d]|\b)([a-fA-F\d]{40})(?:[^a-fA-F\d]|\b)")
 SHA256_RE = re.compile(r"(?:[^a-fA-F\d]|\b)([a-fA-F\d]{64})(?:[^a-fA-F\d]|\b)")
@@ -247,7 +247,7 @@ def extract_emails(data):
     :rtype: Iterator[:class:`str`]
     """
     for email in EMAIL_RE.finditer(data):
-        yield email.group(0)
+        yield email.group(0).replace(" ", "").replace("[.]", ".")

 def extract_hashes(data):
     """Extract MD5/SHA hashes.
@@ -420,6 +420,7 @@ def refang_url(url):
     # Fix example[.]com, but keep RFC 2732 URLs intact.
     if not _is_ipv6_url(url):
         parsed = parsed._replace(netloc=parsed.netloc.replace('[', '').replace(']', ''))
+        parsed = parsed._replace(path=parsed.path.replace('[.]', '.'))

     return parsed.geturl()

rshipp commented 6 years ago

Hi, thanks for the issue!

I've left some thoughts below. Feel free to open a PR, and let me know if you have any questions.

URL path defang

I like this change and how you implemented it. I tried to stay away from making decisions on what valid things are "defanged", but this one specifically seems worth an exception. It's probably more likely that [.] in the path portion of a URL is a defang than a part of the original URL.
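A minimal sketch of the path fix discussed above. It assumes the netloc has already been refanged, because some Python versions reject bracket characters in the netloc that are not an IPv6 literal, so the example keeps brackets out of that part. `refang_path_sketch` is a hypothetical name used for illustration, not a function in iocextract.

```python
from urllib.parse import urlparse


def refang_path_sketch(url):
    """Replace '[.]' with '.' in the path portion of a URL only.

    Hypothetical helper for illustration; not part of iocextract.
    """
    parsed = urlparse(url)
    # Only the path is rewritten; scheme, netloc, query and fragment are kept.
    parsed = parsed._replace(path=parsed.path.replace('[.]', '.'))
    return parsed.geturl()


print(refang_path_sketch(
    "https://momorfheinz.usa.cc/login[.]microsoftonline[.]com"))
# prints https://momorfheinz.usa.cc/login.microsoftonline.com
```

The trade-off is the one noted above: a path that legitimately contains the literal text [.] would be altered, which is assumed here to be rarer than a defanged path.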

Email regex

If we're adding support for defanged email addresses, we should keep it in line with our other defang support, e.g. copy this segment from the BRACKET regex in place of your [[]*\.[]]*[a-zA-Z0-9-.]+:

which will let us match domains like example[.]com, example (.] com, etc. We can tighten it a bit by changing \S to [A-Za-z0-9-] since we don't care about paths. You'll need to add the re.VERBOSE flag to use the multiline regex.

For your [\s]*@[\s]* portion, can you change that to be a little stricter? Something like \x20?@\x20? maybe? Unless you're seeing defangs with tabs and/or multiple spaces.

Email refang

I'd like to see this change implemented a little differently. I think you should be able to call _refang_common(email.group(0)) - if that doesn't work, let me know. You'll also want to add a refang=False optarg to the extract_emails function and an if/else to decide whether to refang - in the same way the extract_ipv4s function does, for example. Finally, be sure to modify extract_iocs to pass in refang=refang to the email function.
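The refang switch described above could be sketched roughly as follows. `EMAIL_SKETCH_RE`, `refang_common_sketch` and `extract_emails_sketch` are simplified, hypothetical stand-ins for the library's regex, its `_refang_common` helper and `extract_emails`; the real helper may handle more defang styles than the single `[.]` case shown here.

```python
import re

# Simplified, hypothetical pattern: only matches '[.]'-defanged domains.
EMAIL_SKETCH_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\[\.\][\w-]+)+")


def refang_common_sketch(ioc):
    """Stand-in for _refang_common; only undoes the '[.]' defang."""
    return ioc.replace('[.]', '.')


def extract_emails_sketch(data, refang=False):
    """Yield email addresses, refanging them only when refang is True."""
    for email in EMAIL_SKETCH_RE.finditer(data):
        if refang:
            yield refang_common_sketch(email.group(0))
        else:
            yield email.group(0)


sample = "contact admin@example[.]com now"
print(list(extract_emails_sketch(sample)))
# prints ['admin@example[.]com']
print(list(extract_emails_sketch(sample, refang=True)))
# prints ['admin@example.com']
```

Defaulting `refang` to False keeps existing callers' output unchanged, and a caller such as `extract_iocs` can forward its own flag with `refang=refang`.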

krispimk commented 6 years ago

Would you like to make those changes yourself and push it? The email part will be awesome to have.

rshipp commented 6 years ago

Sure, I can probably get it in the next week or so if you don't want to.

krispimk commented 6 years ago

No worries, I made some changes. Want to double-check them for me?

diff --git a/iocextract.py b/iocextract.py
index 814ad8a..36e6de0 100644
--- a/iocextract.py
+++ b/iocextract.py
@@ -21,6 +21,25 @@ except ImportError:

 import ipaddress

+BRACKET_EMAIL_RE = re.compile(r"""
+        \b
+        (
+            [\w]+[\s]*@[\s]*[\w]+
+            (?:
+                \x20?
+                [\(\[]
+                \x20?
+                \.
+                \x20?
+                [\]\)]
+                \x20?
+                \S*?
+            )+
+        )
+        [\.\?>\"'\)!,}:;\]]*
+        (?=\s|$)
+    """, re.VERBOSE)
+
 # Get basic url format, including a few obfuscation techniques, main anchor is the uri scheme
 GENERIC_URL_RE = re.compile(r"""
         (
@@ -124,7 +143,6 @@ IPV6_RE = re.compile(r"""
         \b(?:[a-f0-9]{1,4}:|:){2,7}(?:[a-f0-9]{1,4}|:)\b
     """, re.IGNORECASE | re.VERBOSE)

-EMAIL_RE = re.compile(r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)")
 MD5_RE = re.compile(r"(?:[^a-fA-F\d]|\b)([a-fA-F\d]{32})(?:[^a-fA-F\d]|\b)")
 SHA1_RE = re.compile(r"(?:[^a-fA-F\d]|\b)([a-fA-F\d]{40})(?:[^a-fA-F\d]|\b)")
 SHA256_RE = re.compile(r"(?:[^a-fA-F\d]|\b)([a-fA-F\d]{64})(?:[^a-fA-F\d]|\b)")
@@ -158,13 +176,13 @@ def extract_iocs(data, refang=False, strip=False):
     :param bool strip: Strip possible garbage from the end of URLs
     :rtype: :py:func:`itertools.chain`
     """
-    return itertools.chain(
+    return set(itertools.chain(
         extract_urls(data, refang=refang, strip=strip),
         extract_ips(data, refang=refang),
-        extract_emails(data),
+        extract_emails(data, refang=refang),
         extract_hashes(data),
         extract_yara_rules(data)
-    )
+    ))

 def extract_urls(data, refang=False, strip=False):
     """Extract URLs.
@@ -174,6 +192,7 @@ def extract_urls(data, refang=False, strip=False):
     :param bool strip: Strip possible garbage from the end of URLs
     :rtype: Iterator[:class:`str`]
     """
+
     unencoded_urls = itertools.chain(
         GENERIC_URL_RE.finditer(data),
         BRACKET_URL_RE.finditer(data),
@@ -191,12 +210,14 @@ def extract_urls(data, refang=False, strip=False):
         yield url

     for url in HEXENCODED_URL_RE.finditer(data):
+        print ".....", url
         if refang:
             yield binascii.unhexlify(url.group(1)).decode('utf-8')
         else:
             yield url.group(1)

     for url in URLENCODED_URL_RE.finditer(data):
+        print "~~~~~~", url
         if refang:
             yield unquote(url.group(1))
         else:
@@ -240,14 +261,25 @@ def extract_ipv6s(data):
     for ip_address in IPV6_RE.finditer(data):
         yield ip_address.group(0)

-def extract_emails(data):
+def extract_emails(data, refang=False, strip=False):
     """Extract email addresses

     :param data: Input text
     :rtype: Iterator[:class:`str`]
     """
-    for email in EMAIL_RE.finditer(data):
-        yield email.group(0)
+
+
+    unencoded_emails = itertools.chain(
+        BRACKET_EMAIL_RE.finditer(data),
+    )
+
+    for email in unencoded_emails:
+        if refang:
+            email = _refang_common(email.group(0))
+        else:
+            email = email.group(1)
+
+        yield email

 def extract_hashes(data):
     """Extract MD5/SHA hashes.
@@ -416,6 +448,7 @@ def refang_url(url):

     # Remove artifacts from common defangs.
     parsed = parsed._replace(netloc=_refang_common(parsed.netloc))
+    parsed = parsed._replace(path=_refang_common(parsed.path))

     # Fix example[.]com, but keep RFC 2732 URLs intact.
     if not _is_ipv6_url(url):
[iocextract copy.txt](https://github.com/InQuest/python-iocextract/files/2312069/iocextract.copy.txt)

rshipp commented 6 years ago

+    return set(itertools.chain(

Don't use set here. It will consume the whole generator chain immediately, so we lose all the performance benefits of lazy evaluation.
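The point about set can be illustrated with a small, self-contained sketch. The `traced` generator and the sample values are made up for the demonstration; they stand in for the extractor functions.

```python
import itertools


def traced(label, items, log):
    """Yield items one by one, recording each yield in log."""
    for item in items:
        log.append(label)
        yield item


# Lazy: nothing runs until the caller asks for a value.
lazy_log = []
lazy = itertools.chain(
    traced("urls", ["http://example.com"], lazy_log),
    traced("emails", ["a@example.com", "b@example.com"], lazy_log),
)
first = next(lazy)  # only one item has been produced so far

# Eager: set() drains every generator before returning.
eager_log = []
eager = set(itertools.chain(
    traced("urls", ["http://example.com"], eager_log),
    traced("emails", ["a@example.com", "b@example.com"], eager_log),
))

print(len(lazy_log), len(eager_log))
# prints 1 3
```

A set also discards duplicates and ordering, so a caller who wants unique results can still wrap the returned iterator in `set()` on their side.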

+    unencoded_emails = itertools.chain(
+        BRACKET_EMAIL_RE.finditer(data),
+    )
+
+    for email in unencoded_emails:

Can you shorten this to for email in BRACKET_EMAIL_RE.finditer(data)? No need for itertools.chain here since we only have one iterator.

+        else:
+            email = email.group(1)

I think this should still be email.group(0), not 1, unless I'm missing something.
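As a quick check of the group question, the sketch below runs the BRACKET_EMAIL_RE pattern exactly as posted in the diff above against a made-up sentence. With this particular pattern the two groups can differ: group(1) is the capture without trailing punctuation, while group(0) is the whole match including it. Which one is preferable is a design choice; the example only shows the behaviour.

```python
import re

# Pattern copied from the diff above.
BRACKET_EMAIL_RE = re.compile(r"""
        \b
        (
            [\w]+[\s]*@[\s]*[\w]+
            (?:
                \x20?
                [\(\[]
                \x20?
                \.
                \x20?
                [\]\)]
                \x20?
                \S*?
            )+
        )
        [\.\?>\"'\)!,}:;\]]*
        (?=\s|$)
    """, re.VERBOSE)

match = BRACKET_EMAIL_RE.search("contact user@example[.]com, thanks")
print(repr(match.group(0)))
# prints 'user@example[.]com,'
print(repr(match.group(1)))
# prints 'user@example[.]com'
```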

Btw, the markdown syntax for code blocks is shown below. The single backticks are why your comments are showing up all wonky :smile: :

```
my code
```

It's easier to review and suggest changes on a PR too, if you're comfortable with that - otherwise feel free to continue commenting here, it works fine.

krispimk commented 6 years ago

Ok made those changes. Want me to commit the code? iocextract copy.txt

diff --git a/iocextract.py b/iocextract.py
index 814ad8a..a07ba6a 100644
--- a/iocextract.py
+++ b/iocextract.py
@@ -21,6 +21,25 @@ except ImportError:

 import ipaddress

+BRACKET_EMAIL_RE = re.compile(r"""
+        \b
+        (
+            [\w]+[\s]*@[\s]*[\w]+
+            (?:
+                \x20?
+                [\(\[]
+                \x20?
+                \.
+                \x20?
+                [\]\)]
+                \x20?
+                \S*?
+            )+
+        )
+        [\.\?>\"'\)!,}:;\]]*
+        (?=\s|$)
+    """, re.VERBOSE)
+
 # Get basic url format, including a few obfuscation techniques, main anchor is the uri scheme
 GENERIC_URL_RE = re.compile(r"""
         (
@@ -124,7 +143,6 @@ IPV6_RE = re.compile(r"""
         \b(?:[a-f0-9]{1,4}:|:){2,7}(?:[a-f0-9]{1,4}|:)\b
     """, re.IGNORECASE | re.VERBOSE)

-EMAIL_RE = re.compile(r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)")
 MD5_RE = re.compile(r"(?:[^a-fA-F\d]|\b)([a-fA-F\d]{32})(?:[^a-fA-F\d]|\b)")
 SHA1_RE = re.compile(r"(?:[^a-fA-F\d]|\b)([a-fA-F\d]{40})(?:[^a-fA-F\d]|\b)")
 SHA256_RE = re.compile(r"(?:[^a-fA-F\d]|\b)([a-fA-F\d]{64})(?:[^a-fA-F\d]|\b)")
@@ -161,7 +179,7 @@ def extract_iocs(data, refang=False, strip=False):
     return itertools.chain(
         extract_urls(data, refang=refang, strip=strip),
         extract_ips(data, refang=refang),
-        extract_emails(data),
+        extract_emails(data, refang=refang),
         extract_hashes(data),
         extract_yara_rules(data)
     )
@@ -174,6 +192,7 @@ def extract_urls(data, refang=False, strip=False):
     :param bool strip: Strip possible garbage from the end of URLs
     :rtype: Iterator[:class:`str`]
     """
+
     unencoded_urls = itertools.chain(
         GENERIC_URL_RE.finditer(data),
         BRACKET_URL_RE.finditer(data),
@@ -191,12 +210,14 @@ def extract_urls(data, refang=False, strip=False):
         yield url

     for url in HEXENCODED_URL_RE.finditer(data):
+        print ".....", url
         if refang:
             yield binascii.unhexlify(url.group(1)).decode('utf-8')
         else:
             yield url.group(1)

     for url in URLENCODED_URL_RE.finditer(data):
+        print "~~~~~~", url
         if refang:
             yield unquote(url.group(1))
         else:
@@ -240,14 +261,20 @@ def extract_ipv6s(data):
     for ip_address in IPV6_RE.finditer(data):
         yield ip_address.group(0)

-def extract_emails(data):
+def extract_emails(data, refang=False, strip=False):
     """Extract email addresses

     :param data: Input text
     :rtype: Iterator[:class:`str`]
     """
-    for email in EMAIL_RE.finditer(data):
-        yield email.group(0)
+
+    for email in BRACKET_EMAIL_RE.finditer(data):
+        if refang:
+            email = _refang_common(email.group(0))
+        else:
+            email = email.group(0)
+
+        yield email

 def extract_hashes(data):
     """Extract MD5/SHA hashes.
@@ -416,6 +443,7 @@ def refang_url(url):

     # Remove artifacts from common defangs.
     parsed = parsed._replace(netloc=_refang_common(parsed.netloc))
+    parsed = parsed._replace(path=_refang_common(parsed.path))

     # Fix example[.]com, but keep RFC 2732 URLs intact.
     if not _is_ipv6_url(url):

rshipp commented 6 years ago

👍 That looks good. Once you submit I'll write some unit tests to double check everything, update the docs and get this pushed out. Thanks!

krispimk commented 6 years ago

I'm getting a 403 on push

fatal: unable to access 'https://github.com/InQuest/python-iocextract.git/': The requested URL returned error: 403

rshipp commented 6 years ago

Sounds like you just have a local clone of this repo. You'll want to:

1. Fork python-iocextract to your own account.
2. Update your clone's remote to use your fork (git remote set-url origin https://github.com/mokarimi/python-iocextract.git).
3. Push your changes to your fork (git push -u origin master).
4. Open a pull request to this repo.

rshipp commented 6 years ago

Just pushed v1.7.0 to PyPI with your changes. You should be able to upgrade with:

pip install -U iocextract

Thanks for your work! Feel free to open another issue if you notice anything wrong. We'll announce this release on Twitter later, do you have an account there you'd like us to mention?

Also, if you add your committer email to your GitHub account, you should show up here: https://github.com/InQuest/python-iocextract/graphs/contributors. It's just not showing now because that email isn't linked to your account.

krispimk commented 6 years ago

FYI,

Just tried an email that looked like this and it didn't work:

office-account-team-security-account.live.com-noreply- @ Microsoft[.]com

rshipp commented 6 years ago

With the spaces around the @ symbol? We'll need to add support for that, it currently only supports spaces around the [.] symbols.
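The gap described here can be reproduced with a simplified pattern. This is a sketch only: `build_email_sketch` is a hypothetical helper, and the pattern it builds is not the one used in any iocextract release. It compares a version that requires the @ to touch both sides with one that tolerates a single space on each side.

```python
import re


def build_email_sketch(at_sign):
    """Build a simplified defanged-email pattern around the given @ fragment.

    Hypothetical helper for illustration; not part of iocextract.
    """
    return re.compile(
        r"[a-zA-Z0-9_.+-]+" + at_sign + r"[a-zA-Z0-9-]+"
        r"(?:\x20?[\(\[]\x20?\.\x20?[\]\)]\x20?[a-zA-Z0-9-]+)+"
    )


reported = ("office-account-team-security-account.live.com-noreply-"
            " @ Microsoft[.]com")

strict = build_email_sketch(r"@")               # no spaces allowed around @
tolerant = build_email_sketch(r"\x20?@\x20?")   # one optional space each side

print(strict.search(reported))
# prints None
print(tolerant.search(reported).group(0) == reported)
# prints True
```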

krispimk commented 6 years ago

yeah, there are spaces around it.

rshipp commented 6 years ago

Alright thanks for the report, I'll get a fix for that up asap.

rshipp commented 6 years ago

Pushed this as 1.7.2. Let me know if you find anything else!
