Closed krispimk closed 6 years ago
Hi, thanks for the issue!
I've left some thoughts below. Feel free to open a PR, and let me know if you have any questions.
I like this change and how you implemented it. I tried to stay away from making decisions on what valid things are "defanged", but this one specifically seems worth an exception. It's probably more likely that `[.]` in the path portion of a URL is a defang than a part of the original URL.
If we're adding support for defanged email addresses, we should keep it in line with our other defang support, e.g. copy this segment from the `BRACKET` regex in place of your `[[]*\.[]]*[a-zA-Z0-9-.]+`:
(?:
\x20?
[\(\[]
\x20?
\.
\x20?
[\]\)]
\x20?
\S*?
)+
which will let us match domains like `example[.]com`, `example (.] com`, etc. We can tighten it a bit by changing `\S` to `[A-Za-z0-9-]` since we don't care about paths. You'll need to add the `re.VERBOSE` flag to use the multiline regex.
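To make the suggestion concrete, here is a minimal sketch of that segment with `\S` tightened to `[A-Za-z0-9-]`. The name `DEFANGED_DOMAIN_RE`, the leading label, the greedy quantifier on the trailing label, and the `find_defanged_domains` helper are assumptions for illustration, not code from iocextract.

```python
import re

# Hypothetical pattern: a leading label followed by one or more
# bracketed dots, each optionally padded with a single space.
DEFANGED_DOMAIN_RE = re.compile(r"""
    [A-Za-z0-9-]+       # first label
    (?:
        \x20?
        [\(\[]          # opening bracket or paren
        \x20?
        \.
        \x20?
        [\]\)]          # closing bracket or paren
        \x20?
        [A-Za-z0-9-]*   # next label, tightened from any non-space
    )+
    """, re.VERBOSE)


def find_defanged_domains(text):
    """Return every defanged-looking domain found in text."""
    return [m.group(0) for m in DEFANGED_DOMAIN_RE.finditer(text)]
```

Calling `find_defanged_domains("see example[.]com and example (.] com today")` should pick up both spellings mentioned above.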
For your `[\s]*@[\s]*` portion, can you change that to be a little stricter? Something like `\x20?@\x20?` maybe? Unless you're seeing defangs with tabs and/or multiple spaces.
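The difference in strictness is easy to see side by side. The sketch below compares the two separators on their own; the names `LOOSE_AT_RE` and `STRICT_AT_RE` and the sample strings are made up for the comparison.

```python
import re

# Loose form from the proposed change: any run of whitespace around "@".
LOOSE_AT_RE = re.compile(r"\w+[\s]*@[\s]*\w+")
# Stricter form suggested in review: at most one literal space per side.
STRICT_AT_RE = re.compile(r"\w+\x20?@\x20?\w+")

samples = [
    "user@example",        # no padding
    "user @ example",      # single spaces
    "user\t@\texample",    # tabs
    "user  @  example",    # double spaces
]

loose_hits = [bool(LOOSE_AT_RE.fullmatch(s)) for s in samples]
strict_hits = [bool(STRICT_AT_RE.fullmatch(s)) for s in samples]
```

The loose form accepts all four samples, while the strict form rejects the tab and double-space variants.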
I'd like to see this change implemented a little differently. I think you should be able to call `_refang_common(email.group(0))` - if that doesn't work, let me know. You'll also want to add a `refang=False` optarg to the `extract_emails` function and an if/else to decide whether to refang - in the same way the `extract_ipv4s` function does, for example. Finally, be sure to modify `extract_iocs` to pass in `refang=refang` to the email function.
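Put together, the requested shape might look like the sketch below. `_refang_common` here is a simplified stand-in written for this example (the real helper in iocextract may behave differently), `EMAIL_RE` is a cut-down pattern rather than the one under review, and `extract_iocs` is reduced to the email call only.

```python
import re

# Simplified stand-in pattern: local part, "@", then a domain whose dots
# may be wrapped in brackets or parentheses.
EMAIL_RE = re.compile(
    r"[A-Za-z0-9_.+-]+@[A-Za-z0-9-]+(?:[\(\[]?\.[\]\)]?[A-Za-z0-9-]+)+"
)


def _refang_common(ioc):
    """Hypothetical stand-in: strip brackets and parentheses around dots."""
    return re.sub(r"[\(\[]\.[\]\)]", ".", ioc)


def extract_emails(data, refang=False):
    """Yield email addresses, refanging them only when asked to."""
    for email in EMAIL_RE.finditer(data):
        if refang:
            yield _refang_common(email.group(0))
        else:
            yield email.group(0)


def extract_iocs(data, refang=False):
    """Pass the caller's refang choice through to the email extractor."""
    return extract_emails(data, refang=refang)
```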
Would you like to make those changes yourself and push it? The email part will be awesome to have.
Sure, I can probably get it in the next week or so if you don't want to.
no worries, I made some changes, want to double check it for me?
diff --git a/iocextract.py b/iocextract.py
index 814ad8a..36e6de0 100644
--- a/iocextract.py
+++ b/iocextract.py
@@ -21,6 +21,25 @@ except ImportError:
import ipaddress
+BRACKET_EMAIL_RE = re.compile(r"""
+ \b
+ (
+ [\w]+[\s]*@[\s]*[\w]+
+ (?:
+ \x20?
+ [\(\[]
+ \x20?
+ \.
+ \x20?
+ [\]\)]
+ \x20?
+ \S*?
+ )+
+ )
+ [\.\?>\"'\)!,}:;\]]*
+ (?=\s|$)
+ """, re.VERBOSE)
+
# Get basic url format, including a few obfuscation techniques, main anchor is the uri scheme
GENERIC_URL_RE = re.compile(r"""
(
@@ -124,7 +143,6 @@ IPV6_RE = re.compile(r"""
\b(?:[a-f0-9]{1,4}:|:){2,7}(?:[a-f0-9]{1,4}|:)\b
""", re.IGNORECASE | re.VERBOSE)
-EMAIL_RE = re.compile(r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)")
MD5_RE = re.compile(r"(?:[^a-fA-F\d]|\b)([a-fA-F\d]{32})(?:[^a-fA-F\d]|\b)")
SHA1_RE = re.compile(r"(?:[^a-fA-F\d]|\b)([a-fA-F\d]{40})(?:[^a-fA-F\d]|\b)")
SHA256_RE = re.compile(r"(?:[^a-fA-F\d]|\b)([a-fA-F\d]{64})(?:[^a-fA-F\d]|\b)")
@@ -158,13 +176,13 @@ def extract_iocs(data, refang=False, strip=False):
:param bool strip: Strip possible garbage from the end of URLs
:rtype: :py:func:`itertools.chain`
"""
- return itertools.chain(
+ return set(itertools.chain(
extract_urls(data, refang=refang, strip=strip),
extract_ips(data, refang=refang),
- extract_emails(data),
+ extract_emails(data, refang=refang),
extract_hashes(data),
extract_yara_rules(data)
- )
+ ))
def extract_urls(data, refang=False, strip=False):
"""Extract URLs.
@@ -174,6 +192,7 @@ def extract_urls(data, refang=False, strip=False):
:param bool strip: Strip possible garbage from the end of URLs
:rtype: Iterator[:class:`str`]
"""
+
unencoded_urls = itertools.chain(
GENERIC_URL_RE.finditer(data),
BRACKET_URL_RE.finditer(data),
@@ -191,12 +210,14 @@ def extract_urls(data, refang=False, strip=False):
yield url
for url in HEXENCODED_URL_RE.finditer(data):
+ print ".....", url
if refang:
yield binascii.unhexlify(url.group(1)).decode('utf-8')
else:
yield url.group(1)
for url in URLENCODED_URL_RE.finditer(data):
+ print "~~~~~~", url
if refang:
yield unquote(url.group(1))
else:
@@ -240,14 +261,25 @@ def extract_ipv6s(data):
for ip_address in IPV6_RE.finditer(data):
yield ip_address.group(0)
-def extract_emails(data):
+def extract_emails(data, refang=False, strip=False):
"""Extract email addresses
:param data: Input text
:rtype: Iterator[:class:`str`]
"""
- for email in EMAIL_RE.finditer(data):
- yield email.group(0)
+
+
+ unencoded_emails = itertools.chain(
+ BRACKET_EMAIL_RE.finditer(data),
+ )
+
+ for email in unencoded_emails:
+ if refang:
+ email = _refang_common(email.group(0))
+ else:
+ email = email.group(1)
+
+ yield email
def extract_hashes(data):
"""Extract MD5/SHA hashes.
@@ -416,6 +448,7 @@ def refang_url(url):
# Remove artifacts from common defangs.
parsed = parsed._replace(netloc=_refang_common(parsed.netloc))
+ parsed = parsed._replace(path=_refang_common(parsed.path))
# Fix example[.]com, but keep RFC 2732 URLs intact.
if not _is_ipv6_url(url):
[iocextract copy.txt](https://github.com/InQuest/python-iocextract/files/2312069/iocextract.copy.txt)
+ return set(itertools.chain(
Don't use `set` here; that will exhaust the generator immediately and lose all the performance benefits.
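To illustrate the point about laziness: wrapping a chained generator in `set` runs every extractor to completion before the caller sees anything, while a bare `itertools.chain` does work only as items are requested. The `noisy_extractor` function below is made up purely to record when work happens.

```python
import itertools

calls = []  # records each item at the moment it is actually produced


def noisy_extractor(label, count):
    """Made-up generator that logs every item it yields."""
    for i in range(count):
        calls.append((label, i))
        yield "{}-{}".format(label, i)


# Lazy: building the chain produces nothing yet.
lazy = itertools.chain(noisy_extractor("url", 3), noisy_extractor("email", 3))
produced_before_iteration = len(calls)

# Asking for one item does exactly one item's worth of work.
first = next(lazy)
produced_after_one_item = len(calls)

# Eager: set() drains both generators immediately.
eager = set(itertools.chain(noisy_extractor("url", 3), noisy_extractor("email", 3)))
produced_by_set = len(calls) - produced_after_one_item
```

A `set` also drops duplicates and ordering, so it changes what callers receive, not just when the work happens.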
+ unencoded_emails = itertools.chain(
+ BRACKET_EMAIL_RE.finditer(data),
+ )
+
+ for email in unencoded_emails:
Can you shorten this to `for email in BRACKET_EMAIL_RE.finditer(data)`? No need for `itertools.chain` here since we only have one iterator.
+ else:
+ email = email.group(1)
I think this should still be `email.group(0)`, not `1`, unless I'm missing something.
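For reference, the two groups are not identical for the regex in the diff above: group 1 is the parenthesized address, while group 0 is the whole match, which also includes any trailing punctuation consumed by the final character class. A small check, using a copy of that pattern under the assumed name `PATTERN` and a made-up input string:

```python
import re

# Copy of BRACKET_EMAIL_RE from the diff, renamed for this example.
PATTERN = re.compile(r"""
    \b
    (
        [\w]+[\s]*@[\s]*[\w]+
        (?:
            \x20?
            [\(\[]
            \x20?
            \.
            \x20?
            [\]\)]
            \x20?
            \S*?
        )+
    )
    [\.\?>\"'\)!,}:;\]]*
    (?=\s|$)
    """, re.VERBOSE)

match = PATTERN.search("Contact user@example[.]com.")
whole_match = match.group(0)   # everything the pattern consumed
captured = match.group(1)      # only the parenthesized address
```

Here `whole_match` keeps the sentence-ending period and `captured` does not, so the choice of group decides whether trailing punctuation is returned.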
Btw the markdown syntax for code blocks is this, the single backticks are why your comments are showing up all wonky :smile: :
```
my code
```
It's easier to review and suggest changes on a PR too, if you're comfortable with that - otherwise feel free to continue commenting here, it works fine.
Ok made those changes. Want me to commit the code? iocextract copy.txt
diff --git a/iocextract.py b/iocextract.py
index 814ad8a..a07ba6a 100644
--- a/iocextract.py
+++ b/iocextract.py
@@ -21,6 +21,25 @@ except ImportError:
import ipaddress
+BRACKET_EMAIL_RE = re.compile(r"""
+ \b
+ (
+ [\w]+[\s]*@[\s]*[\w]+
+ (?:
+ \x20?
+ [\(\[]
+ \x20?
+ \.
+ \x20?
+ [\]\)]
+ \x20?
+ \S*?
+ )+
+ )
+ [\.\?>\"'\)!,}:;\]]*
+ (?=\s|$)
+ """, re.VERBOSE)
+
# Get basic url format, including a few obfuscation techniques, main anchor is the uri scheme
GENERIC_URL_RE = re.compile(r"""
(
@@ -124,7 +143,6 @@ IPV6_RE = re.compile(r"""
\b(?:[a-f0-9]{1,4}:|:){2,7}(?:[a-f0-9]{1,4}|:)\b
""", re.IGNORECASE | re.VERBOSE)
-EMAIL_RE = re.compile(r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)")
MD5_RE = re.compile(r"(?:[^a-fA-F\d]|\b)([a-fA-F\d]{32})(?:[^a-fA-F\d]|\b)")
SHA1_RE = re.compile(r"(?:[^a-fA-F\d]|\b)([a-fA-F\d]{40})(?:[^a-fA-F\d]|\b)")
SHA256_RE = re.compile(r"(?:[^a-fA-F\d]|\b)([a-fA-F\d]{64})(?:[^a-fA-F\d]|\b)")
@@ -161,7 +179,7 @@ def extract_iocs(data, refang=False, strip=False):
return itertools.chain(
extract_urls(data, refang=refang, strip=strip),
extract_ips(data, refang=refang),
- extract_emails(data),
+ extract_emails(data, refang=refang),
extract_hashes(data),
extract_yara_rules(data)
)
@@ -174,6 +192,7 @@ def extract_urls(data, refang=False, strip=False):
:param bool strip: Strip possible garbage from the end of URLs
:rtype: Iterator[:class:`str`]
"""
+
unencoded_urls = itertools.chain(
GENERIC_URL_RE.finditer(data),
BRACKET_URL_RE.finditer(data),
@@ -191,12 +210,14 @@ def extract_urls(data, refang=False, strip=False):
yield url
for url in HEXENCODED_URL_RE.finditer(data):
+ print ".....", url
if refang:
yield binascii.unhexlify(url.group(1)).decode('utf-8')
else:
yield url.group(1)
for url in URLENCODED_URL_RE.finditer(data):
+ print "~~~~~~", url
if refang:
yield unquote(url.group(1))
else:
@@ -240,14 +261,20 @@ def extract_ipv6s(data):
for ip_address in IPV6_RE.finditer(data):
yield ip_address.group(0)
-def extract_emails(data):
+def extract_emails(data, refang=False, strip=False):
"""Extract email addresses
:param data: Input text
:rtype: Iterator[:class:`str`]
"""
- for email in EMAIL_RE.finditer(data):
- yield email.group(0)
+
+ for email in BRACKET_EMAIL_RE.finditer(data):
+ if refang:
+ email = _refang_common(email.group(0))
+ else:
+ email = email.group(0)
+
+ yield email
def extract_hashes(data):
"""Extract MD5/SHA hashes.
@@ -416,6 +443,7 @@ def refang_url(url):
# Remove artifacts from common defangs.
parsed = parsed._replace(netloc=_refang_common(parsed.netloc))
+ parsed = parsed._replace(path=_refang_common(parsed.path))
# Fix example[.]com, but keep RFC 2732 URLs intact.
if not _is_ipv6_url(url):
👍 That looks good. Once you submit I'll write some unit tests to double check everything, update the docs and get this pushed out. Thanks!
I'm getting a 403 on push
fatal: unable to access 'https://github.com/InQuest/python-iocextract.git/': The requested URL returned error: 403
Sounds like you just have a local clone of this repo. You'll want to:
git remote set-url origin https://github.com/mokarimi/python-iocextract.git
git push -u origin master
Just pushed v1.7.0 to PyPI with your changes. You should be able to upgrade with:
pip install -U iocextract
Thanks for your work! Feel free to open another issue if you notice anything wrong. We'll announce this release on Twitter later, do you have an account there you'd like us to mention?
Also, if you add your committer email to your GitHub account, you should show up here: https://github.com/InQuest/python-iocextract/graphs/contributors. It's just not showing now because that email isn't linked to your account.
FYI,
Just tried an email that looked like this and it didn't work:
office-account-team-security-account.live.com-noreply- @ Microsoft[.]com
With the spaces around the `@` symbol? We'll need to add support for that; it currently only supports spaces around the `[.]` symbols.
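A minimal sketch of what that support could look like, checked against the address from the report. The pattern name `SPACED_EMAIL_RE`, its character classes, and the `refang_email` helper are assumptions for illustration; this is not the fix that shipped.

```python
import re

# Hypothetical pattern allowing one optional space on each side of "@"
# and around each defanged dot.
SPACED_EMAIL_RE = re.compile(r"""
    [A-Za-z0-9_.+-]+        # local part
    \x20?@\x20?             # at sign with at most one space per side
    [A-Za-z0-9-]+           # first domain label
    (?:
        \x20?[\(\[]\x20?\.\x20?[\]\)]\x20?   # defanged dot
        [A-Za-z0-9-]+                        # following label
    )+
    """, re.VERBOSE)


def refang_email(defanged):
    """Hypothetical helper: drop the padding spaces and dot brackets."""
    no_brackets = re.sub(r"\x20?[\(\[]\x20?\.\x20?[\]\)]\x20?", ".", defanged)
    return re.sub(r"\x20?@\x20?", "@", no_brackets)


sample = "office-account-team-security-account.live.com-noreply- @ Microsoft[.]com"
found = SPACED_EMAIL_RE.search(sample).group(0)
refanged = refang_email(found)
```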
yeah, there are spaces around it.
Alright thanks for the report, I'll get a fix for that up asap.
Pushed this as 1.7.2. Let me know if you find anything else!
I noticed that if the URL was something like this: `hxxps://momorfheinz[.]usa[.]cc/login[.]microsoftonline[.]com` then refanging only fixed the netloc portion of the URL and left the path defanged. Also, I made a change to the email regex.
What do you think?
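As a rough illustration of the behaviour being asked for, the sketch below refangs dots in both the host and the path of that URL. It works on the raw string with `re.sub` rather than going through iocextract's `refang_url`, and the function name `refang_everywhere` is made up for this example.

```python
import re


def refang_everywhere(url):
    """Made-up helper: undo hxxp and bracketed dots across the whole URL."""
    url = re.sub(r"^hxxp", "http", url, flags=re.IGNORECASE)
    return re.sub(r"[\(\[]\.[\]\)]", ".", url)


defanged = "hxxps://momorfheinz[.]usa[.]cc/login[.]microsoftonline[.]com"
refanged = refang_everywhere(defanged)
```

The trade-off raised in the review applies here too: a path that legitimately contains `[.]` would be rewritten as well.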