InQuest / iocextract

Defanged Indicator of Compromise (IOC) Extractor.
https://inquest.readthedocs.io/projects/iocextract/
GNU General Public License v2.0

Various URL extraction issues #6

Closed rshipp closed 6 years ago

rshipp commented 6 years ago

Hold-all issue for invalid URLs I find that come through extraction.

http:// NOTICE
https://redacted.sf-api.eu/</BaseUrl
https://ln.sync[.]com/dl/f6772eb20/d8yt6kez-9q7eef3m-ai27ebms-8zcufi5f (Please
http://as rsafinderfirewall[.]com/Es3tC0deR3name.exe):
http://domain rsafinderfirewall[.]com
http://example,\xa0c0pywins.is-not-certified[.]com
webClient.DownloadString(‘https://a.pomf[.]cat/ntluca.txt
http://HtTP:\\193[.]29[.]187[.]49\qb.doc\u201d
http://tintuc[.]vietbaotinmoi[.]com\u201d
espn[.]com.\u201d
http://calendarortodox[.]ro/serstalkerskysbox.png”
tFtp://cFa.tFrFa
h\u2013p://dl[.]dropboxusercontent[.]com/s/rlqrbc1211quanl/accountinvoice.htm
hxxp://paclficinsight.com\xa0POST /new1/pony/gate.php
http://at\xa0redirect.turself-josented[.]com
KDFB.DownloadFile('hxxps://authenticrecordsonline[.]com/costman/dropcome.exe',
at\xa0hxxp://paclficinsight[.]com/new1/pony/china.jpg
hxxp://<redacted>/28022018/pz.zip.\xa0
hxxp:// 23.89.158.69/gtop
h00p://bigdeal.my/gH9BUAPd/js.js"\uff1e\uff1c/script\uff1e
hxxp://smilelikeyoumeanit2018[.]com[.]br/contact-server/,
hxxp:// feeds.rapidfeeds[.]com/88604/
hxxp://www.xxx.xxx.xxx.gr/1.txt\u2019
h00p://119
h00p://218.84
hxxp:// "www.hongcherng.com"/rd/rd
http://http%3a%2f%2f117%2e18%2e232%2e200%2f
http://http%3a%2f%2fgaytoday%2ecom%2f
h00p://http://turbonacho(.)com/ocsr.html"\uff1e

URLs with wildcard/regex:

https://.+\.unionbank\.com/
https://.*citizensbank\.com/
https://(www\.|)svbconnect\.com/
https://(bolb\-(west|east)|www)\.associatedbank\.com/
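One possible way to keep pattern-style entries like these out of the results is a post-extraction check on the host portion. This is a minimal sketch, not part of iocextract: the `looks_like_pattern` helper and its list of hints are assumptions chosen to cover the four examples above.

```python
import re

# Hypothetical heuristic: these sequences are common in regex patterns
# but unlikely in the host portion of a real (or defanged) URL.
_REGEX_HINTS = re.compile(r"\.\+|\.\*|\\\.|\(.*\|.*\)")


def looks_like_pattern(url: str) -> bool:
    """Return True if the host part of `url` contains regex-like syntax."""
    # Drop the scheme, then keep everything up to the first path slash.
    host = url.split("://", 1)[-1].split("/", 1)[0]
    return bool(_REGEX_HINTS.search(host))
```

Defangs such as `[.]` and `(.)` do not trigger the check, since none of the hints match them.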

Extracts part of the match as a second URL:

i[.]memenet[.]org/wfedgl[.]hta -> wfedgl[.]hta
http://196.29.164.27/ntc/ntcblock.html?dpid=1&dpruleid=3&cat=10&ttl=-200&groupname=Canar_staff&policyname=canar_staff_policy&username=[REDACTED]&userip=[REDACTED]&connectionip=127.0.0.1&nsphostname=NSPS01&protocol=policyprocessor&dplanguage=-&url=http%3a%2f%2fwww%2emonacogoldcasino%2ecom%2f” -> http%3a%2f%2fwww%2emonacogoldcasino%2ecom%2f
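A rough sketch of one way to suppress these partial matches, assuming each match is available with its start and end offsets in the source text. The `(start, end, text)` tuple shape and the `drop_nested` name are hypothetical and not iocextract's actual internals.

```python
from typing import List, Tuple

# (start offset, end offset, matched text); an assumed shape for illustration.
Span = Tuple[int, int, str]


def drop_nested(matches: List[Span]) -> List[Span]:
    """Keep only matches that do not sit inside a longer match's span."""
    kept = []
    for start, end, text in matches:
        nested = any(
            o_start <= start and end <= o_end and (o_end - o_start) > (end - start)
            for o_start, o_end, _ in matches
        )
        if not nested:
            kept.append((start, end, text))
    return kept
```

With this filter, `wfedgl[.]hta` would be dropped because its span lies inside the span of `i[.]memenet[.]org/wfedgl[.]hta`. Whether the percent-encoded `url=` parameter in the second example should be dropped or kept as a separate indicator is a policy decision this sketch does not settle.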
rshipp commented 6 years ago

Some Unicode issues; it looks like the regex needs to be tightened:

https://secure.comodo.net/CPS0CU<0:08�6�4�2http://crl.comodoca.com/COMODORSACodeSigningCA.crl0t+h0f0>+0�2http://crt.comodoca.com/COMODORSACodeSigningCA.crt0$+0�http://ocsp.comodoca.com0U0�info@all-media.site0
http://crl.comodoca.com/COMODORSACertificationAuthority.crl0q+e0c0;+0�/http://crt.comodoca.com/COMODORSAAddTrustCA.crt0$+0�http://ocsp.comodoca.com0
https://www.digicert.com/CPS0�d+0�V�RAny
http://crl3.digicert.com/DigiCertAssuredIDCA-1.crl08�6�4�2http://crl4.digicert.com/DigiCertAssuredIDCA-1.crl0w+k0i0$+0�http://ocsp.digicert.com0A+0�5http://cacerts.digicert.com/DigiCertAssuredIDCA-1.crt0
http://www.digicert.com/ssl-cps-repository.htm0�d+0�V�RAny
http://ocsp.digicert.com0C+0�7http://cacerts.digicert.com/DigiCertAssuredIDRootCA.crt0��Uz0x0:�8�6�4http://crl3.digicert.com/DigiCertAssuredIDRootCA.crl0:�8�6�4http://crl4.digicert.com/DigiCertAssuredIDRootCA.crl0U+����ߢ�W
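As an illustration of what "tightening" could mean here, the sketch below restricts a URL match to the printable ASCII characters that RFC 3986 allows in a URI, so that a replacement character or any other non-ASCII byte ends the match. The `ASCII_URL` pattern and the `ascii_urls` helper are assumptions for this example, not the library's regex.

```python
import re

# Assumption: only the unreserved, reserved, and percent characters from
# RFC 3986 may appear; anything else (including U+FFFD) ends the match.
ASCII_URL = re.compile(r"https?://[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+")


def ascii_urls(text: str) -> list:
    """Return all ASCII-only URL candidates found in `text`."""
    return ASCII_URL.findall(text)
```

This only partly helps with certificate data like the above: printable bytes that directly follow a URL (for example the `0` after `CPS`) still come through, so some trailing junk would remain.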
rshipp commented 6 years ago

Similar to "Extracts part of the match as a second URL" cases above:

Extracts as:

rshipp commented 6 years ago

Some more information on some of the bugs we're seeing here:

| Actual output | Expected output | Bug description |
| --- | --- | --- |
| http:// NOTICE | None | Not sure if we can fix this, it does match the regex. |
| https://redacted.sf-api.eu/</BaseUrl | https://redacted.sf-api.eu/ | See if we can get this working with the existing punctuation filter |
| https://ln.sync[.]com/dl/f6772eb20/d8yt6kez-9q7eef3m-ai27ebms-8zcufi5f (Please | https://ln.sync[.]com/dl/f6772eb20/d8yt6kez-9q7eef3m-ai27ebms-8zcufi5f | Extra cruft after the URL |
| http://as rsafinderfirewall[.]com/Es3tC0deR3name.exe): | http://rsafinderfirewall[.]com/Es3tC0deR3name.exe | Unicode space (\xa0) should end the URL; end punctuation not being stripped |
| http://domain rsafinderfirewall[.]com | http://rsafinderfirewall[.]com | Unicode space should end the URL |
| http://example,\xa0c0pywins.is-not-certified[.]com | http://c0pywins.is-not-certified[.]com | Unicode space should end the URL |
| webClient.DownloadString(‘https://a.pomf[.]cat/ntluca.txt | https://a.pomf[.]cat/ntluca.txt | Junk getting through the bracket regex before the prefix |
| http://HtTP:\\193[.]29[.]187[.]49\qb.doc\u201d | HtTP:\\193[.]29[.]187[.]49\qb.doc | Handle backslashes as a defang/refang; include Unicode quote as punctuation in regexes |
| http://tintuc[.]vietbaotinmoi[.]com\u201d | http://tintuc[.]vietbaotinmoi[.]com | Include Unicode quote as punctuation in regexes |
| espn[.]com.\u201d | | Include Unicode quote as punctuation in regexes |
| http://calendarortodox[.]ro/serstalkerskysbox.png” | | Include Unicode quote as punctuation in regexes |
| tFtp://cFa.tFrFa | ??? | No idea... investigate the source to see what this was supposed to be |
| h\u2013p://dl[.]dropboxusercontent[.]com/s/rlqrbc1211quanl/accountinvoice.htm | | This is actually correct, but the refang function needs to handle the Unicode en dash (\u2013). |
| hxxp://paclficinsight.com\xa0POST /new1/pony/gate.php | hxxp://paclficinsight.com | Just stop on the \xa0 Unicode space |
| http://at\xa0redirect.turself-josented[.]com | | |
| KDFB.DownloadFile('hxxps://authenticrecordsonline[.]com/costman/dropcome.exe', | | |
| at\xa0hxxp://paclficinsight[.]com/new1/pony/china.jpg | | |
| hxxp://<redacted>/28022018/pz.zip.\xa0 | hxxp://<redacted>/28022018/pz.zip | No way to recover the redacted part, unfortunately... just drop the \xa0 and pass the rest, even though this is useless as an IOC |
| hxxp:// 23.89.158.69/gtop | | Same \xa0 issue |
| h00p://bigdeal.my/gH9BUAPd/js.js"\uff1e\uff1c/script\uff1e | h00p://bigdeal.my/gH9BUAPd/js.js | More Unicode regex tightening |
| hxxp://smilelikeyoumeanit2018[.]com[.]br/contact-server/, | | Comma should be stripped |
| hxxp:// feeds.rapidfeeds[.]com/88604/ | | |
| hxxp://www.xxx.xxx.xxx.gr/1.txt\u2019 | | |
| h00p://119 | | Piece of an IP URL... should probably filter these out somehow; maybe this is solved by whatever solves the "Extracts part of the match as a second URL" cases |
| h00p://218.84 | | |
| hxxp:// "www.hongcherng.com"/rd/rd | | |
| http://http%3a%2f%2f117%2e18%2e232%2e200%2f | | Extra scheme for some reason... |
| http://http%3a%2f%2fgaytoday%2ecom%2f | | |
| h00p://http://turbonacho(.)com/ocsr.html"\uff1e | | Extra scheme and Unicode issues |
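Several of the rows above come down to two steps: end the URL at a Unicode space, then strip trailing punctuation, including Unicode quotes. The following is a minimal sketch of that idea, not the fix that was eventually merged. The `clean_candidate` name and the `TRAILING` character set are assumptions based only on the examples in this issue.

```python
# Characters treated as trailing junk: ASCII punctuation plus the Unicode
# quotes and fullwidth brackets seen in this issue (assumed, not exhaustive).
TRAILING = ".,:;)('\"\u2018\u2019\u201c\u201d\uff1c\uff1e"


def clean_candidate(raw: str) -> str:
    """Cut a raw match at the first Unicode whitespace, then strip trailing junk."""
    # str.split() with no arguments splits on any Unicode whitespace,
    # which includes the no-break space \xa0.
    parts = raw.split()
    head = parts[0] if parts else ""
    return head.rstrip(TRAILING)
```

This does not cover the cases where junk sits between the scheme and the host (for example `http://at\xa0redirect...`) or where markup follows a quote mid-string. Those would need the scheme handling or the regex itself to change.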
rshipp commented 6 years ago

This is the source of the cFa.tFrFa IOC: https://malware.news/t/technical-teardown-analysing-malspam-attack/11149. There's some obfuscation here that's beyond what we can handle as a defang. I think this one can be ignored. The real indicator is listed later in the post anyway.

DynaMc commented 6 years ago

Hey,

I can answer the question above. This isn't so much about the IOC itself (it is an IOC) as about obfuscation.

It's an obfuscated URL.

‘FhFtFtp://cFa.tFrFadeFlaFtFinosF.Fco/jFsF90F.FbinF?’ = http://ca.tradelatinos.co/js90.bin?

https://www.virustotal.com/#/url/01332b16ae9d3347a2bbffd1a9089542f11a0b02a94c44db62f020fb8ed490a8/details
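The mapping above can be reproduced by deleting the filler character. This is only a sketch for this one sample: the `strip_filler` helper is hypothetical, and the choice of `F` as filler comes from reading the sample by hand, not from any general rule.

```python
def strip_filler(obfuscated: str, filler: str = "F") -> str:
    """Remove every occurrence of the filler character from the string."""
    # Caveat: a legitimate 'F' in the real URL would be removed as well.
    return obfuscated.replace(filler, "")


print(strip_filler("FhFtFtp://cFa.tFrFadeFlaFtFinosF.Fco/jFsF90F.FbinF?"))
# prints: http://ca.tradelatinos.co/js90.bin?
```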

rshipp commented 6 years ago

Thanks :) Unfortunately the way we're getting this text, it's split up so that we can't regex out the full obfuscated URL:

‘iFlFe(‘FhFtFtp://cFa.tFrFa’ +

‘deFlaFtFinosF.Fco/jF’ +

On top of that, the every-other-character obfuscation is more complicated than the simple defangs this library was meant to cover, so there's no good way to parse it out. That said, the deobfuscated URL is also contained later in the same text, so we do parse that out correctly - we just get an extra false-positive URL coming through as tFtp://cFa.tFrFa that an analyst would have to manually remove/ignore. Not a big issue, just something I noticed while combing through some test data.

rshipp commented 6 years ago

Oh, to clarify, we're not looking at/extracting from the original file here, only the RSS feeds of a bunch of security blogs. That probably wasn't clear at all in the issue context.

DynaMc commented 6 years ago

No problem, and agreed: it appears to be outside the scope of the tool. Good job. I'm sure I'll use this in the future 😀.

As a side note, if you want some good regexes, check out the source code of CyberChef, GCHQ's tool. You have many covered already, though. I'll contribute where I can.

rshipp commented 6 years ago

Thanks for the tip!

CyberChef regexes for future reference: https://github.com/gchq/CyberChef/blob/master/src/core/operations/Extract.js. Their IPv6 regex seems more advanced than ours for sure.

rshipp commented 6 years ago

Closing via #24, which fixes most of the remaining bugs from this issue.