Various URL extraction issues

rshipp commented 6 years ago

Hold-all issue for invalid URLs I find that come through extraction.

http:// NOTICE
https://redacted.sf-api.eu/</BaseUrl
https://ln.sync[.]com/dl/f6772eb20/d8yt6kez-9q7eef3m-ai27ebms-8zcufi5f (Please
http://as rsafinderfirewall[.]com/Es3tC0deR3name.exe):
http://domain rsafinderfirewall[.]com
http://example,\xa0c0pywins.is-not-certified[.]com
webClient.DownloadString(‘https://a.pomf[.]cat/ntluca.txt
http://HtTP:\\193[.]29[.]187[.]49\qb.doc\u201d
http://tintuc[.]vietbaotinmoi[.]com\u201d
espn[.]com.\u201d
http://calendarortodox[.]ro/serstalkerskysbox.png”
tFtp://cFa.tFrFa
h\u2013p://dl[.]dropboxusercontent[.]com/s/rlqrbc1211quanl/accountinvoice.htm
hxxp://paclficinsight.com\xa0POST /new1/pony/gate.php
http://at\xa0redirect.turself-josented[.]com
KDFB.DownloadFile('hxxps://authenticrecordsonline[.]com/costman/dropcome.exe',
at\xa0hxxp://paclficinsight[.]com/new1/pony/china.jpg
hxxp://<redacted>/28022018/pz.zip.\xa0
hxxp:// 23.89.158.69/gtop
h00p://bigdeal.my/gH9BUAPd/js.js"\uff1e\uff1c/script\uff1e
hxxp://smilelikeyoumeanit2018[.]com[.]br/contact-server/,
hxxp:// feeds.rapidfeeds[.]com/88604/
hxxp://www.xxx.xxx.xxx.gr/1.txt\u2019
h00p://119
h00p://218.84
hxxp:// "www.hongcherng.com"/rd/rd
http://http%3a%2f%2f117%2e18%2e232%2e200%2f
http://http%3a%2f%2fgaytoday%2ecom%2f
h00p://http://turbonacho(.)com/ocsr.html"\uff1e

URLs with wildcard/regex:

https://.+\.unionbank\.com/
https://.*citizensbank\.com/
https://(www\.|)svbconnect\.com/
https://(bolb\-(west|east)|www)\.associatedbank\.com/
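
One possible way to keep entries like these out of the results is a heuristic check for regex syntax on each candidate. The sketch below is only an illustration: `looks_like_regex` and `_REGEX_HINTS` are hypothetical names, not part of iocextract, and the chosen hints are an assumption based on the four examples above.

```python
import re

# Hypothetical heuristic, not part of iocextract: hints that a string is a
# regex pattern rather than a literal (possibly defanged) URL.
_REGEX_HINTS = re.compile(
    r"\\[.\-]"      # escaped dot or hyphen, e.g. unionbank\.com
    r"|\.[+*]"      # wildcard quantifiers such as .+ or .*
    r"|\([^)]*\|"   # alternation inside a group, e.g. (www\.|)
)


def looks_like_regex(candidate: str) -> bool:
    """Return True if the candidate contains any of the regex hints."""
    return _REGEX_HINTS.search(candidate) is not None


if __name__ == "__main__":
    print(looks_like_regex(r"https://.*citizensbank\.com/"))
    print(looks_like_regex("http://tintuc[.]vietbaotinmoi[.]com"))
```

Defangs such as `[.]` and `(.)` do not trigger any of the three hints, so ordinary defanged URLs should still pass through.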

Extracts part of the match as a second URL:

i[.]memenet[.]org/wfedgl[.]hta -> wfedgl[.]hta
http://196.29.164.27/ntc/ntcblock.html?dpid=1&dpruleid=3&cat=10&ttl=-200&groupname=Canar_staff&policyname=canar_staff_policy&username=[REDACTED]&userip=[REDACTED]&connectionip=127.0.0.1&nsphostname=NSPS01&protocol=policyprocessor&dplanguage=-&url=http%3a%2f%2fwww%2emonacogoldcasino%2ecom%2f” -> http%3a%2f%2fwww%2emonacogoldcasino%2ecom%2f

rshipp commented 6 years ago

Some Unicode issues; looks like the regex needs to be tightened:

https://secure.comodo.net/CPS0CU<0:08�6�4�2http://crl.comodoca.com/COMODORSACodeSigningCA.crl0t+h0f0>+0�2http://crt.comodoca.com/COMODORSACodeSigningCA.crt0$+0�http://ocsp.comodoca.com0U0�info@all-media.site0
http://crl.comodoca.com/COMODORSACertificationAuthority.crl0q+e0c0;+0�/http://crt.comodoca.com/COMODORSAAddTrustCA.crt0$+0�http://ocsp.comodoca.com0
https://www.digicert.com/CPS0�d+0�V�RAny
http://crl3.digicert.com/DigiCertAssuredIDCA-1.crl08�6�4�2http://crl4.digicert.com/DigiCertAssuredIDCA-1.crl0w+k0i0$+0�http://ocsp.digicert.com0A+0�5http://cacerts.digicert.com/DigiCertAssuredIDCA-1.crt0
http://www.digicert.com/ssl-cps-repository.htm0�d+0�V�RAny
http://ocsp.digicert.com0C+0�7http://cacerts.digicert.com/DigiCertAssuredIDRootCA.crt0��Uz0x0:�8�6�4http://crl3.digicert.com/DigiCertAssuredIDRootCA.crl0:�8�6�4http://crl4.digicert.com/DigiCertAssuredIDRootCA.crl0U+����ߢ�W

rshipp commented 6 years ago

Similar to "Extracts part of the match as a second URL" cases above:

185.189.58[.]222

Extracts as:

http://58.222
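
A rough sketch of one way the "partial match" cases could be suppressed: collect the `(start, end)` span of every match and drop any span that lies entirely inside another. `drop_nested_matches` and the two patterns in the demo are hypothetical and are not iocextract's actual regexes.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]


def drop_nested_matches(spans: List[Span]) -> List[Span]:
    """Keep only spans that are not fully contained in another kept span."""
    kept: List[Span] = []
    # Sort by start ascending, then by length descending, so that any
    # containing span is visited before the spans it contains.
    for start, end in sorted(spans, key=lambda s: (s[0], -(s[1] - s[0]))):
        if any(k_start <= start and end <= k_end for k_start, k_end in kept):
            continue
        kept.append((start, end))
    return kept


if __name__ == "__main__":
    text = "see 185.189.58[.]222 for details"
    # Illustrative patterns only: one matches the whole defanged IP,
    # the other matches just the fragment that was extracted separately.
    full_ip = re.search(r"\d+\.\d+\.\d+\[\.\]\d+", text).span()
    fragment = re.search(r"58\[\.\]222", text).span()
    for start, end in drop_nested_matches([fragment, full_ip]):
        print(text[start:end])
```

With this filter only the full `185.189.58[.]222` span survives, so the `58[.]222` fragment would never be refanged into a URL of its own.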

rshipp commented 6 years ago

Some more information on some of the bugs we're seeing here:

Actual output	Expected output	Bug description
`http:// NOTICE`	None	Not sure if we can fix this, it does match the regex.
`https://redacted.sf-api.eu/</BaseUrl`	`https://redacted.sf-api.eu/`	See if we can get this working with the existing punctuation filter
`https://ln.sync[.]com/dl/f6772eb20/d8yt6kez-9q7eef3m-ai27ebms-8zcufi5f (Please`	`https://ln.sync[.]com/dl/f6772eb20/d8yt6kez-9q7eef3m-ai27ebms-8zcufi5f`	Extra cruft after the URL
`http://as rsafinderfirewall[.]com/Es3tC0deR3name.exe):`	`http://rsafinderfirewall[.]com/Es3tC0deR3name.exe`	Unicode space (\xa0) should end the URL; end punctuation not being stripped
`http://domain rsafinderfirewall[.]com`	`http://rsafinderfirewall[.]com`	Unicode space should end the URL
`http://example,\xa0c0pywins.is-not-certified[.]com`	`http://c0pywins.is-not-certified[.]com`	Unicode space should end the URL
`webClient.DownloadString(‘https://a.pomf[.]cat/ntluca.txt`	`https://a.pomf[.]cat/ntluca.txt`	Junk getting through the bracket regex before the prefix
`http://HtTP:\\193[.]29[.]187[.]49\qb.doc\u201d`	`HtTP:\\193[.]29[.]187[.]49\qb.doc`	Handle backslashes as a defang/refang; include unicode quote as punctuation in regexes
`http://tintuc[.]vietbaotinmoi[.]com\u201d`	`http://tintuc[.]vietbaotinmoi[.]com`	include unicode quote as punctuation in regexes
`espn[.]com.\u201d`		include unicode quote as punctuation in regexes
`http://calendarortodox[.]ro/serstalkerskysbox.png”`		include unicode quote as punctuation in regexes
`tFtp://cFa.tFrFa`	???	No idea... investigate the source to see what this was supposed to be
`h\u2013p://dl[.]dropboxusercontent[.]com/s/rlqrbc1211quanl/accountinvoice.htm`		This is actually correct, but the refang function needs to handle the Unicode en dash (\u2013).
`hxxp://paclficinsight.com\xa0POST /new1/pony/gate.php`	`hxxp://paclficinsight.com`	Just stop on the \xa0 unicode space
`http://at\xa0redirect.turself-josented[.]com`
`KDFB.DownloadFile('hxxps://authenticrecordsonline[.]com/costman/dropcome.exe',`
`at\xa0hxxp://paclficinsight[.]com/new1/pony/china.jpg`
`hxxp://<redacted>/28022018/pz.zip.\xa0`	`hxxp://<redacted>/28022018/pz.zip`	No way to recover the redacted unfortunately... just drop the \xa0 and pass the rest even though this is useless as an IOC
`hxxp:// 23.89.158.69/gtop`		Same \xa0 issue
`h00p://bigdeal.my/gH9BUAPd/js.js"\uff1e\uff1c/script\uff1e`	`h00p://bigdeal.my/gH9BUAPd/js.js`	More unicode regex tightening
`hxxp://smilelikeyoumeanit2018[.]com[.]br/contact-server/,`		Comma should be stripped
`hxxp:// feeds.rapidfeeds[.]com/88604/`
`hxxp://www.xxx.xxx.xxx.gr/1.txt\u2019`
`h00p://119`		Piece of an IP URL... should probably filter these out somehow, maybe this is solved by whatever solves the "Extracts part of the match as a second URL" cases
`h00p://218.84`
`hxxp:// "www.hongcherng.com"/rd/rd`
`http://http%3a%2f%2f117%2e18%2e232%2e200%2f`		Extra scheme for some reason...
`http://http%3a%2f%2fgaytoday%2ecom%2f`
`h00p://http://turbonacho(.)com/ocsr.html"\uff1e`		Extra scheme and unicode issues
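
Several rows above come down to two rules: end the URL at a Unicode space or quote, and strip trailing punctuation. A minimal sketch of that post-processing step follows, assuming it runs on each raw match. `clean_url_candidate`, `TERMINATORS`, and `TRAILING_PUNCT` are hypothetical names and character sets, not the library's own, and the sketch does not cover the rows with junk before the scheme or with a space right after it.

```python
# Assumed character sets; both are illustrative and would need tuning.
TERMINATORS = '"\u201c\u201d\u2018\u2019\uff1c\uff1e'  # quotes, fullwidth < >
TRAILING_PUNCT = ".,:;!?)'"


def clean_url_candidate(candidate: str) -> str:
    """Cut at the first whitespace/terminator, then strip end punctuation."""
    for index, char in enumerate(candidate):
        # str.isspace() is True for the no-break space U+00A0 as well.
        if char.isspace() or char in TERMINATORS:
            candidate = candidate[:index]
            break
    return candidate.rstrip(TRAILING_PUNCT)


if __name__ == "__main__":
    samples = [
        "hxxp://paclficinsight.com\xa0POST /new1/pony/gate.php",
        "http://tintuc[.]vietbaotinmoi[.]com\u201d",
        "hxxp://smilelikeyoumeanit2018[.]com[.]br/contact-server/,",
    ]
    for sample in samples:
        print(clean_url_candidate(sample))
```

ASCII `<` is deliberately left out of the terminator set here so that `hxxp://<redacted>/...` survives, which means the `</BaseUrl` row would still need separate handling.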

rshipp commented 6 years ago

This is the source of the cFa.tFrFa ioc: https://malware.news/t/technical-teardown-analysing-malspam-attack/11149. There's some obfuscation here that's beyond what we can handle as a defang. I think this one can be ignored. The real indicator is listed later in the post anyway.

DynaMc commented 6 years ago

Hey,

I can answer the question above. This is less about the IOC itself (though it is an IOC) and more about obfuscation.

It's an obfuscated url.

‘FhFtFtp://cFa.tFrFadeFlaFtFinosF.Fco/jFsF90F.FbinF?’ = http://ca.tradelatinos.co/js90.bin?

https://www.virustotal.com/#/url/01332b16ae9d3347a2bbffd1a9089542f11a0b02a94c44db62f020fb8ed490a8/details
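
For this particular sample, the mapping above can be reproduced by deleting every uppercase `F`. The sketch below assumes the filler is always `F` and that the real URL contains no `F` of its own. `strip_filler` is a hypothetical helper for illustration, not something the library provides.

```python
def strip_filler(obfuscated: str, filler: str = "F") -> str:
    """Remove every occurrence of the filler character from the string."""
    return obfuscated.replace(filler, "")


if __name__ == "__main__":
    sample = "FhFtFtp://cFa.tFrFadeFlaFtFinosF.Fco/jFsF90F.FbinF?"
    print(strip_filler(sample))  # prints http://ca.tradelatinos.co/js90.bin?
```

A URL that legitimately contains the filler character would be damaged by this, so it only works when the filler is known in advance.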

rshipp commented 6 years ago

Thanks :) Unfortunately the way we're getting this text, it's split up so that we can't regex out the full obfuscated URL:

‘iFlFe(‘FhFtFtp://cFa.tFrFa’ +

‘deFlaFtFinosF.Fco/jF’ +

On top of that, the every-other-character obfuscation is more complicated than the simple defangs this library was meant to cover, so there's no good way to parse it out. That said, the deobfuscated URL is also contained later in the same text, so we do parse that out correctly - we just get an extra false-positive URL coming through as tFtp://cFa.tFrFa that an analyst would have to manually remove/ignore. Not a big issue, just something I noticed while combing through some test data.

rshipp commented 6 years ago

Oh, to clarify, we're not looking at/extracting from the original file here, only the RSS feeds of a bunch of security blogs. That probably wasn't clear at all in the issue context.

DynaMc commented 6 years ago

No problem, and agreed: it appears to be outside the scope of the tool. Good job, I'm sure I'll use this in the future 😀.

As a side note, if you want some good regexes, check out the source code of CyberChef, GCHQ's tool. You have many covered already, though. I'll contribute where I can.

rshipp commented 6 years ago

Thanks for the tip!

CyberChef regex for future reference: https://github.com/gchq/CyberChef/blob/master/src/core/operations/Extract.js. The IPv6 seems more advanced than ours for sure.

rshipp commented 6 years ago

Closing via #24, which fixes most of the remaining bugs from this issue.

InQuest / iocextract

Various URL extraction issues #6