Closed rshipp closed 6 years ago

Hold-all issue for invalid URLs I find that come through extraction.

URLs with wildcard/regex:

Extracts part of the match as a second URL:
Some unicode issues; it looks like the regex needs tightening:
https://secure.comodo.net/CPS0CU<0:08�6�4�2http://crl.comodoca.com/COMODORSACodeSigningCA.crl0t+h0f0>+0�2http://crt.comodoca.com/COMODORSACodeSigningCA.crt0$+0�http://ocsp.comodoca.com0U0�info@all-media.site0
http://crl.comodoca.com/COMODORSACertificationAuthority.crl0q+e0c0;+0�/http://crt.comodoca.com/COMODORSAAddTrustCA.crt0$+0�http://ocsp.comodoca.com0
https://www.digicert.com/CPS0�d+0�V�RAny
http://crl3.digicert.com/DigiCertAssuredIDCA-1.crl08�6�4�2http://crl4.digicert.com/DigiCertAssuredIDCA-1.crl0w+k0i0$+0�http://ocsp.digicert.com0A+0�5http://cacerts.digicert.com/DigiCertAssuredIDCA-1.crt0
http://www.digicert.com/ssl-cps-repository.htm0�d+0�V�RAny
http://ocsp.digicert.com0C+0�7http://cacerts.digicert.com/DigiCertAssuredIDRootCA.crt0��Uz0x0:�8�6�4http://crl3.digicert.com/DigiCertAssuredIDRootCA.crl0:�8�6�4http://crl4.digicert.com/DigiCertAssuredIDRootCA.crl0U+����ߢ�W
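To show the kind of tightening that might help, here's a minimal sketch assuming a simplified URL pattern (URL_RE below is illustrative, not the library's actual regex): build the match from an explicit RFC 3986 character class so it stops at the first byte that can't legally appear in a URL, such as the U+FFFD replacement characters in the dumps above.

```python
import re

# Illustrative only -- a cut-down URL pattern built from an explicit
# RFC 3986 character class, so the match stops at the first byte that
# can't appear in a URL (e.g. the U+FFFD replacement character).
URL_RE = re.compile(r"https?://[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+")

text = (
    "https://secure.comodo.net/CPS0CU<0:08\ufffd6\ufffd4\ufffd2"
    "http://crl.comodoca.com/COMODORSACodeSigningCA.crl"
)
print(URL_RE.findall(text))
# ['https://secure.comodo.net/CPS0CU',
#  'http://crl.comodoca.com/COMODORSACodeSigningCA.crl']
# URL-safe certificate bytes like 'CPS0CU' still leak through, but the
# match can no longer run across the replacement characters.
```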
Similar to "Extracts part of the match as a second URL" cases above:
185.189.58[.]222
Extracts as:
http://58.222
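One hypothetical way to suppress this whole class of bug (a sketch, not the library's code): collect match spans from all the extraction regexes and drop any hit that falls entirely inside a longer one, so the tail of a defanged IP can't surface as a second URL. The ip_re and host_re patterns below are stand-ins for illustration.

```python
import re

def drop_contained(matches):
    """Drop any match whose span is fully contained inside a longer
    match, e.g. a URL extracted out of the middle of a defanged IP."""
    spans = sorted(matches, key=lambda m: (m.start(), -m.end()))
    kept, last_end = [], -1
    for m in spans:
        if m.end() <= last_end:
            continue  # fully inside a previously kept match
        kept.append(m)
        last_end = max(last_end, m.end())
    return kept

# Stand-ins for the library's defanged-IP and defanged-host regexes.
ip_re = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\[\.\]\d{1,3}")
host_re = re.compile(r"\d+\[\.\]\d+")

text = "185.189.58[.]222"
matches = list(ip_re.finditer(text)) + list(host_re.finditer(text))
print([m.group() for m in drop_contained(matches)])
# ['185.189.58[.]222'] -- the inner '58[.]222' hit is discarded.
```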
More information on some of the bugs we're seeing here (a rough cleanup sketch follows the table):
Actual output | Expected output | Bug description |
---|---|---|
http:// NOTICE | None | Not sure if we can fix this; it does match the regex. |
https://redacted.sf-api.eu/</BaseUrl | https://redacted.sf-api.eu/ | See if we can get this working with the existing punctuation filter |
https://ln.sync[.]com/dl/f6772eb20/d8yt6kez-9q7eef3m-ai27ebms-8zcufi5f (Please | https://ln.sync[.]com/dl/f6772eb20/d8yt6kez-9q7eef3m-ai27ebms-8zcufi5f | Extra cruft after the URL |
http://as rsafinderfirewall[.]com/Es3tC0deR3name.exe): | http://rsafinderfirewall[.]com/Es3tC0deR3name.exe | Unicode space (\xa0) should end the URL; end punctuation not being stripped |
http://domain rsafinderfirewall[.]com | http://rsafinderfirewall[.]com | Unicode space should end the URL |
http://example,\xa0c0pywins.is-not-certified[.]com | http://c0pywins.is-not-certified[.]com | Unicode space should end the URL |
webClient.DownloadString(‘https://a.pomf[.]cat/ntluca.txt | https://a.pomf[.]cat/ntluca.txt | Junk getting through the bracket regex before the prefix |
http://HtTP:\\193[.]29[.]187[.]49\qb.doc\u201d | HtTP:\\193[.]29[.]187[.]49\qb.doc | Handle backslashes as a defang/refang; include unicode quote as punctuation in regexes |
http://tintuc[.]vietbaotinmoi[.]com\u201d | http://tintuc[.]vietbaotinmoi[.]com | Include unicode quote as punctuation in regexes |
espn[.]com.\u201d | | Include unicode quote as punctuation in regexes |
http://calendarortodox[.]ro/serstalkerskysbox.png” | | Include unicode quote as punctuation in regexes |
tFtp://cFa.tFrFa | ??? | No idea... investigate the source to see what this was supposed to be |
h\u2013p://dl[.]dropboxusercontent[.]com/s/rlqrbc1211quanl/accountinvoice.htm | | This is actually correct, but the refang function needs to handle the unicode en-dash (\u2013) |
hxxp://paclficinsight.com\xa0POST /new1/pony/gate.php | hxxp://paclficinsight.com | Just stop on the \xa0 unicode space |
http://at\xa0redirect.turself-josented[.]com | | |
KDFB.DownloadFile('hxxps://authenticrecordsonline[.]com/costman/dropcome.exe', | | |
at\xa0hxxp://paclficinsight[.]com/new1/pony/china.jpg | | |
hxxp://<redacted>/28022018/pz.zip.\xa0 | hxxp://<redacted>/28022018/pz.zip | No way to recover the redacted part, unfortunately... just drop the \xa0 and pass the rest, even though this is useless as an IOC |
hxxp:// 23.89.158.69/gtop | | Same \xa0 issue |
h00p://bigdeal.my/gH9BUAPd/js.js"\uff1e\uff1c/script\uff1e | h00p://bigdeal.my/gH9BUAPd/js.js | More unicode regex tightening |
hxxp://smilelikeyoumeanit2018[.]com[.]br/contact-server/, | | Comma should be stripped |
hxxp:// feeds.rapidfeeds[.]com/88604/ | | |
hxxp://www.xxx.xxx.xxx.gr/1.txt\u2019 | | |
h00p://119 | | Piece of an IP URL... should probably filter these out somehow; maybe this is solved by whatever solves the "Extracts part of the match as a second URL" cases |
h00p://218.84 | | |
hxxp:// "www.hongcherng.com"/rd/rd | | |
http://http%3a%2f%2f117%2e18%2e232%2e200%2f | | Extra scheme for some reason... |
http://http%3a%2f%2fgaytoday%2ecom%2f | | |
h00p://http://turbonacho(.)com/ocsr.html"\uff1e | | Extra scheme and unicode issues |
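To tie several of these rows together, here's a rough sketch of the post-processing the table suggests: treat unicode spaces as URL terminators, strip trailing unicode punctuation, collapse the doubled percent-encoded scheme, and refang the en-dash and backslash variants. clean_url and refang are hypothetical helpers for illustration, not the library's actual functions.

```python
from urllib.parse import unquote

# Trailing junk seen in the table: ASCII punctuation plus smart quotes
# (\u2018, \u2019, \u201c, \u201d) and fullwidth angle brackets
# (\uff1c, \uff1e).
TRAILING_PUNCT = ".,;:!?)(\"'\u2018\u2019\u201c\u201d\uff1c\uff1e"

def clean_url(raw):
    # str.split() treats \xa0 and other unicode spaces as whitespace,
    # so the URL ends at the first (unicode) space.
    url = raw.split()[0]
    url = url.rstrip(TRAILING_PUNCT)
    # Collapse a doubled, percent-encoded scheme such as
    # http://http%3a%2f%2f... into a single decoded URL.
    _, sep, rest = url.partition("://")
    if sep and unquote(rest).lower().startswith(("http://", "https://")):
        url = unquote(rest)
    return url

def refang(url):
    # The refang additions the table suggests: a unicode en-dash in the
    # scheme (h\u2013p://) and backslashes used as separators.
    url = url.replace("h\u2013p://", "http://")
    return url.replace(":\\\\", "://").replace("\\", "/")

print(clean_url("hxxp://paclficinsight.com\xa0POST /new1/pony/gate.php"))
# hxxp://paclficinsight.com
print(clean_url("http://tintuc[.]vietbaotinmoi[.]com\u201d"))
# http://tintuc[.]vietbaotinmoi[.]com
print(clean_url("http://http%3a%2f%2f117%2e18%2e232%2e200%2f"))
# http://117.18.232.200/
print(refang("HtTP:\\\\193[.]29[.]187[.]49\\qb.doc"))
# HtTP://193[.]29[.]187[.]49/qb.doc
```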
This is the source of the cFa.tFrFa IOC: https://malware.news/t/technical-teardown-analysing-malspam-attack/11149. There's some obfuscation here that's beyond what we can handle as a defang. I think this one can be ignored; the real indicator is listed later in the post anyway.
Hey,
I can answer the question above. This isn't really about the IOC itself (it is an IOC) so much as about the obfuscation: it's an obfuscated URL.
‘FhFtFtp://cFa.tFrFadeFlaFtFinosF.Fco/jFsF90F.FbinF?’
= http://ca.tradelatinos.co/js90.bin?
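For what it's worth, given the complete obfuscated string, the trick is just interleaved capital 'F' filler characters, so a one-off cleanup is a single replace. This is illustrative only; it works here only because the real URL contains no capital F.

```python
obfuscated = "FhFtFtp://cFa.tFrFadeFlaFtFinosF.Fco/jFsF90F.FbinF?"
# The interleaved capital 'F' characters are filler; removing them
# recovers the real URL.
print(obfuscated.replace("F", ""))
# http://ca.tradelatinos.co/js90.bin?
```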
Thanks :) Unfortunately, the way we're getting this text, it's split up so that we can't regex out the full obfuscated URL:
‘iFlFe(‘FhFtFtp://cFa.tFrFa’ +
‘deFlaFtFinosF.Fco/jF’ +
On top of that, the every-other-character obfuscation is more complicated than the simple defangs this library was meant to cover, so there's no good way to parse it out. That said, the deobfuscated URL is also contained later in the same text, so we do parse that out correctly; we just get an extra false-positive URL coming through as tFtp://cFa.tFrFa that an analyst would have to manually remove/ignore. Not a big issue, just something I noticed while combing through some test data.
Oh, to clarify, we're not looking at/extracting from the original file here, only the RSS feeds of a bunch of security blogs. That probably wasn't clear at all in the issue context.
No problem, and agreed, it appears to be outside the scope of the tool. Good job; I'm sure I'll use this in the future 😀.
As a side note: if you want some good regexes, check out the source code of CyberChef, GCHQ's tool. You have many covered already, though. I'll contribute where I can.
Thanks for the tip!
CyberChef regex for future reference: https://github.com/gchq/CyberChef/blob/master/src/core/operations/Extract.js. The IPv6 one seems more advanced than ours, for sure.
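For IPv6 in particular, one pragmatic alternative to a monster regex (a sketch, not CyberChef's or this library's approach): match loose hex-and-colon candidates and let the standard library's ipaddress module decide which are valid.

```python
import ipaddress
import re

# Loose candidate pattern: at least two colon-separated hex groups.
# Validation is delegated to the ipaddress module, so the regex doesn't
# have to encode all of the RFC 4291 compression rules itself.
CANDIDATE = re.compile(r"(?:[0-9A-Fa-f]{0,4}:){2,}[0-9A-Fa-f]{0,4}")

def extract_ipv6(text):
    for m in CANDIDATE.finditer(text):
        try:
            yield str(ipaddress.IPv6Address(m.group()))
        except ValueError:
            continue  # candidate looked IPv6-ish but wasn't valid

print(list(extract_ipv6("connects to 2001:db8::1 on port 443")))
# ['2001:db8::1']
```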
Closing via #24, which fixes most of the remaining bugs from this issue.