elceef / dnstwist

Domain name permutation engine for detecting homograph phishing attacks, typo squatting, and brand impersonation
https://dnstwist.it
Apache License 2.0
4.85k stars 767 forks

ssdeep #147

Closed kevinpeters811 closed 2 years ago

kevinpeters811 commented 2 years ago

I cloned the main page of one of our web sites and put it on a server with a known permutation that I registered.

Since the main pages are the same, I expected ssdeep would identify them as very similar. But I am getting a zero ssdeep value.

Any ideas?

elceef commented 2 years ago

In that case you should get an ssdeep score close to 100%. What version of dnstwist are you using? By default, if only the original domain name is provided as input, dnstwist connects over http://. I guess the original website redirects such queries to https://. Are you serving the mirrored web page on both http and https? You can also run the tool with the --debug argument. It's a bit noisy, but you should be able to filter out HTTP connection-related issues.

kevinpeters811 commented 2 years ago

Thanks. I took a closer look and ssdeep does indeed return zero in your program. As a test, I grabbed your r.normalized_content for both the original and my clone, then called ssdeep manually to hash and compare them. I also got a zero, even though the content is virtually identical: test.txt

elceef commented 2 years ago

Both r.normalized_content values are very similar, but not identical. That explains the different ssdeep hashes, although a zero score is a bit surprising. Could you share the raw r.content too?

kevinpeters811 commented 2 years ago

Here is r.content.

Thanks for your help

test1.txt

elceef commented 2 years ago

The raw inputs score zero too. The main reason is that the inputs use different line-ending conventions (CRLF vs LF). Nonetheless, I think I can tune the content normalizer a bit and get a positive score. Stay tuned.
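To illustrate the line-ending point, here is a minimal Python sketch (not dnstwist's actual normalizer) that converts CRLF and bare CR to LF before fuzzy hashing. The names `normalize_line_endings` and `fuzzy_score` are hypothetical. The sketch assumes either the `ssdeep` package or the pure-Python `ppdeep` package is installed, and returns `None` when neither is available.

```python
try:
    import ssdeep  # bindings to the ssdeep/libfuzzy library
except ImportError:
    try:
        import ppdeep as ssdeep  # pure-Python fallback exposing hash() and compare()
    except ImportError:
        ssdeep = None


def normalize_line_endings(data: bytes) -> bytes:
    """Convert CRLF and bare CR line endings to LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def fuzzy_score(a: bytes, b: bytes):
    """Return the fuzzy-hash similarity (0-100), or None if no backend is installed."""
    if ssdeep is None:
        return None
    a = normalize_line_endings(a)
    b = normalize_line_endings(b)
    return ssdeep.compare(ssdeep.hash(a), ssdeep.hash(b))
```

Comparing `fuzzy_score(original, mirror)` with a direct comparison of the raw bytes should show how much of the difference comes from line endings alone.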

kevinpeters811 commented 2 years ago

Great. Thanks.

elceef commented 2 years ago

Pull the most recent version and try it. I'm getting an ssdeep score of 60%.

kevinpeters811 commented 2 years ago

I get 60% as well. Thank you very much for looking into this.

elceef commented 2 years ago

Thanks for bringing this up. Initially I got it to 43%, but then I added code that clears attribute values for certain HTML tags, which are usually modified when an offline snapshot (mirror) is made. I think the ssdeep feature should be more accurate now.
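The idea of clearing attribute values can be sketched roughly as follows. This is an illustration only, not the code that was committed to dnstwist: the function name `clear_attribute_values` and the choice of `href`, `src`, and `action` are assumptions about which attributes a mirroring tool tends to rewrite.

```python
import re

# Hypothetical attribute list; matches name="value" or name='value'.
_ATTR_RE = re.compile(
    rb'\b(href|src|action)\s*=\s*("[^"]*"|\'[^\']*\')',
    re.IGNORECASE,
)


def clear_attribute_values(html: bytes) -> bytes:
    """Collapse all whitespace runs and blank out selected attribute values."""
    collapsed = b" ".join(html.split())  # also removes CRLF vs LF differences
    return _ATTR_RE.sub(lambda m: m.group(1) + b'=""', collapsed)
```

With this sketch, `<a href="https://example.com/about">About</a>` and a mirrored `<a href='about.html'>About</a>` both reduce to `<a href="">About</a>`, so rewritten links no longer change the bytes being hashed.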

elceef commented 2 years ago

You might also want to look at the phash feature, which was introduced recently. In short, it renders web pages, takes screenshots, and compares them visually.
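As a rough illustration of comparing images by hash rather than byte for byte, here is a simple average hash in plain Python. It is a stand-in chosen for brevity and is not presented as dnstwist's implementation. It assumes the screenshot has already been rendered and reduced to a small grid of grayscale values, and the names `average_hash` and `similarity` are hypothetical.

```python
def average_hash(pixels):
    """Return a bit string: '1' where a pixel is brighter than the mean, else '0'.

    `pixels` is a small 2D grid of grayscale values, e.g. an 8x8 thumbnail.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return "".join("1" if value > mean else "0" for value in flat)


def similarity(hash_a, hash_b):
    """Percentage of matching bits between two equal-length hashes."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    matches = sum(a == b for a, b in zip(hash_a, hash_b))
    return 100 * matches // len(hash_a)
```

Because each bit only records whether a pixel is above or below the mean brightness, small rendering differences leave most bits unchanged, while a visually different page flips many of them.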

olifre commented 2 years ago

This is indeed a rather cool feature. I'm looking forward to using it in the next release :+1:. I believe this also adds new recommended dependencies. Will Chromedriver also be part of the dnstwist Docker container?

elceef commented 2 years ago

I haven't decided yet, but most likely it won't. I'd like to keep the Docker container as small as possible. Adding chromedriver and the web browser it depends on would make it several times heavier. For now I consider the pHash feature extra/experimental.

olifre commented 2 years ago

Good point. Keeping the container as small as possible is a good goal for most use cases, and chromedriver is indeed very bulky. Another option would be to add a second, "fatter" container with all the extra features, using the first container as its base and adding the extras as another layer. Of course, that increases the maintenance effort (and/or the automation effort). One advantage is that it would increase the number of potential testers.