MetaMask / eth-phishing-detect

Utility for detecting phishing domains targeting Web3 users

Where is the hosts file in this repo? #84201

Open StevenBlack opened 3 hours ago

StevenBlack commented 3 hours ago

Hello! 👋🏻

I'm confused about something.

My hosts repo uses the MetaMask/eth-phishing-detect list as a source. The update.json file presently looks like this:

{
  "name": "MetaMask eth-phishing-detect",
  "description": "Phishing domains targeting Ethereum users.",
  "homeurl": "https://github.com/MetaMask/eth-phishing-detect",
  "frequency": "frequent",
  "issues": "https://github.com/MetaMask/eth-phishing-detect/issues",
  "url": "https://raw.githubusercontent.com/MetaMask/eth-phishing-detect/master/src/hosts.txt",
  "license": "DON'T BE A DICK PUBLIC LICENSE"
}

Note that https://raw.githubusercontent.com/MetaMask/eth-phishing-detect/master/src/hosts.txt returns a hosts file. But this repo presently doesn't contain a hosts file.

I also notice that, at some point, a branch name change from master to main happened which, of course, silently breaks absolutely everything downstream.

I understand that my hosts repo is a derivative work, but what's downstream from me is gigantic, maybe even gargantuan.

I'd just like some clarification about where the hosts.txt file has gone, and whether MetaMask/eth-phishing-detect should still be distributed to the world downstream via my hosts project.

samczsun commented 2 hours ago

Hi, and thanks for reaching out! The hosts file hadn't been updated in three years, so we removed it, not realizing there were still active consumers.

The easiest way to integrate with this repo at this time is to parse config.json and read the list of hosts from the blacklist section. However, as you can see, this repository is now being synchronized with a separate data source hosted by SEAL-ISAC. If you'd rather establish a programmatic feed (or simply have the bot also submit merge requests to your hosts repo), we can explore that option as well.

StevenBlack commented 2 hours ago

Thanks for the reply @samczsun.

I'll pull from blacklist. Where is that? I must be blind; I can't find blacklist here.

BTW I think branch master should be removed from remote.

When you switch from master to main that's a bit woke, and I'm OK with that, but leaving master up on the remote kind of shows that nobody really thought much about the downstream implications.

Keeping master on the remote means downstream consumers just keep silently pulling stale data from master, with no errors. In this case, two years of your diligent work didn't actually reach everything that is downstream from me.

It's just weird to switch to main but leave master on remote.

StevenBlack commented 2 hours ago

@samczsun I found it, it's in https://raw.githubusercontent.com/MetaMask/eth-phishing-detect/refs/heads/main/src/config.json.
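A minimal sketch of pulling the list from that file and rendering it in hosts format. It assumes config.json is a JSON object whose "blacklist" key holds a flat list of domain strings, which is how the maintainer describes it; the function names and the 0.0.0.0 sink address are illustrative choices, not part of either repo.

```python
import json
from urllib.request import urlopen

# URL quoted in this thread. The "blacklist" key follows the maintainer's
# description; the rest of the schema is not assumed.
CONFIG_URL = (
    "https://raw.githubusercontent.com/MetaMask/eth-phishing-detect/"
    "refs/heads/main/src/config.json"
)


def extract_blacklist(config: dict) -> list[str]:
    """Return the unique, lowercased, sorted domains under "blacklist"."""
    entries = config.get("blacklist", [])
    return sorted(
        {d.strip().lower() for d in entries if isinstance(d, str) and d.strip()}
    )


def to_hosts_lines(domains: list[str], sink: str = "0.0.0.0") -> list[str]:
    """Render each domain as a hosts-file line pointing at a sink address."""
    return [f"{sink} {domain}" for domain in domains]


def fetch_config(url: str = CONFIG_URL) -> dict:
    """Download and parse config.json (requires network access)."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)
```

Calling `to_hosts_lines(extract_blacklist(fetch_config()))` would yield one hosts line per listed domain. Sorting on the consumer side keeps the derived file diff-friendly even though the upstream list is unsorted.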

samczsun commented 2 hours ago

I won't comment on motivations of the rename (partly because that was before I joined as a maintainer, partly because I don't care), but we did keep an action syncing from main -> master for a year before we turned that off 2 weeks ago.

The blacklist I'm referring to is in config.json.

Unfortunately, removing master is not something that I can do, but I can flag with the repository admins.

StevenBlack commented 2 hours ago

Thanks for the clarification @samczsun.

I see that blacklist is over 200,000 domains, unsorted.

I always wonder how people can actually curate an unsorted 200,000+ item list.

Has the MetaMask blacklist become an add-only bucket, now? Because that's typically what happens with long, unsorted lists of domains.

samczsun commented 2 hours ago

We expire domains manually at the moment, most recently 3 weeks ago, with plans to have domains fall out of the list automatically in the future.

It's unsorted because the diff to sort it would be immense and impossible to review, and while the bot automates 99% of contributions, people still occasionally open PRs into this repo.

StevenBlack commented 2 hours ago

@samczsun here's some info for you.

This is output from a little utility I'm developing.

The very last line, "Intersection: 386 domains", means that of your 204,000 domains, only 386 also appear in my amalgamated list of 114,700 domains. That feels extremely fishy.

The long lists here are the top 100 TLDs and the top 100 root domains in the MetaMask blacklist. The blacklist presently holds 12,911 gitbook.io subdomains, which seems wildly improbable to me. Maybe all this is helpful to your list maintainers?

Name:
Location: text input
Domains: 204,101
Duplicate domains: 0
Invalid domains: 4
TLD:
            com:   56,303
             io:   19,067
            dev:   15,724
            xyz:   14,490
            app:   13,815
            net:   12,381
            org:    8,229
            top:    4,858
            pro:    3,786
        network:    3,643
         online:    3,457
           site:    3,160
             cc:    2,911
           info:    2,747
           live:    2,038
             co:    1,910
        finance:    1,791
           tech:    1,436
         events:    1,344
         claims:    1,135
          trade:    1,037
          space:    1,002
            vip:      929
            icu:      820
        trading:      810
          cloud:      795
            fun:      761
           shop:      742
          click:      717
             me:      664
        website:      655
           club:      652
             fi:      650
            lol:      650
          store:      636
        support:      620
          world:      620
            one:      593
           link:      546
             in:      508
             us:      477
           life:      411
            cfd:      403
            biz:      394
        digital:      370
     foundation:      369
             pw:      368
             eu:      340
           gift:      337
           buzz:      317
       exchange:      314
            sbs:      304
            art:      298
             ru:      254
          homes:      253
           pics:      242
           land:      220
            ink:      217
           cash:      216
             br:      194
            ltd:      189
            wtf:      171
             su:      156
          quest:      151
            run:      146
           cyou:      143
          gifts:      139
            mom:      139
             uk:      137
          games:      124
           blog:      119
             de:      111
            lat:      109
          build:      108
           zone:      106
          codes:      100
           work:      100
            win:       98
             cn:       97
             id:       96
           news:       96
      community:       95
             to:       95
          today:       93
      financial:       92
             pm:       90
        capital:       89
             mx:       88
           fund:       85
         global:       82
          money:       78
           bond:       75
            bio:       71
             ws:       71
           guru:       70
             re:       70
             cx:       69
             cl:       66
             it:       66
             pl:       66
Root domains:
  pages.dev: 15,247
  gitbook.io: 12,911
  vercel.app: 3,935
  web.app: 2,115
  webflow.io: 1,474
  netlify.app: 652
  azurewebsites.net: 590
  github.io: 228
  com.br: 174
  drop-premint.com: 165
  glitch.me: 159
  nft-premints.xyz: 151
  drop-premint.xyz: 146
  dweb.link: 145
  on-fleek.app: 134
  whitelist-web3.com: 134
  free-limited.com: 126
  onrender.com: 116
  nft-whitelist.com: 103
  cf-ipfs.com: 100
  42web.io: 98
  mypinata.cloud: 92
  airdrop-whitelist.com: 88
  r2.dev: 75
  zeeve.online: 73
  firebaseapp.com: 67
  co.uk: 66
  blogspot.com: 59
  limited-drops.com: 59
  cprapid.com: 54
  fleek.co: 54
  com.co: 45
  pantheonsite.io: 45
  zeeve.net: 44
  co.za: 40
  workers.dev: 37
  co.in: 36
  us.com: 35
  weebly.com: 33
  web3-whitelist.com: 32
  b12sites.com: 31
  surge.sh: 31
  us.to: 30
  wordpress.com: 30
  duia.us: 29
  com.ng: 28
  typeform.com: 27
  4everland.app: 26
  co.ke: 26
  com.au: 26
  mooo.com: 26
  com.tr: 25
  netlify.com: 25
  bitballoon.com: 24
  com.mx: 24
  csb.app: 24
  launchpadex.com: 24
  metamask.cafe: 24
  zendesk.com: 24
  com.ar: 21
  godaddysites.com: 21
  line.pm: 21
  amazonaws.com: 20
  bsquarefli.co: 20
  com.de: 20
  hyperlockflnance.com: 20
  plesk.page: 20
  000webhostapp.com: 19
  bsquaredfii.net: 19
  fanasytops.net: 19
  fanlasytop.net: 19
  my.id: 19
  talkonet.com: 19
  astarnetworks.co: 18
  fantasytops.org: 18
  hyperiockfinance.com: 18
  pryzn.net: 18
  work.gd: 18
  bsquarenetwork.org: 17
  com.se: 17
  fanasytop.com: 17
  web3-l2.cfd: 17
  duckdns.org: 16
  pacmoonfl.net: 16
  artblucks.io: 15
  bsquaredfii.co: 15
  bsquaredfii.com: 15
  co.il: 15
  com.pl: 15
  crypto-list.info: 15
  in.net: 15
  pdma.live: 15
  registers-welikethefox.com: 15
  aerodromefinance.events: 14
  airdrop-tokens.com: 14
  bg-parite-received.fun: 14
  canva.site: 14
  com.pk: 14
  us.org: 14
  wuaze.com: 14

Intersection: 386 domains
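
The utility that produced this report isn't shown in the thread. As a rough sketch of how such a summary could be computed: the version below counts TLDs, counts "root domains" as the last two labels (an inference from entries like com.br and co.uk appearing as roots above, not a confirmed detail of the tool), and intersects the list with a second one. All names are hypothetical.

```python
from collections import Counter


def tld(domain: str) -> str:
    """Last label of the domain, e.g. 'io' for 'foo.gitbook.io'."""
    return domain.rsplit(".", 1)[-1]


def root_domain(domain: str) -> str:
    """Naive root: the last two labels, with no Public Suffix List lookup."""
    return ".".join(domain.split(".")[-2:])


def ranked(counter: Counter, top: int) -> list[tuple[str, int]]:
    """Sort by count descending, then by name, and keep the first `top`."""
    return sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


def summarize(domains, other, top: int = 100) -> dict:
    """Count TLDs and naive roots, and intersect with another domain list."""
    unique = {d.strip().lower() for d in domains if d.strip()}
    other_set = {d.strip().lower() for d in other if d.strip()}
    return {
        "domains": len(unique),
        "tld": ranked(Counter(tld(d) for d in unique), top),
        "roots": ranked(Counter(root_domain(d) for d in unique), top),
        "intersection": len(unique & other_set),
    }
```

A last-two-labels rule is cheap but lumps unrelated registrants together under suffixes such as co.uk; a Public Suffix List lookup would separate them.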

samczsun commented 2 hours ago

Yes, this is a list of domains intended to be consumed by applications targeting cryptocurrency users. The Gitbook entries come from a brand protection partner and are a little different from the typical entries you might find for drainers and other outright scams, but still fall under the remit of this repository. Ideally in the future we would better label the type of scam that the domain represents, but at the moment no one has time to implement such a breaking change.

StevenBlack commented 2 hours ago

Thanks for everything @samczsun. I'm going to drop MetaMask/eth-phishing-detect from distribution, principally because it's much too large now. Additionally, bot-propelled add-only buckets aren't what we do, as a matter of principle.

But please, ping me when the MetaMask/eth-phishing-detect blacklist comes under active management in the future. Good?

samczsun commented 1 hour ago

Sure. You'll be pleased to know that it is currently under active management, but it will likely remain add-only with manual expiry for the medium term. Unfortunately, given the volume of scams targeting cryptocurrency users, it's possible that the list will continue to remain too large for your use case, even after we implement automated expiration. For example, pruning anything older than a year leaves us with 175k entries, and we are likely unwilling to go any lower than that to start due to the possibility of a threat actor re-activating the domain once it's removed from the list.
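
For illustration only, age-based expiry of the kind described could look like the sketch below. The thread does not say how the date each domain was added is tracked, so the `added_on` mapping and the `prune` function are hypothetical.

```python
from datetime import date, timedelta


def prune(
    added_on: dict[str, date], today: date, max_age_days: int = 365
) -> dict[str, date]:
    """Keep only the domains added within the last `max_age_days` days."""
    cutoff = today - timedelta(days=max_age_days)
    return {domain: when for domain, when in added_on.items() if when >= cutoff}
```

Raising `max_age_days` trades list size against the risk, mentioned above, of a threat actor re-activating a domain once it has dropped off the list.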