TristanBilot / phishGNN

Phishing detection using GNNs

Error: missing domain for <...> #5

Closed: BlockMageSec closed this issue 1 year ago

BlockMageSec commented 1 year ago

Hi, first off, thanks for publishing this! I am very eager to begin training using some web3-specific datasets we are curating, and we have a large number of confirmed phishing domains to train with (37,000+ to start, plus roughly 200/day depending on the week).

I am having one issue with the extract command, however. I should say that I am not experienced with Rust, so although I tried to adjust the code in main.rs myself, essentially by guessing, I have not been able to resolve the issue.

It appears to get hung up on an entry where both documents exist, but the entry in either the pages collection or the domains collection is written with a trailing slash while the other is not, i.e. https://example.com/ in one collection is written as https://example.com in the other.

I just checked the entries I fed into the crawler from a file and found the domain it was erroring on. In my input list of domains it is not written with a trailing slash, so I assume the slash was appended to the document entry in MongoDB at some point during either the core or domain tasks.

Lastly, I ran this another time with the same input into a second database just to test, and I got the same error, but on a different domain this time, with all other variables as I have described. I imagine that must be due to some sites being live or offline at either time, thus producing different results.
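For what it's worth, here is a rough sketch of how I can compare the two collections to find pages without a matching domains entry; the database, collection and field names here are my guesses rather than anything taken from the crawler source:

```python
# Diagnostic sketch: list pages whose host has no matching domains entry.
# The database name ("phishing") and field names ("url", "domain") are guesses.
from urllib.parse import urlparse
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["phishing"]

known_domains = {doc.get("domain") for doc in db.domains.find({}, {"domain": 1})}

for page in db.pages.find({}, {"url": 1}):
    host = urlparse(page.get("url", "")).netloc
    if host and host not in known_domains:
        print(f"missing domain for {page['url']} (host: {host})")
```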

Happy to provide any further information as requested.

71 commented 1 year ago

Indeed, this sounds strange(-ly nondeterministic). I'll look into it in more detail, but in the meantime thank you for the bug report and for the PR!

71 commented 1 year ago

For reference, the domains table should always have simple host strings, so no scheme like http:// and no path (including the leading / character). For pages, it depends on the redirections and, I suppose, the first link we saw pointing to that page. I still don't understand how this can impact the extract command, nor how a domain entry may be missing for a particular page, though...

BlockMageSec commented 1 year ago

> For reference, the domains table should always have simple host strings, so no scheme like http:// and no path (including the leading / character). For pages, it depends on the redirections and, I suppose, the first link we saw pointing to that page. I still don't understand how this can impact the extract command, nor how a domain entry may be missing for a particular page, though...

I am happy to show where it happened, if that helps; the source of this particular list is public, from @MetaMask/eth-phishing-detect; I simply converted the items from the blocklist into a single-line, whitespace-delimited input.txt file.

However, I will note that I was expecting the same behavior, that is, that the domains should require no scheme or path. I recall there being some fuss when I attempted to load these without http/s, so I wound up cleaning the list and adding the prefixes.

I am going to run through this process again soon (today, I hope, time permitting) and I can update with any further information.

BlockMageSec commented 1 year ago

While this may not be the appropriate place for the suggestion and comments that follow, I believe it is important to the overall process of ingesting new domain entries and serves as general feedback; and there is nowhere more applicable for the remainder, so hopefully this is not a problem 🙂

The suggestion is simply to allow for an alternative means of adding new domains to the queue, based on my experience with bulk entries. For example, processing a JSON document containing 37,000+ domains into the single-line, whitespace-delimited text format required for intake can be quite cumbersome; a rough sketch of the conversion I ended up doing is below.
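This sketch assumes the blocklist JSON keeps its domains in a "blacklist" array of bare host names (as in eth-phishing-detect's config.json) and that each entry needs an https:// prefix before intake; adjust for your own list.

```python
# Rough sketch: flatten a JSON blocklist into the single-line,
# whitespace-delimited input.txt fed to the crawler.
# The "blacklist" key and the https:// prefix are assumptions.
import json

with open("config.json") as f:
    config = json.load(f)

urls = [f"https://{domain}" for domain in config.get("blacklist", [])]

with open("input.txt", "w") as f:
    f.write(" ".join(urls))
```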


I would also like to extend an invitation to collaborate further, if interested. Let me explain.

Our team consists of a handful of developers and cybersecurity professionals. While I am not an expert in ML, and while I love code, I don't consider myself a developer either (but I'm not so bad 😅). The bulk of our work is carried out as a volunteer effort. We are working hard to proactively minimize threats and diminish fraudulent activity in Web3 (cryptocurrency communities).

We hope to modify some of the crawler behavior, but we could use some guidance/assistance. Naturally, we also intend to share our trained models publicly once they are adequately tested. However, our primary detections are kept private because of the advantage they would give threat actors in their attempts to evade discovery.

We are constantly discovering new phish, so we are considering ways to stream in these detections, even if they are just submitted manually via an HTTP POST request or something simple like a Slack command. We are also thinking about how we might parse a JSON or YAML file with allowlist/blocklist entries to be fed to the queue, as this would be significantly more efficient and would let us streamline the process alongside our other activities.

If you're interested, let's talk more via Slack or email. Feel free to reach out to adam@blockmage.dev. Otherwise, thank you & all involved for your hard work on this framework. It's the only one we've come across that can reasonably be used on the type of phish we're hunting.

71 commented 1 year ago

Ah, there is some confusion. The domains collection (which is used for the lookup in the code you modified) has domain strings with no slashes or scheme. The pages collection has full URLs.

When you use the add command, you're adding entries to the pages collection, and when these entries are first processed, corresponding entries in the domains collection will be created. Since the pages collection expects full URLs, you do need the scheme and path components (and the worker will strip them before creating the domains entry).
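To make the distinction concrete, here is a rough illustration in Python rather than the actual Rust worker code, so the exact stripping rules are an approximation:

```python
# Approximate illustration of the two collections' key formats
# (the real stripping happens in the Rust worker, not in this snippet).
from urllib.parse import urlparse

page_url = "https://example.com/login/"   # pages collection: full URL
domain_key = urlparse(page_url).netloc    # domains collection: host only
print(domain_key)                         # -> "example.com"
```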


The add command is very much a helper for adding domains; since the backing storage is a MongoDB database, operations on the data (e.g. insertions, fixing invalid entries following crashes) are expected to be performed directly there. All you need to do is to insert a document with the url, depth and is_phishing fields, and it will be scheduled for processing (more precisely, the lack of further fields will allow the core worker to process it). The insertion logic is as follows (excluding handling of already existing entries): https://github.com/TristanBilot/phishGNN/blob/fb2505a3b6d606c4c8969a7dfc8c926fca1d571c/crawler/src/main.rs#L292-L311
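For example, a minimal pymongo sketch of such a direct insertion could look like the following; the connection string, database name and depth value are placeholders, while url, depth and is_phishing are the fields from the insertion logic linked above.

```python
# Minimal sketch: bulk-insert pages so the core worker schedules them.
# Connection string, database name and depth are placeholders; unlike the
# `add` command, this does not check for already existing entries.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["phishing"]  # placeholder database name

docs = [
    {"url": url, "depth": 0, "is_phishing": True}
    for url in ("https://phish.example.com", "https://scam.example.org")
]
db.pages.insert_many(docs)
```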

The core worker logic is here: https://github.com/TristanBilot/phishGNN/blob/fb2505a3b6d606c4c8969a7dfc8c926fca1d571c/crawler/src/main.rs#L221-L229


If you'd like to extend (or simply better understand) the crawler part of the project (which I'm most familiar with) I'd be happy to help, but please note that I have little bandwidth to dedicate to this.

Good luck making the Web3 ecosystem safer!

BlockMageSec commented 1 year ago

Thank you for your comprehensive response and offer! I certainly understand the bandwidth situation myself; no worries there.

I feel confident enough at the moment that I will attempt some of these changes myself, but I will reach out if necessary 🙏🏼

I will close this issue now since everything seems to be resolved.