Lookyloo / lookyloo

Lookyloo is a web interface that allows users to capture a website page and then display a tree of domains that call each other.
https://www.lookyloo.eu

[Feature]: Extract specific TLD from submitted URLs #936

Open adulau opened 1 day ago

adulau commented 1 day ago

Is your feature request related to a problem? Please describe.

Yes: to automatically get, via the API or the command line, the list of domains per TLD.

Describe the solution you'd like

A command-line tool or an API endpoint.

Describe alternatives you've considered

No response

Additional context

No response

Rafiot commented 1 day ago

Just a note on that: the indexes are stored in sets instead of sorted sets, which makes pagination impossible on big datasets (we get a big list of UUIDs without knowing whether they're old or new).

The TLDs will be the first of the indexes to be stored in a sorted set (UUID scored by capture timestamp), but all the indexes will need to be migrated to the new format over time.
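A minimal sketch of why the sorted set helps, assuming redis-py and an illustrative key layout (neither is Lookyloo's actual schema):

```python
import time
import redis

r = redis.Redis(decode_responses=True)

def index_capture(tld: str, capture_uuid: str, capture_ts: float) -> None:
    # Score each capture UUID with its capture timestamp so the index
    # can later be sliced by time (ZADD on a sorted set).
    r.zadd(f'indexes:tld:{tld}', {capture_uuid: capture_ts})

def recent_captures(tld: str, offset: int = 0, page_size: int = 50) -> list[str]:
    # Newest first, paginated: impossible with SMEMBERS on a plain set,
    # trivial with ZREVRANGE on a sorted set.
    return r.zrevrange(f'indexes:tld:{tld}', offset, offset + page_size - 1)

index_capture('com', 'capture-uuid-1', time.time())
print(recent_captures('com'))
```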

Rafiot commented 1 day ago

One capture has many URLs, so we have multiple TLDs. The quick and dirty implementation is to iterate over all the URLs in the capture, get the TLD, and store the UUID of the capture in the appropriate sorted set. The problem is that we then have no idea what the actual URL was, only that somewhere in that tree there is a URL with that TLD. Quick, yet maybe not very useful if we want to get a list of URLs with that TLD (note that we have the same issue with the URL and hostname indexes: we only know they're somewhere in that tree).
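A hedged sketch of that quick-and-dirty variant; `tldextract` and the key layout are illustrative assumptions, and the list of URLs stands in for however the capture tree exposes them:

```python
import redis
import tldextract

r = redis.Redis(decode_responses=True)

def index_capture_tlds(capture_uuid: str, urls: list[str], capture_ts: float) -> None:
    for url in urls:
        suffix = tldextract.extract(url).suffix  # e.g. 'com', 'co.uk'
        if not suffix:
            continue
        # Only the capture UUID is stored: we later know the capture
        # contains that TLD somewhere, but not which URL node it came from.
        r.zadd(f'indexes:tld:{suffix}', {capture_uuid: capture_ts})
```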

The other approach, which is the one used for the HTTP header hashes, is to store a tuple in the set (capture_uuid|urlnode_uuid); this way, we can get back to the node in the tree. It uses quite a bit more RAM (the com index will be an insanely huge set), but at least we can get back to the URL without searching the whole tree.
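A sketch of that tuple-member variant, again with an assumed key layout:

```python
import redis

r = redis.Redis(decode_responses=True)

def index_node(tld: str, capture_uuid: str, urlnode_uuid: str, capture_ts: float) -> None:
    # The member is 'capture_uuid|urlnode_uuid', so each entry points back
    # to the exact node in the tree, at the cost of far more members.
    r.zadd(f'indexes:tld:{tld}', {f'{capture_uuid}|{urlnode_uuid}': capture_ts})

def nodes_for_tld(tld: str, offset: int = 0, page_size: int = 50) -> list[tuple[str, str]]:
    members = r.zrevrange(f'indexes:tld:{tld}', offset, offset + page_size - 1)
    # Each member resolves directly to a node: no tree search needed.
    return [tuple(m.split('|', 1)) for m in members]
```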

Rafiot commented 18 hours ago

A few more notes on that: the implementation I landed on is going to be much more future-proof than what we have now, and it will involve a kind of layered indexing that works this way (see the sketch after the last comment):

  1. From a TLD, get all the captures with at least one URL using it (by capture_uuid)
  2. Then, from the UUID, use the (to be implemented) capture index that allows getting all the nodes with that TLD
  3. Get the URLs of the matching nodes

The capture index will also allow getting the nodes related to a specific HHHash, cookie, ...
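A sketch of that layered lookup; the per-capture index keys and the node-to-URL hash are hypothetical layouts, not Lookyloo's actual schema:

```python
import redis

r = redis.Redis(decode_responses=True)

def urls_for_tld(tld: str, offset: int = 0, page_size: int = 50) -> list[str]:
    urls: list[str] = []
    # 1. From a TLD, get the captures containing at least one URL with it.
    capture_uuids = r.zrevrange(f'indexes:tld:{tld}', offset, offset + page_size - 1)
    for capture_uuid in capture_uuids:
        # 2. From the capture UUID, use the per-capture index to get the
        #    nodes with that TLD (hypothetical key layout).
        node_uuids = r.smembers(f'capture_indexes:{capture_uuid}:tld:{tld}')
        # 3. Resolve the matching nodes to their URLs.
        for node_uuid in node_uuids:
            url = r.hget(f'capture_indexes:{capture_uuid}:urls', node_uuid)
            if url:
                urls.append(url)
    return urls
```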