Only get the original link when crawling onion sites

DedSecInside / TorBot

Dark Web OSINT Tool


Only get the original link when crawling onion sites #334

Closed 0xEnders closed 6 months ago

0xEnders commented 7 months ago

Hi guys,

I was following the guide step by step. However, when I tried crawling a particular link, I only got that link returned, even though manually navigating the site over Tor shows that there are multiple other links. I have tried a few different websites but still have the same issue. I'm unsure whether it's because of my settings or a bug.

Please advise.

KingAkeem commented 7 months ago

What's the link, so that I can try to reproduce it? Also, can you provide more information, such as:

Operating System
Which version of TorBot that you're using?
How you're executing the application?
TOR configuration

0xEnders commented 7 months ago

Thanks for the quick reply!

I am trying these links:

http://alphvmmm27o3abo3r2mlmjrpdmzle3rykajqc5xsj7j7ejksbpsa36ad.onion/
http://noescapemsqxvizdxyl7f7rmg5cdjwp33pg2wpmiaaibilb4btwzttad.onion/

Operating System: Ubuntu 22
Which version of TorBot that you're using?: current dev version. I git cloned it.
How you're executing the application?: python3 torbot -u http://website.onion --depth 2
TOR configuration: default config, installed and started with:
sudo apt install tor
sudo service tor start

Also, is there a way to crawl based on a text file of email addresses?

KingAkeem commented 7 months ago

You're welcome, and thanks for providing the information. I'll look into it later today or sometime this week. There is no feature to crawl email addresses. The current program operates on HTML retrieved from sites, so I don't know how that would be possible with email addresses. If you have suggestions for a new feature, feel free to submit a ticket and it'll be looked into. If you already know how the feature should be implemented, you can take a crack at it and submit a pull request to the repo.

0xEnders commented 7 months ago

Correction: a text file of websites, not email addresses. And thanks for looking into it. I'll go and mess around with the settings and see what happens. Two other things:

Is it recommended to amend the torrc config file? I didn't touch it at all.
Can I get a link to the Slack channel? The link on the main page has expired.

Thanks once again!

KingAkeem commented 7 months ago

It's your choice. I've created CLI flags to dynamically define the SOCKS5 proxy when instantiating the HTTPS client.
The link should still work, but the Slack channel is not highly used. If you have suggestions, thoughts, or problems, you'll likely get the quickest response by posting here.
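
For readers wondering what "defining the SOCKS5 proxy" amounts to on the client side, here is a minimal generic sketch. It is not TorBot's own client code or its CLI flags. The helper name `tor_proxies` is hypothetical, and the defaults assume a stock Tor service listening on 127.0.0.1:9050.

```python
def tor_proxies(host="127.0.0.1", port=9050):
    """Build a proxy mapping that routes HTTP(S) traffic through a SOCKS5 proxy.

    Assumption: a default Tor install listens for SOCKS connections on
    127.0.0.1:9050. The "socks5h" scheme asks the proxy to resolve hostnames,
    which is needed for .onion addresses.
    """
    proxy = f"socks5h://{host}:{port}"
    return {"http": proxy, "https": proxy}


# Example use with the third-party "requests" library (needs requests[socks]):
#   import requests
#   requests.get("http://website.onion", proxies=tor_proxies())
```

With a mapping like this, the proxy can be chosen at run time, so torrc can stay at its defaults unless Tor itself needs to listen elsewhere.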

0xEnders commented 7 months ago

There's no way for us to crawl multiple websites at once, right?

KingAkeem commented 7 months ago

Not currently. It'd probably be a fairly straightforward feature to implement, but no one has requested it. If you want to know what's possible or not, check the README. If you have ideas or suggestions, create a new ticket.

KingAkeem commented 7 months ago

Or build it out yourself and submit it if you're capable.
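
Until such a feature exists, one possible workaround is a small wrapper that reads a text file of URLs and runs the existing command once per site. This is only a sketch: the helper names and the file name are hypothetical, and the command line is the one quoted earlier in this thread (`python3 torbot -u <url> --depth 2`).

```python
import subprocess
from pathlib import Path


def read_urls(path):
    """Return non-empty, non-comment lines from a text file, one URL per line."""
    urls = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


def build_commands(urls, depth=2):
    """Build one TorBot invocation per URL, using the command from this thread."""
    return [["python3", "torbot", "-u", url, "--depth", str(depth)] for url in urls]


def crawl_all(path, depth=2):
    """Run the crawls one after another; a failed crawl does not stop the rest."""
    for cmd in build_commands(read_urls(path), depth):
        subprocess.run(cmd, check=False)
```

Calling `crawl_all("sites.txt")` from the TorBot checkout would crawl the listed sites sequentially rather than in parallel.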

KingAkeem commented 7 months ago

I checked the URLs. The reason it's only returning the host domain is that all of the links on those pages are paths within the same domain, not links to different sites. The scraper looks for unique host domains that are fully qualified URIs, so same-domain paths are not reported.
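
To illustrate the distinction, here is a minimal sketch using only the Python standard library. It is not TorBot's actual scraper; the class and function names and the sample HTML are made up. It resolves each `href` against the page URL and separates same-host paths from links to other hosts.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def split_links(base_url, html):
    """Return (same_host, other_hosts) lists of absolute http(s) URLs."""
    parser = LinkCollector()
    parser.feed(html)
    base_host = urlparse(base_url).netloc
    same_host, other_hosts = [], []
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)  # turns "/blog" into a full URL
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if parsed.netloc == base_host:
            same_host.append(absolute)
        else:
            other_hosts.append(absolute)
    return same_host, other_hosts


sample = (
    '<a href="/blog">Blog</a> '
    '<a href="contact.html">Contact</a> '
    '<a href="http://other.onion/">Other</a>'
)
same, other = split_links("http://example.onion/", sample)
```

A crawler that only keeps links to other hosts would return nothing new for a page whose links are all relative paths, which matches the behavior reported in this issue.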

KingAkeem commented 7 months ago

I'll look into modifying the feature to identify paths.