GoSecure / freshonions-torscraper

Fresh Onions is an open source TOR spider / hidden service onion crawler
GNU Affero General Public License v3.0

Installation Issues #17

Open · castlelords opened this issue 4 years ago

castlelords commented 4 years ago

Hi - I followed the updated readme and it would appear that all the tests work.

```
curl -v --socks5-hostname 127.0.0.1:9051 http://.onion
curl -v --socks5-hostname 127.0.0.1:9054 http://.onion
curl -v --proxy 127.0.0.1:3129 http://.onion
curl -v --proxy 127.0.0.1:3132 http://.onion
```

All of them return HTML.

When harvest.sh starts, it runs fine at first. I then notice that I get a lot of 403: Forbidden errors.

```
https://www.deepwebsiteslinks.com/tor-emails-chat-rooms-links/
Resolving www.deepwebsiteslinks.com (www.deepwebsiteslinks.com)...
Connecting to www.deepwebsiteslinks.com (www.deepwebsiteslinks.com)|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2019-09-01 09:51:50 ERROR 403: Forbidden.
```
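
One thing I haven't ruled out is the site simply blocking wget's default User-Agent; a quick test would be something like this (the Mozilla string is just an example):

```bash
# If this succeeds where plain wget gets a 403, the site is rejecting
# wget's default User-Agent rather than the connection itself.
wget --user-agent="Mozilla/5.0 (X11; Linux x86_64)" \
     "https://www.deepwebsiteslinks.com/tor-emails-chat-rooms-links/"
```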

I was wondering if this is an issue with running it through Whonix?

I let it continue to run and then get to the section where it writes to STDOUT:

```
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
```

This takes around 2 days to complete, and I get a mixture of transfers that reach 100% received and failures like this:

```
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (7) Failed to receive SOCKS5 connect request ack.
```

Is this due to the onion link being dead?

Anyone interested in working together, drop me a line. This is a project I would like to continue with.

L3houx commented 4 years ago

Hi @castlelords!

Did you install the project using docker or the manual way?

The error you get is linked to unreachable sites or onions. At the beginning of the crawling process, the crawler goes through a list of sites to collect onion links. The error means that one or more of the sites in that list are unreachable, which happens when a site is no longer hosted or the connection to it takes too long.
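
As a rough sketch of what that harvesting step boils down to (this is an illustration, not the project's actual code; the directory URL and the proxy port are just the ones from your tests):

```bash
# Fetch one clearnet directory page through the local HTTP proxy and
# pull out any v2 onion hostnames (16 base32 characters) it mentions.
curl -s --proxy 127.0.0.1:3129 \
     "https://www.deepwebsiteslinks.com/tor-emails-chat-rooms-links/" \
  | grep -oE '[a-z2-7]{16}\.onion' \
  | sort -u
```

If a site in that list is down, the fetch for it fails and you see the kind of error you posted.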

I think that it's supposed to work on Whonix, but I can't confirm it. I suspect the 2 days is linked to your setup, because you are passing everything through Tor (normal sites as well as onions).

If you look at the Web interface, you should see a list of onions. If you don't, that means that there is another problem.

Sorry for the delay

castlelords commented 4 years ago

Hi L3houx

Thank you for taking the time to reply.

I set it up manually. The docker setup didn't go through 100% without errors, and I wanted to learn the process step by step.

Since posting, I have run the setup both through Whonix and without it, and the same issue happened, which rules out Whonix. My setup is very basic: VirtualBox on a Windows host running Ubuntu, which runs the scraper and the db. Do you have any instructions for installing the infrastructure you recommend in your readme at all?

When I look at the web interface, there are 'no results displayed'. So, like you say, another problem has occurred. I will work on reducing the delay that is causing the timeouts.
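
For the timeouts, my plan is to be more generous with the retry settings on the fetches; assuming the harvest script uses wget, something along these lines (I haven't confirmed the exact flags the script uses today):

```bash
# Hypothetical tweak: be more patient with slow connections.
# --timeout covers the DNS/connect/read phases, --tries retries
# failures, and --waitretry backs off between attempts.
wget --timeout=60 --tries=3 --waitretry=10 \
     -e use_proxy=on -e https_proxy=127.0.0.1:3129 \
     "https://www.deepwebsiteslinks.com/tor-emails-chat-rooms-links/"
```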

Will the freshonions harvest.sh use tor2web for the whole process or just the initial scrape of onion addresses?

Castlelords

L3houx commented 4 years ago

Hi @castlelords!

Right now, I run the docker version of Freshonions-Torscraper on my laptop, which runs Ubuntu 19.04, and it is all working well. Maybe having a machine with more than 16 GB of DDR4 would be a good idea.

The crawler goes through every page and visits all the links it finds. So when it finds an image or a PDF (something other than a web page), it causes an error. These errors are acceptable and without repercussions, but they will need to be fixed.

I would recommend testing with only one valid URL in the harvest script. Comment out all the others and try to harvest with only that one URL. This way, you will be able to see whether the whole process works or not. If not, it could mean that something is missing from the initial installation.
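
For example (an illustration of the idea, not the actual contents of harvest.sh):

```bash
#!/bin/bash
# harvest.sh cut down to a single source for testing. The commented
# URLs are placeholders; the real script's list will look different.

#wget ... "https://some-onion-directory.example/"        # disabled
#wget ... "https://another-directory.example/links/"     # disabled
wget --timeout=60 "https://www.deepwebsiteslinks.com/tor-emails-chat-rooms-links/"
```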

I had a quick look at tor2web, and I think it's only used during the harvesting process.
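
For context, a tor2web gateway lets a plain HTTP client reach a hidden service without running a Tor client, by appending the gateway's domain to the onion hostname. For example (a made-up v2 address, and onion.to is just one public gateway; the one the scraper is configured with may differ):

```bash
# http://abcdefghijklmnop.onion/ reached through the onion.to gateway:
curl -v "https://abcdefghijklmnop.onion.to/"
```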

For your next post, please include screenshots of the errors; it will be easier for me to help you solve them.

L3houx

castlelords commented 4 years ago

Hi L3houx,

I will certainly try one link to test the process. If I still have an issue, I will post screenshots. I will also try a fresh install using docker to test that again, this time without Whonix.

I will post how I get on. Just out of interest, do you use a VM for your install?

castlelords

By way of an update: today I installed using the docker file and it looks like it went through okay. When I ran it, it continued straight through to the same point as before:

```
freshonions-torscraper-crawler |   Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (7) Failed to receive SOCKS5 connect request ack.
```

I have had it running for 6 hours now and will see what happens. I have written the process output to a file, so I have that if it's needed.

Interestingly, I am not able to connect to the ./web.sh address of http://0.0.0.0:5000/.
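
I realize 0.0.0.0 is a bind-all address rather than something to browse to, so my next checks will be along these lines (assuming the web app listens on port 5000 inside the VM):

```bash
# Confirm something is listening on 5000, then hit it on loopback.
ss -tlnp | grep 5000
curl -I http://127.0.0.1:5000/
# From the Windows host, the guest needs a VirtualBox port-forward
# rule (or a bridged/host-only adapter) before <vm-ip>:5000 works.
```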

castlelords commented 4 years ago

Hi,

Last night I left the process running, but at some point during the night it stopped working. I did the following actions:

  1. I edited harvest.sh to take out all onions apart from one line. The new .sh file is called old-harvest.sh.

  2. The instructions say don't re-run docker-compose; it's only done once, when all the containers are built for the first time. So after my restart I ran ./old-harvest.sh (see the sketch after this list).

[screenshot: Ubuntu 19.04 amd64 clean install (freshonions installed), running in Oracle VM VirtualBox, 08/09/2019 09:32]

  3. After a restart, should any other script be run for the scraper?
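
For reference, here is what I ran after the reboot (using docker-compose start to bring the existing containers back up is my assumption about the right workflow, not something from the readme):

```bash
# Bring the already-built containers back up after a reboot without
# rebuilding them; 'docker-compose up -d' would also reuse existing
# containers rather than recreating them from scratch.
sudo docker-compose start
./old-harvest.sh
```
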
L3houx commented 4 years ago

Hi,

If you didn't install the requirements (the Python packages), you need to do that; "Pony.orm" is a Python package.

Are you using the manual way or the docker installation? At the beginning you were talking about the manual way, but then you switched to the docker one.

If you decide to use the docker installation, you need to run "sudo docker-compose up". After that, the first time you start the infrastructure, you need to initialize Elasticsearch: [screenshot of the Elasticsearch initialization command]

If you initialize Elasticsearch several times, it will be emptied every time you do it. That is why I said that you only need to do it once.
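
Since the screenshot above may not come through here: I can't quote the project's exact initialization command, but the general shape of a one-time Elasticsearch index setup looks something like this, which also shows why re-running it wipes the data (the index name 'crawl' is a placeholder, not necessarily the project's real one):

```bash
# One-time setup (sketch): create the index the scraper writes to.
# 'crawl' is a placeholder index name, not necessarily the real one.
curl -X PUT "http://localhost:9200/crawl"

# Re-initialization typically means delete-and-recreate, which is why
# running the init step again empties Elasticsearch:
curl -X DELETE "http://localhost:9200/crawl"
curl -X PUT "http://localhost:9200/crawl"
```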