VIDA-NYU / ache

ACHE is a web crawler for domain-specific search.
http://ache.readthedocs.io
Apache License 2.0

TorProxy not connecting #220

Closed 3v1lb1t closed 2 years ago

3v1lb1t commented 3 years ago

I am working on a project to index content over Tor. I have the docker-compose file and the ACHE configuration file set up, but torproxy appears to be failing to build a circuit, which causes the crawler to abort downloads with Fetch_Duration_Exceeded. I have tried a few different configurations of the Docker network and of the virtual machine where the crawler runs. I started a VPN on the VM to see whether it would help, but it didn't. I also tried installing Tor on the VM and torifying all commands through the shell, which didn't work either.

aecio commented 3 years ago

Does the example in https://github.com/VIDA-NYU/ache/tree/master/config/config_docker_tor work fine?

3v1lb1t commented 3 years ago

It appears the fetch duration is too short. I did a local install of ACHE with a Docker torproxy running and was able to crawl pages successfully. While testing this, it took quite some time for the Tor circuit to be created; I'm not sure exactly how long, but upwards of 2 minutes. I basically just started the torproxy machine, pointed my browser at it as a proxy, then waited for something to load through it.
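
The manual check described above (start the proxy, then wait until something loads through it) can be scripted. What follows is only a minimal sketch, not part of ACHE: the helper names `probe_via_proxy` and `wait_until_ready` are hypothetical, and the proxy address and test URL in the usage note are assumptions that depend on how your torproxy container is set up.

```python
import http.client
import time
import urllib.request
from typing import Callable


def probe_via_proxy(proxy_url: str, test_url: str, timeout_s: float = 30.0) -> bool:
    """Return True if test_url can be fetched through the given HTTP proxy.

    proxy_url and test_url are supplied by the caller; nothing here is
    specific to ACHE or to any particular torproxy image.
    """
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    )
    try:
        with opener.open(test_url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 400
    except (OSError, http.client.HTTPException):
        # Covers URLError/HTTPError, timeouts, and refused or dropped connections.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    max_wait_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call probe() repeatedly until it succeeds or max_wait_s would be exceeded."""
    deadline = clock() + max_wait_s
    while True:
        if probe():
            return True
        # Give up if another full interval would pass the deadline.
        if clock() + interval_s > deadline:
            return False
        sleep(interval_s)
```

For example, assuming the proxy exposes an HTTP port at `http://localhost:8118` (an assumption; check your docker-compose file), `wait_until_ready(lambda: probe_via_proxy("http://localhost:8118", "http://example.com/"))` returns `True` once a page loads through the proxy, at which point the crawl can be started.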

aecio commented 3 years ago

There are some fetch timeout settings for the Tor fetcher that you can add to ache.yml and that may help:

crawler_manager.downloader.tor.max_retry_count: 3
crawler_manager.downloader.tor.socket_timeout: 300000
crawler_manager.downloader.tor.connection_timeout: 300000
crawler_manager.downloader.tor.connection_request_timeout: 300000

All configurations for the downloader are listed in this class: HttpDownloaderConfig.java

3v1lb1t commented 3 years ago

Perfect, thank you

aecio commented 2 years ago

Closing this issue. Documentation on timeouts has been added: https://ache.readthedocs.io/en/latest/http-fetchers.html#setting-connection-timeouts