3v1lb1t closed this issue 2 years ago
Does the example in https://github.com/VIDA-NYU/ache/tree/master/config/config_docker_tor work fine?
It appears the fetch timeout is too short. I did a local install of ACHE with a Docker torproxy container running and was able to crawl requests successfully. When I was testing this, it took quite some time for the Tor circuit to be created. I'm not sure exactly how long, but upwards of 2 minutes. I just started the torproxy container, pointed my browser at it as a proxy, then waited for something to load through it.
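The manual check described above (start the proxy, then wait until something loads through it) can be automated. The following is a minimal sketch, not part of ACHE: `probe_via_proxy` and `wait_until_ready` are hypothetical helper names, and the proxy address and test URL are whatever your setup uses.

```python
import http.client
import time
import urllib.request
from typing import Callable


def probe_via_proxy(proxy_url: str, test_url: str, timeout_s: float = 30.0) -> bool:
    """Return True if test_url can be fetched through the given HTTP proxy."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    )
    try:
        with opener.open(test_url, timeout=timeout_s) as response:
            return 200 <= response.status < 400
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, and socket timeouts are OSError subclasses.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    max_wait_s: float = 300.0,
    interval_s: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() repeatedly until it succeeds or max_wait_s has elapsed."""
    deadline = clock() + max_wait_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

For example, `wait_until_ready(lambda: probe_via_proxy("http://torproxy:8118", "http://example.com/"))` would block for up to five minutes before the crawler is started. The address `http://torproxy:8118` is an assumption about the Docker setup; substitute the host and port your torproxy container actually exposes.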
The Tor fetcher has several timeout settings that you can add to ache.yml, which may help:
crawler_manager.downloader.tor.max_retry_count: 3
crawler_manager.downloader.tor.socket_timeout: 300000
crawler_manager.downloader.tor.connection_timeout: 300000
crawler_manager.downloader.tor.connection_request_timeout: 300000
All configurations for the downloader are listed in this class: HttpDownloaderConfig.java
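To pick a different timeout, the settings above can be generated from a duration. This is a small sketch that assumes the timeout values are in milliseconds, so that 300000 corresponds to 5 minutes; the unit is not stated in this thread, so confirm it against HttpDownloaderConfig.java. The function name `tor_timeout_settings` is hypothetical.

```python
# Timeout keys taken from the settings listed above.
TOR_TIMEOUT_KEYS = (
    "socket_timeout",
    "connection_timeout",
    "connection_request_timeout",
)


def tor_timeout_settings(minutes: float, max_retry_count: int = 3) -> str:
    """Build the ache.yml lines for the Tor fetcher timeouts.

    Assumption: the timeout values are expressed in milliseconds.
    """
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    millis = int(round(minutes * 60 * 1000))
    prefix = "crawler_manager.downloader.tor"
    lines = [f"{prefix}.max_retry_count: {max_retry_count}"]
    lines += [f"{prefix}.{key}: {millis}" for key in TOR_TIMEOUT_KEYS]
    return "\n".join(lines)
```

Calling `print(tor_timeout_settings(5))` reproduces the four lines suggested above, which can then be pasted into ache.yml.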
Perfect, thank you
Closing this issue. Documentation regarding timeouts has been added to the docs: https://ache.readthedocs.io/en/latest/http-fetchers.html#setting-connection-timeouts
I am currently working on a project to index content over Tor. I have the docker-compose file and the ACHE configuration file set up, but torproxy appears to be failing to establish a circuit, which causes the crawler to abort downloads with Fetch_Duration_Exceeded. I have tried a few different configurations of the Docker network, as well as of the virtual machine where the crawler is running. I started a VPN on the VM to see if that would help, but it didn't. I also tried installing Tor on the virtual machine and torifying all commands through the shell, which didn't work either.