Closed · tpolo777 closed this issue 2 years ago
Yes, the docs are up to date.
Whenever a crawl is started, a web server is started by default on port 8080. ACHE will print the address in the logs. When you open it in a browser, you can view crawler statistics as well as search the crawled content. The search will only work if you have configured Elasticsearch. The sample configuration in {ACHE_ROOT}/config/config_docker includes an Elasticsearch setup using Docker; you can use it to try out the search feature.
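For reference, a crawl using that sample configuration could be started roughly as follows; the seed file, output path, and model path below are placeholders for your own files:

```bash
# start a focused crawl with the Docker sample configuration
# (replace seeds.txt, data-output, and model with your own paths)
ache startCrawl -c config/config_docker -s seeds.txt -o data-output -m model

# while the crawl runs, the monitoring/search interface is served at:
#   http://localhost:8080
```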
Finally, ACHE also stores some TSV files in its output folder. One of them, relevantpages.csv, includes only the pages classified as relevant by the page classifier you provided.
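Regarding excluding the seed URLs from that file: as far as I know there is no built-in option for this, but you can filter the seeds out afterwards. A minimal sketch, assuming relevantpages.csv is tab-separated with the URL in the first column, seeds.txt lists one URL per line, and the file lives under the crawl's data_monitor folder (verify all three against your output):

```bash
# keep only rows of relevantpages.csv whose URL (column 1) is NOT listed in seeds.txt
# paths and column layout are assumptions; adjust to your crawl's output folder
awk -F'\t' 'NR==FNR { seeds[$0]; next } !($1 in seeds)' \
    seeds.txt \
    data-output/default/data_monitor/relevantpages.csv \
    > relevantpages_without_seeds.csv
```

Note that this uses exact string matching, so trailing slashes or whitespace in either file will prevent a URL from being filtered out.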
On the main page of the ACHE repo, the description reads "Web interface for searching crawled pages in real-time", but when I started running ACHE there was no interface. (Maybe I misunderstood something.) I also got stuck for a bit when creating a configuration file. Is this documentation up to date? (https://ache.readthedocs.io/en/latest/index.html) How can I set ACHE to return a CSV file of relevant pages that excludes the URLs set in the seed file? It is quite time-consuming to find new URLs in a harvest of 15,000+ pages. My aim is to find databases of scientific publications/books; do you have any suggestions on how to increase accuracy? Thank you for your quick support and help! You are doing a great job.
Here is my configuration file: