VIDA-NYU / ache

ACHE is a web crawler for domain-specific search.
http://ache.readthedocs.io
Apache License 2.0

Some questions Ubuntu 16.04 #172

Closed tpolo777 closed 2 years ago

tpolo777 commented 5 years ago

On the main page of the ACHE repo, the description reads "Web interface for searching crawled pages in real-time", but when I started running ACHE there was no interface. (Maybe I misunderstood something.) I also got stuck a little when creating a configuration file. Is this documentation up to date? (https://ache.readthedocs.io/en/latest/index.html)

How can I set ACHE to return a CSV file of relevant pages that excludes the URLs already listed in the seed file? It is quite time-consuming to find new URLs among a harvest of 15000+ pages.

My aim is to find databases with scientific publications/books. Do you have any suggestions on how to increase accuracy? Thank you for your quick support and help! You are doing a great job.

Here is my Configuration File

#
# Example of configuration for running a Focused Crawl
#

# Store pages classified as irrelevant pages by the target page classifier
target_storage.store_negative_pages: true

# Limit the max number of pages crawled per domain, in order to avoid crawling
# too many pages from the same domain and favor discovery of new domains
link_storage.max_pages_per_domain: 10000

# Disable "seed scope" to allow crawl pages from any domain
link_storage.link_strategy.use_scope: false

# Set the initial link classifier to a simple one
link_storage.link_classifier.type: MaxDepthLinkClassifier
link_storage.link_classifier.max_depth: 3
# Train a new link classifier while the crawler is running. This allows
# the crawler automatically learn how to prioritize links in order to
# efficiently locate relevant content while avoiding the retrieval of
# irrelevant content.
link_storage.online_learning.enabled: true
link_storage.online_learning.type: FORWARD_CLASSIFIER_BINARY
link_storage.online_learning.learning_limit: 1000

# Always select top-k links with highest priority to be scheduled
link_storage.link_selector: TopkLinkSelector

# Configure the minimum time interval (in milliseconds) to wait between requests
# to the same host to avoid overloading servers. If you are crawling your own
# web site, you can decrease this value to speed up the crawl.
link_storage.scheduler.host_min_access_interval: 5000

# Configure the User-Agent of the crawler
crawler_manager.downloader.user_agent.name: ACHE
crawler_manager.downloader.user_agent.url: https://github.com/ViDA-NYU/ache

aecio commented 5 years ago

Yes, the docs are up to date.

Whenever a crawl is started, a web server is started by default on port 8080; ACHE prints the address in the logs. When you open it in a browser, you can see crawler statistics as well as search the crawled content. The search will only work if you have configured Elasticsearch. The sample configuration in {ACHE_ROOT}/config/config_docker includes an Elasticsearch setup using Docker, which you can use to try out the search feature.
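For reference, enabling the Elasticsearch backend amounts to a couple of extra keys in the crawl's config file. This is only a sketch: verify the exact key names and the Elasticsearch address against config/config_docker in your ACHE version (http://localhost:9200 below is an assumed local address).

```yaml
# Sketch only: check config/config_docker for the exact keys in your version
target_storage.data_formats: [ELASTICSEARCH]
# Assumed local Elasticsearch address; change to match your setup
target_storage.data_format.elasticsearch.rest.hosts: ["http://localhost:9200"]
```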

Finally, ACHE also stores some TSV files in its output folder. One of them, relevantpages.csv, includes only the pages classified as relevant by the page classifier you provided.
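Regarding the question about excluding seed URLs from the relevant-pages list: since the output file is tab-separated with the URL in the first column, standard shell tools can filter out seeds after the crawl. A minimal sketch (the file names and contents below are toy stand-ins, not ACHE's actual output paths):

```shell
# Toy stand-ins so the example is self-contained; in a real crawl, use your
# seeds file and the relevantpages.csv from ACHE's output directory.
printf 'http://seed1.example\nhttp://seed2.example\n' > seeds.txt
printf 'http://seed1.example\t1.0\nhttp://new.example/page\t0.9\n' > relevantpages.csv

# Take the URL column (cut defaults to tab-delimited), then drop any line
# that exactly matches a seed URL (-F fixed strings, -x whole line, -v invert).
cut -f1 relevantpages.csv | grep -v -x -F -f seeds.txt > relevant_urls.txt
```

After this, relevant_urls.txt contains only the newly discovered relevant URLs.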