VIDA-NYU / ache

ACHE is a web crawler for domain-specific search.
http://ache.readthedocs.io
Apache License 2.0

Crawler Execution Failed #207

Closed fatimasadiq closed 2 years ago

fatimasadiq commented 3 years ago

Hi, I'm new to the ACHE crawler and I'm trying to run the sample to see how the crawler collects data before I run my own crawl, but it is giving me the error below. I'm running on CentOS 7 with Docker.

Please help.

image

fatimasadiq commented 3 years ago

Hi, now I'm getting the attached output while running the crawler, and nothing is downloaded.

Screenshot 2021-07-06 at 14 49 03

aecio commented 3 years ago

For the first problem, you were probably mounting the Docker volume at the wrong directory, but you seem to have already fixed it.

For the second screenshot, the crawler ignores non-English pages by default. You can disable this feature by adding the following to the ache.yml file:

# Store only pages that contain english text using language detector
target_storage.english_language_detection_enabled: false

The sample config file at https://github.com/VIDA-NYU/ache/blob/master/config/sample_config/ache.yml has other configurations that may be useful.

aecio commented 3 years ago

The crawler also ignores non-HTML content by default (e.g., jpg images, as seen in the log). To allow other types of content, you need to add the following config to ache.yml (including any other mime-types that you need):

crawler_manager.downloader.valid_mime_types:
 - text/xml
 - text/html
 - text/plain
 - application/x-asp
 - application/xhtml+xml
 - application/vnd.wap.xhtml+xml
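
Putting the two settings from this thread together, an ache.yml fragment might look roughly like the sketch below. The `image/jpeg` entry is only an assumption, added to illustrate an extra mime-type; it is not part of the list above, so include it only if you actually want the crawler to download images.

```yaml
# Sketch of an ache.yml fragment combining the two settings discussed in this thread.

# Store pages regardless of language (turns off the English-only filter)
target_storage.english_language_detection_enabled: false

# Mime-types the downloader is allowed to fetch
crawler_manager.downloader.valid_mime_types:
 - text/xml
 - text/html
 - text/plain
 - application/x-asp
 - application/xhtml+xml
 - application/vnd.wap.xhtml+xml
 - image/jpeg   # assumption: illustrative extra type, add only if needed
```
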

fatimasadiq commented 3 years ago

Dear Aecio,

Thank you for the response. Let me try this and I will come back to this thread, so please don't close it.

aecio commented 2 years ago

Closing this issue. Feel free to open another issue if you find other problems.