VIDA-NYU / ache

ACHE is a web crawler for domain-specific search.
http://ache.readthedocs.io
Apache License 2.0

Question about configuration for using cookies on deep web site #171

Closed by pkoloveas 2 years ago

pkoloveas commented 5 years ago

I have the following ache.yml configuration file to crawl a deep web site with authentication:

target_storage.data_formats:
 - FILESYSTEM_HTML

target_storage.data_format.filesystem.compress_data: false

link_storage.link_strategy.use_scope: true
link_storage.link_strategy.outlinks: true
link_storage.scheduler.host_min_access_interval: 2000

crawler_manager.downloader.torproxy:  http://torproxy:8118

crawler_manager.downloader.cookies:
 - domain: http://qf7krcylpc4iswho.onion
   cookie: MARKET_SESSION=3d47ifekp8jd692t37svml3tn0

crawler_manager.downloader.user_agent.string: Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0

While the crawler successfully extracts all the links on the login page, it does not actually authenticate with the cookie, so I cannot crawl inside the website. Am I supposed to add something to the particular URL in the seeds file to indicate the cookie? Do I need to add something to the yml file to complete the configuration? Does the cookie feature work on deep web sites?

aecio commented 5 years ago

Trying to answer some of the questions:

That being said, the cookies feature did not work on 100% of the sites we tested. Apparently, some sites detect that the crawler's request differs from the browser's request and do not authorize it. Unfortunately, such cases are hard to debug. You need to find what differs between the browser and crawler requests to know why it is not working.
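
One way to look for such differences is to compare the headers of the two requests side by side. The sketch below is only an illustration and is not part of ACHE: `diff_headers` and both header sets are hypothetical, and it assumes the headers were captured by hand (for example, from the browser's developer tools and from a logging proxy placed in front of the crawler).

```python
def diff_headers(browser, crawler):
    """Return headers whose values differ between two requests.

    Header names are compared case-insensitively. Each entry in the
    result maps a lowercased header name to (browser_value, crawler_value),
    with None standing for a header that is missing on that side.
    """
    b = {name.lower(): value for name, value in browser.items()}
    c = {name.lower(): value for name, value in crawler.items()}
    diff = {}
    for name in sorted(set(b) | set(c)):
        if b.get(name) != c.get(name):
            diff[name] = (b.get(name), c.get(name))
    return diff


# Hypothetical captures: the extra Accept-Language header is made up
# to show what a difference looks like.
browser_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0",
    "Cookie": "MARKET_SESSION=3d47ifekp8jd692t37svml3tn0",
    "Accept-Language": "en-US,en;q=0.5",
}
crawler_headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0",
    "Cookie": "MARKET_SESSION=3d47ifekp8jd692t37svml3tn0",
}

for name, (b_val, c_val) in diff_headers(browser_headers, crawler_headers).items():
    print(f"{name}: browser={b_val!r} crawler={c_val!r}")
```

Any header that shows up in the result (a missing cookie, a different user agent, an extra header the browser sends) is a candidate for what the site is checking.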

pkoloveas commented 5 years ago

Thank you very much for your answer.

I have one more question regarding the matter: Has the cookies feature been tested with logins accompanied by CAPTCHA?

aecio commented 5 years ago

I'm not sure, we may have done that. But I think it should work as well.