ghost closed this issue 4 years ago
Hi, you can use the `session` method to make memorious use a proxy for subsequent requests. See https://memorious.readthedocs.io/en/latest/buildingcrawler.html#session
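As a quick sanity check on the proxy value before dropping it into the crawler config, the URL can be split with the standard library (a hypothetical helper for illustration, not part of memorious):

```python
from urllib.parse import urlsplit

def check_proxy_url(url):
    """Split a proxy URL like socks5://localhost:5566 into its parts."""
    parts = urlsplit(url)
    if parts.scheme not in ("socks5", "socks5h", "http", "https"):
        raise ValueError("unsupported proxy scheme: %s" % parts.scheme)
    if parts.port is None:
        raise ValueError("proxy URL is missing a port")
    return parts.scheme, parts.hostname, parts.port

print(check_proxy_url("socks5://localhost:5566"))
# ('socks5', 'localhost', 5566)
```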
```yaml
# Scraper for the OCCRP web site.
# The goal is not to download all HTML, but only PDFs & other documents
# linked from the page as proof.
name: fmc
# A title for display in the UI:
description: 'FMC Organization'
# Uncomment to run this scraper automatically:
# schedule: weekly
pipeline:
  session:
    user_agent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36"
    proxy: "socks5://localhost:5566"
  init:
    # This first stage will get the ball rolling with a seed URL.
    method: seed
    params:
      urls:
        - http://www.fmc.com/
    handle:
      pass: fetch
  fetch:
    # Download the seed page
    method: fetch
    params:
      # These rules specify which pages should be scraped or included:
      rules:
        and:
          - domain: fmc.com
    handle:
      pass: parse
  parse:
    # Parse the scraped pages to find if they contain additional links.
    method: parse
    params:
      # Additional rules to determine if a scraped page should be stored or not.
      # In this example, we're only keeping PDFs, word files, etc.
      store:
        or:
          - mime_group: archives
          - mime_group: documents
    handle:
      store: store
      # this makes it a recursive web crawler:
      fetch: fetch
  store:
    # Store the crawled documents to a directory
    method: directory
    params:
      path: /data/results
```
Does that look correct to you, written like that?
Hi, I have added an example for proxy configuration here: https://memorious.readthedocs.io/en/latest/buildingcrawler.html#session
You want something like this:
```yaml
name: fmc
description: 'FMC Organization'
pipeline:
  init:
    method: session
    params:
      user_agent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36"
      proxy: "socks5://localhost:5566"
    handle:
      pass: seed
  seed:
    # This first stage will get the ball rolling with a seed URL.
    method: seed
    params:
      urls:
        - http://www.fmc.com/
    handle:
      pass: fetch
  ...
```
The first stage of the pipeline has to be named `init`, unless explicitly overridden.
And you may need to install PySocks as an extra dependency to use SOCKS5 proxies for now. We'll include it in memorious by default from the next release.
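For instance, assuming a pip-based install (the package is published on PyPI as PySocks):

```shell
# PySocks provides the socks5:// scheme support used by requests,
# which memorious relies on for HTTP fetching.
pip install pysocks
```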
Awesome, thanks a lot!
Closing this since the issue has been addressed now.
Hi guys,
Hope you are all well!
Is it possible to set up a SOCKS5 proxy for memorious crawlers?
Cheers, X