alephdata / memorious

Lightweight web scraping toolkit for documents and structured data.
https://docs.alephdata.org/developers/memorious
MIT License

proxy handler for memorious #125

Closed ghost closed 4 years ago

ghost commented 4 years ago

Hi guys,

Hope you are all well!

Is it possible to set up a proxy (socks5) for memorious crawlers?

Cheers, X

sunu commented 4 years ago

Hi, you can use the session op to make memorious use a proxy for subsequent requests. See https://memorious.readthedocs.io/en/latest/buildingcrawler.html#session

ghost commented 4 years ago

# Scraper for the OCCRP web site.
# The goal is not to download all HTML, but only PDFs & other documents
# linked from the page as proof.
name: fmc

# A title for display in the UI:
description: 'FMC Organization'

# Uncomment to run this scraper automatically:
# schedule: weekly
pipeline:

  session:
    user_agent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36"
    proxy: "socks5://localhost:5566"

  init:
    # This first stage will get the ball rolling with a seed URL.
    method: seed
    params:
      urls:
        - http://www.fmc.com/
    handle:
      pass: fetch

  fetch:
    # Download the seed page
    method: fetch
    params:
      # These rules specify which pages should be scraped or included:
      rules:
        and:
          - domain: fmc.com
    handle:
      pass: parse

  parse:
    # Parse the scraped pages to find if they contain additional links.
    method: parse
    params:
      # Additional rules to determine if a scraped page should be stored or not.
      # In this example, we're only keeping PDFs, word files, etc.
      store:
        or:
          - mime_group: archives
          - mime_group: documents
    handle:
      store: store
      # this makes it a recursive web crawler:
      fetch: fetch

  store:
    # Store the crawled documents to a directory
    method: directory
    params:
      path: /data/results

Does this look correct to you if written like that?

sunu commented 4 years ago

Hi, I have added an example for proxy configuration here: https://memorious.readthedocs.io/en/latest/buildingcrawler.html#session

You want something like this:

name: fmc

description: 'FMC Organization'

pipeline:

  init:
    method: session
    params:
      user_agent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36"
      proxy: "socks5://localhost:5566"
    handle:
      pass: seed

  seed:
    # This first stage will get the ball rolling with a seed URL.
    method: seed
    params:
      urls:
        - http://www.fmc.com/
    handle:
      pass: fetch
...

The first stage of the pipeline has to be named init unless explicitly overridden.
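
To illustrate the difference between the two configurations in this thread, here is a rough sketch of a pipeline check in Python. It is not memorious's own validation; `check_pipeline` is a hypothetical helper, and the two dictionaries are trimmed hand-copies of the YAML above (the stages after `seed` in the corrected version are assumed to match the original config, since they were elided). It flags stages without a `method`, handles that point to unknown stages, and stages that can never be reached from `init`.

```python
def check_pipeline(pipeline, entry="init"):
    """Return a list of problems found in a pipeline mapping (hypothetical helper)."""
    problems = []
    if entry not in pipeline:
        problems.append(f"no entry stage named {entry!r}")
    for name, stage in pipeline.items():
        if "method" not in stage:
            problems.append(f"stage {name!r} has no method")
        for rule, target in stage.get("handle", {}).items():
            if target not in pipeline:
                problems.append(
                    f"stage {name!r} passes {rule!r} to unknown stage {target!r}"
                )
    if entry in pipeline:
        # Follow the handle links from the entry stage to find unreachable stages.
        seen = set()
        todo = [entry]
        while todo:
            name = todo.pop()
            if name in seen:
                continue
            seen.add(name)
            targets = pipeline[name].get("handle", {}).values()
            todo.extend(t for t in targets if t in pipeline)
        for name in pipeline:
            if name not in seen:
                problems.append(f"stage {name!r} is never reached from {entry!r}")
    return problems


# Trimmed copy of the first attempt: "session" is a bare block, not a stage.
first_attempt = {
    "session": {"proxy": "socks5://localhost:5566"},
    "init": {"method": "seed", "handle": {"pass": "fetch"}},
    "fetch": {"method": "fetch", "handle": {"pass": "parse"}},
    "parse": {"method": "parse", "handle": {"store": "store", "fetch": "fetch"}},
    "store": {"method": "directory"},
}

# Trimmed copy of the corrected layout: the session op runs first as "init".
corrected = {
    "init": {
        "method": "session",
        "params": {"proxy": "socks5://localhost:5566"},
        "handle": {"pass": "seed"},
    },
    "seed": {"method": "seed", "handle": {"pass": "fetch"}},
    "fetch": {"method": "fetch", "handle": {"pass": "parse"}},
    "parse": {"method": "parse", "handle": {"store": "store", "fetch": "fetch"}},
    "store": {"method": "directory"},
}

for label, pipeline in (("first attempt", first_attempt), ("corrected", corrected)):
    print(label, check_pipeline(pipeline))
```

In the first attempt the `session` block has no `method` and nothing hands off to it, so the proxy settings would never be applied; in the corrected layout the session op is the entry stage and passes on to `seed`.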

For now, you may also need to install PySocks as an extra dependency to use socks5 proxies. We'll include it in memorious by default starting with the next release.
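
If you want to check up front whether the extra dependency is needed, something along these lines may help. It is only a sketch using the Python standard library: `socks_support_missing` is a hypothetical helper name, and the check relies on the fact that the PySocks package is imported as the `socks` module.

```python
import importlib.util
from urllib.parse import urlsplit


def socks_support_missing(proxy_url):
    """Return True if proxy_url uses a SOCKS scheme and PySocks is not importable."""
    scheme = urlsplit(proxy_url).scheme.lower()
    if not scheme.startswith("socks"):
        # Plain http/https proxies do not need PySocks.
        return False
    # PySocks installs a top-level module named "socks".
    return importlib.util.find_spec("socks") is None


if socks_support_missing("socks5://localhost:5566"):
    print("PySocks not found; try: pip install PySocks")
else:
    print("SOCKS proxy dependencies look fine")
```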

ghost commented 4 years ago

Awesome, thanks a lot!

sunu commented 4 years ago

Closing this since the issue has now been addressed.