
Support module scrapers #12

andykais commented 5 years ago

The dream here is to let other users maintain scrapers in a community repo, or in their own GitHub repos, and let developers simply install them via npm.

npm i scrape-pages @community-scrapers/twitter-feed @community-scrapers/twitter-login

ConfigInit:

scrape:
  module: '@community-scrapers/twitter-feed'

yields Config:

input:
  - '@community-scrapers/twitter-feed:username'
define:
  @community-scrapers/twitter-feed:feedpage: ...
  @community-scrapers/twitter-feed:post: ...
  @community-scrapers/twitter-feed:post-media: ...
scrape:
  module: '@community-scrapers/twitter-feed'

Definitions in the local define block can override those inside a module's define block.
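
A rough sketch of that override rule, assuming define blocks are merged as plain objects (the real merge may be more involved):

// hypothetical: local keys shadow module keys of the same name
const mergedDefine = { ...moduleConfig.define, ...localConfig.define }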

How to wire this stuff up?

inputs

Create an object in each ScrapeStep that came from a module. The object should map full input keys to the module's internal keys. The internal keys are the ones actually used in the handlebars templates. E.g.

{
  '@community-scrapers/twitter-feed:username': 'username'
}
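
A minimal sketch of how that mapping could be applied before rendering templates; resolveModuleInputs and the names around it are hypothetical, not part of the current API:

function resolveModuleInputs(inputs, keyMap) {
  const resolved = {}
  for (const fullKey of Object.keys(inputs)) {
    // keys owned by a module are renamed to their internal form;
    // everything else passes through untouched
    const internalKey = fullKey in keyMap ? keyMap[fullKey] : fullKey
    resolved[internalKey] = inputs[fullKey]
  }
  return resolved
}

// yields { username: 'andykais' }, ready for a template like '{{ username }}'
resolveModuleInputs(
  { '@community-scrapers/twitter-feed:username': 'andykais' },
  { '@community-scrapers/twitter-feed:username': 'username' }
)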

scrape

Two options:

  1. Create a separate flow.ts instance for the module and hook it up to whatever sits above/below it.
  2. Crawl through a module scraper, find all empty scrapeEach arrays, and reattach the rest of the structure there (sketched below).
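
A rough sketch of option 2, assuming each step in a parsed config carries a scrapeEach array (the traversal itself is hypothetical):

function attachBelowModule(moduleStep, localSteps) {
  if (moduleStep.scrapeEach.length === 0) {
    // a leaf of the module scraper: hang the local structure here
    // (if the module has several leaves, they all share localSteps in this sketch)
    moduleStep.scrapeEach = localSteps
  } else {
    for (const child of moduleStep.scrapeEach) {
      attachBelowModule(child, localSteps)
    }
  }
  return moduleStep
}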

stateful values

There may be times when a local or module scraper produces a value that you want to keep for the rest of the run. Most often this will be an auth/access token.

define:
  'user-likes-page':
    download:
      urlTemplate: 'https://twitter.com/likes'
      headerTemplates:
        'x-twitter-access-token': '{{ accessToken }}'
    parse:
      selector: '.post a'
      attribute: 'href'
scrape:
  module: '@community-scrapers/twitter-login'
  valueAsInput: 'accessToken'
  forEach:
    - scraper: 'user-likes-page'

This is essentially global state: whenever '@community-scrapers/twitter-login' emits a value, we update the input value for 'accessToken' and replace the passed-down value with ''.
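
A minimal sketch of that update, with hypothetical names (sharedInputs stands in for whatever holds run-wide input values):

function handleModuleValue(sharedInputs, valueAsInputKey, value) {
  // e.g. sharedInputs.accessToken = '<token>', visible to later steps
  sharedInputs[valueAsInputKey] = value
  // child steps receive '' in place of the raw value
  return ''
}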

organizing dependencies

It is possible to keep module scrapers in a separate directory and run them from there using worker_threads.

mkdir scrape-pages-runners
cd scrape-pages-runners
npm init
npm i scrape-pages @community-scrapers/twitter-feed @community-scrapers/twitter-login

Your main Node.js process can run something like

const { Worker } = require('worker_threads')
const worker = new Worker('./scrape-pages-runners/worker.js', { workerData: { config, options } })
worker.on('message', ([event, data]) => console.log(event, data)) // wire up scraper events here
worker.on('exit', () => console.log('complete.'))
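
And a sketch of what ./scrape-pages-runners/worker.js could contain. startScraper is a hypothetical stand-in for however scrape-pages actually starts a run; the only load-bearing detail is the [event, data] message shape, which matches the handler above:

const { parentPort, workerData } = require('worker_threads')
const { EventEmitter } = require('events')
const { config, options } = workerData

function startScraper(config, options) {
  // hypothetical: the real call would come from the scrape-pages API and
  // return something event-emitter-like that reports scraper progress
  const emitter = new EventEmitter()
  setImmediate(() => emitter.emit('done'))
  return emitter
}

const emitter = startScraper(config, options)
// forward each scraper event to the main thread as an [event, data] pair
emitter.on('done', () => parentPort.postMessage(['done', null]))
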
andykais commented 5 years ago

For now, this is a back-burner issue. The biggest use case was to reuse login logic across different scrapers.

If I see a pressing reason I will implement it; until then, I will encourage the community to build full, independent configs & options.

sample consumable scraper:

example-scraper/
  package.json
  config.json
  options.json
  readme.md
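
Consuming a package laid out like that is just loading its files and passing them to scrape-pages yourself (names illustrative):

// the config/options shipped by the community scraper
const config = require('example-scraper/config.json')
const options = require('example-scraper/options.json')
// hand config and options to scrape-pages the same way you would your own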