andykais / scrape-pages

generalized scraper using a single instruction set for any site that can be statically scraped
https://scrape-pages.js.org
MIT License

Flatten config structure #33

Closed andykais closed 4 years ago

andykais commented 4 years ago

An alternative structure for the config: one that is flatter and less nested.

const config = {
  flow: [
    scraper1,
    scraper2,
    // ...
  ]
}
andykais commented 4 years ago

There's an important difference in the data flow. Before, data only flowed RIGHT and DOWN (scrapeEach array and scrapeEach). Now there is a third direction: "DIAGONAL-LEFT", i.e. data can hit a branch, then merge back into the downward stream. It's causing a pretty big change in the SQL ordering, so I want to know if it's worth it. The real-world use case is that a single type of data often needs to be downloaded/parsed from multiple places, e.g. traversing several different pages to get all the image links on a site. It could be important to be able to grab all those scrapers at once with the querier.

What I do like about this: it's a really intuitive syntax, easy to wrap your head around. E.g.

[
  {
    scraper: gallery,
    branch: [[pageA], [pageB]]
  },
  downloadImage
]
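
To make the branch-then-merge flow concrete, here is a rough sketch of how a flat flow like the one above could be expanded into parent -> child edges. `flowToEdges` is a hypothetical helper, not part of scrape-pages, and the scrapers are stood in for by plain strings. It assumes that every branch starts from the step that declares it and that the tails of all branches feed the next step in the flow.

```javascript
// Hypothetical helper (not part of scrape-pages): expands a flat `flow`
// array into [parent, child] edges. Scrapers are plain strings here.
function flowToEdges(flow, parents = [], edges = []) {
  // `parents` holds the steps whose output feeds the next step in the flow
  for (const step of flow) {
    const name = typeof step === 'string' ? step : step.scraper
    for (const parent of parents) edges.push([parent, name])
    parents = [name]
    if (typeof step !== 'string' && step.branch) {
      // Assumption: each branch starts at this step, and the last step of
      // every branch merges into whatever comes next in the outer flow.
      const tails = []
      for (const branch of step.branch) {
        tails.push(...flowToEdges(branch, [name], edges).tails)
      }
      parents = tails
    }
  }
  return { edges, tails: parents }
}

const flow = [
  { scraper: 'gallery', branch: [['pageA'], ['pageB']] },
  'downloadImage'
]
const { edges } = flowToEdges(flow)
console.log(edges)
```

Under those assumptions, downloadImage ends up with two incoming edges (one from pageA, one from pageB), which is the merge back into the downward stream.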

What I don't like about this: the SQL ordering isn't that smart. Before, I could assume that I was dealing with a tree data structure. Now I'm dealing with a one-way (directed) graph, and it's much harder to reason about order in that case, since a node such as downloadImage below can have more than one parent.

gallery - pageA
       \        \
        pageB -  downloadImage
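
One way to reason about order in a graph like this is a topological sort. The sketch below is only an illustration of that idea, not how scrape-pages orders its SQL. The edge list is hand-copied from the diagram above, and `topologicalOrder` is a hypothetical function name. It also reports the merge nodes (nodes with more than one parent), which are exactly what a tree could never contain.

```javascript
// Edges hand-copied from the diagram above: [parent, child]
const edges = [
  ['gallery', 'pageA'],
  ['gallery', 'pageB'],
  ['pageA', 'downloadImage'],
  ['pageB', 'downloadImage']
]

// Hypothetical helper: Kahn's algorithm over a list of [parent, child] edges
function topologicalOrder(edges) {
  const inDegree = new Map() // node -> number of parents
  const children = new Map() // node -> list of child nodes
  for (const [from, to] of edges) {
    if (!inDegree.has(from)) inDegree.set(from, 0)
    inDegree.set(to, (inDegree.get(to) || 0) + 1)
    if (!children.has(from)) children.set(from, [])
    children.get(from).push(to)
  }
  // In a tree every node has at most one parent, so this list would be empty
  const mergeNodes = [...inDegree].filter(([, n]) => n > 1).map(([name]) => name)

  const remaining = new Map(inDegree)
  const ready = [...remaining].filter(([, n]) => n === 0).map(([name]) => name)
  const order = []
  while (ready.length > 0) {
    const node = ready.shift()
    order.push(node)
    for (const child of children.get(node) || []) {
      // a child becomes ready only after all of its parents were emitted
      remaining.set(child, remaining.get(child) - 1)
      if (remaining.get(child) === 0) ready.push(child)
    }
  }
  return { order, mergeNodes }
}

const result = topologicalOrder(edges)
console.log(result.order, result.mergeNodes)
```

Note that pageA and pageB could be swapped and the order would still be valid, so the order is no longer unique the way a fixed tree traversal is.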