Restrict expressiveness of site adapters

I just had a nice chat with David Karger about Wildcard at the HCI research feedback lunch. He made a bunch of useful points about things to expand on in the Onward paper, but I found one of them particularly salient: restricting the expressiveness of site adapters. This is a topic that has come up before, but as we move towards a beta release and soliciting contributions of adapters, it's seeming increasingly important to discuss.

Why Javascript is problematic

Currently site adapters are written in Javascript (typed with Typescript). You write a single scrapePage function that returns all the data, and inside that function you can do whatever you want. In earlier versions of Wildcard I explored different more complicated APIs, but landed there for simplicity's sake.

In previous discussions we've briefly touched on the security concerns of having a community-sourced repository of scrapers that can execute arbitrary code. The tentative plan up to now was the following: have people contribute adapters back to the main Github repository, do centralized code review by the core developers, and then distribute adapters in the code along with the extension itself. That plan somewhat solves the security issue, but still has at least 3 remaining problems, in order of priority:

1) Burden on users: contributing back site adapters has a high barrier to entry -- you need to install the development build system locally, write Javascript, submit a Github PR, etc... as I was writing the site adapter creation guide docs, I started to get nervous about this. 2) Too many footguns: people have lots of room to mess up, especially if inexperienced programmers are writing adapters. It's harder for us to enforce patterns of building adapters that are robust. 3) Mediocre distribution mechanism: Centralized code review is a bottleneck and still doesn't provide airtight security. Only shipping new adapters with new versions of the extension code will require frequent releases and getting all users to upgrade. It would be much preferable to be able to distribute adapters dynamically, independent of extension code releases.

The obvious solution here is to move away from Javascript as the scraper language, to a more restrictive and declarative DSL / "configuration language" . This solves all the problems:

1) Easier to write an adapter -- you can open an adapter editor inside of Wildcard, save new adapters in some serialized format, and upload them to a website that collects people's adapters. Don't need to write JS. 2) You no longer have enough expressiveness to write certain kinds of bugs, and you can't write malware (assuming the DSL is well designed). 3) Distribution becomes way simpler: have some online collection of adapters; Wildcard can either download all the latest ones or you can download specific adapters.

Some drawbacks might be: a) harder to use for people who already know JS, b) providing a good editing experience might be more work, since we can't lean on Typescript types anymore and would need to do our own static verification of the adapter code

DSL design

OK, sounds great, but the tricky part is designing a DSL that can still usefully scrape sites with reasonable programmer ergonomics. Looking across the site adapters we have now, it seems clear that the basic building blocks of an HTML scraper are:

1) CSS selectors, for locating relevant DOM elements. 2) getting attributes of DOM elements 3) regex, for extracting substrings. Could also consider end-user-friendlier languages with regex-equivalent power, but regex is a universal standard, and there are some nice regex-generation tools. The framework could provide helpers for standard regex operations (eg extract number from string)

What else is currently used in adapters? A quick audit:

Math: The Amazon scraper does some math to sum up delivery costs. I think this is better done as a formula outside the scraper, to give the end user more control. More generally, I'm optimistic that we can push computation out of the scrapers and into formulas in the table. If a site adapter could generate a set of "raw columns" extracted from the page, and then provide a "derived view" using formulas in the table, giving the end user maximum flexibility.
Conditionals: used in several adapters for various purposes. I think if we had a feature for fallbacks (try to scrape X; if it's not there, scrape Y, ...) that would eliminate many of them, but not all.
Iteration: the Flux adapter uses a for loop because parts of the CSS selectors are dependent on the row and the column; could probably design a way to directly interpolate row+column numbers into css selectors without requiring iteration

A few other thoughts:

xpath: I'm not super familiar with it but it seems incredibly powerful, potentially the perfect existing language for providing most/all of these features in one package. I thought it was similar to CSS in power but it seems quite a bit more expressive.

AJAX responses: We're also starting to explore AJAX adapters that scrape from an AJAX JSON response. So there'd need to be another way besides CSS selectors to index into a data tree -- maybe xpath.

Other adapter attributes besides scraping: Adapters also define other attributes besides scraping: a name, a set of columns, when to activate based on URL, DOM events to trigger data reloads... but most of those are all pretty declarative already.

Syntax: I'm reluctant to design a syntax from scratch; embedding this in JSON seems most straightforward. Usually I prefer DSLs embedded in a Turing-complete language to provide the TC escape hatch if needed, but here that's precisely what we don't want.

Example

Here's a concrete example of how such a DSL might look in a simple case, Airbnb's search page.

First, the existing Javascript adapter:

'use strict';

import { urlContains, extractNumber } from "../utils"
import { createDomScrapingAdapter } from "./domScrapingBase"

const rowContainerClass = "_fhph4u"
const rowClass = "_8ssblpx"
const titleClass = "_1c2n35az"
const priceClass = "_1p7iugi"
const ratingClass = "_10fy1f8"
const listingLinkClass = "_i24ijs"

const AirbnbAdapter = createDomScrapingAdapter({
  name: "Airbnb",
  enabled: () => urlContains("airbnb.com/s"),
  attributes: [
  { name: "id", type: "text" },
  { name: "name", type: "text" },
  { name: "price", type: "numeric" },
  { name: "rating", type: "numeric" }
  ],
  scrapePage: () => {
    return Array.from(document.getElementsByClassName(rowClass)).map(el => {
      let path = el.querySelector("." + listingLinkClass).getAttribute('href')
      let id = path.match(/\/rooms\/([0-9]*)\?/)[1]

      return {
        id: id,
        rowElements: [el],
        dataValues: {
          name: el.querySelector(`.${titleClass}`),
          price: el.querySelector(`.${priceClass}`).textContent.match(/\$([\d]*)/)[1],
          rating: extractNumber(el.querySelector(`.${ratingClass}`))
        }
      }
    })
  }
});

export default AirbnbAdapter;

Then, the new adapter in our imagined DSL:

{
  "name": "Airbnb",
  "enabled": {
    "urlContains": "airbnb.com/s"
  },
  "attributes": [
    { "name": "id", "type": "text" },
    { "name": "name", "type": "text" },
    { "name": "price", "type": "numeric" },
    { "name": "rating", "type": "numeric" }
  ],
  // CSS class identifying each row.
  // (todo: consider cases like Hacker News where each row
  // is spread across multiple DOM elements)
  "rows": "_fhph4u",
  "id": {
     // from within the row, get element with this class...
     "querySelector": "._i24ijs",
     // extract this attribute from that element...
     "attribute": "href",
     // then run this regex and get the first match.
     // (getting the first match is just the default behavior)
     "extract": { "regex": "/\/rooms\/([0-9]*)\?/" }
   },
  "values": {
    "querySelector": "._1c2n35az",
    "price": {
      "css": "._1p7iugi",
      "extract": { "regex": "/\$([\d]*)/" }
    },
    "rating": {
      "css": "_10fy1f8",
      "extract": "number"
    }
  }
}

Visual editing

Eventually it would be good to have a visual environment where end users can generate scrapers via direct manipulation, and there are some existing tools for doing that. One nice thing about this DSL approach is that it should be an easier code generation target for such a tool.

Initially, to limit scope, I'm imagining that users would directly edit this DSL in text. (Although -- if there's an existing end user scraper creation tool that's really good, maybe we could bypass text editing entirely and just use that tool instead...)

Prior art

People have designed many DSLs and visual scraping products before. If one of them fits our purposes (and ideally, is popular) then that would be great.

The ideal option would have:

a clear, easily accessible artifact representing the scraping logic, which we could directly use or easily compile to our preferred format
an accessible visual editor
free, open source if possible
Helena
Pickaxe
Ferret
https://webscraper.io/
https://data-miner.io/

Some language design inspiration from the Huginn web scraping agent's scraping configuration:

          "extract": {
            "url": { "css": "#comic img", "value": "@src" },
            "title": { "css": "#comic img", "value": "@title" },
            "body_text": { "css": "div.main", "value": "string(.)" },
            "page_title": { "css": "title", "value": "string(.)", "repeat": true }
          }
      or
          "extract": {
            "url": { "xpath": "//*[@class='blog-item']/a/@href", "value": ".",
            "title": { "xpath": "//*[@class='blog-item']/a", "value": "normalize-space(.)" },
            "description": { "xpath": "//*[@class='blog-item']/div[0]", "value": "string(.)" }
          }

Next steps

Unfortunately removing expressiveness from existing programs is hard. If people were to start contributing Javascript adapters, it wouldn't always be easily possible to convert them to this less expressive form.

I'm tempted to say that we should think through this issue before doing the planned work of writing up a site adapter creation guide + soliciting scraper contributions. It would be ideal to have something about this in the Onward paper as well. Unfortunately this may not be a quick thing to resolve; DSL design is hard.

One helpful technique would be to approach this incrementally, by supporting both the new DSL and Javascript adapters. Start with a tiny DSL that can handle the simplest cases, migrate some existing adapters over, and then encourage that for new adapters. Some of our current adapters may need to stay in JS for now, and there may be new JS adapters still in the future, but as long as most adapters are in the simple format, that will still get us many of the benefits outlined above.

Below is my attempt at the start of a DSL. I think there are a lot of kinks to be worked out if we would continue down this path, but it essentially covers try's and sequential statements. I'm not sure how functions would be implemented. I'm also not sure how much more intuitive it is over Javascript, but would restrict expressiveness. This doesn't cover the reload or styling because that would require some more thought, but thinking about those exposes potential flaws in the design (setting variables or if statements perhaps, but maybe I need to think about them more). Anyways, this is just the idea I had last week more formalized.

const YoutubeAdapter = createDomScrapingAdapter({
    name: "YouTube",
    enabled: () => {
        return urlContains("youtube.com")
    },
    attributes: [
        { name: "id", type: "text", hidden: true },
        { name: "Title", type: "text" },
        { name: "Time", type: "text"},
        { name: "Uploader", type: "text"},
        { name: "% Watched", type: "numeric"}
    ],
    scrapePage: () => {
        let tableRows = document.querySelector('#contents').children;
        return Array.from(tableRows).map((el, index) => {
            let elAsHTMLElement : HTMLElement = <HTMLElement>el;

            if(el.querySelector('#video-title-link') !== null && el.querySelector('#overlays') != null && el.querySelector('#overlays').children[0] != null){

                let overlayChildrenAmount = el.querySelector('#overlays').children.length;
                let timeStampExists = overlayChildrenAmount > 1 && el.querySelector('#overlays').children[overlayChildrenAmount - 2].children[1] !== undefined;
                let timeStamp = timeStampExists
                    ? el.querySelector('#overlays').children[overlayChildrenAmount - 2].children[1].textContent.replace((/  |\r\n|\n|\r/gm),"")
                    : "N/A";
                let watchedPercentage = el.querySelector('#progress') !== null
                    ? progressToNumber((el.querySelector('#progress') as HTMLElement).style.width)
                    : 0;

                return {
                    rowElements: [elAsHTMLElement],
                    id: el.querySelector('#video-title-link').getAttribute("href"),
                    dataValues: {
                        Title: el.querySelector('#video-title'),
                        Time: timeStamp,
                        Uploader: el.querySelector('#text').children[0],
                        '% Watched': watchedPercentage,
                    },
                }
            }
            else
            {
                return null;
            }

        }).filter(el => el !== null)
    },
});

{
  "name": "YouTube",
  "enabled": {
    "urlContains": "youtube.com"
  },
  "attributes": [
    { "name": "id", "type": "text", hidden: true },
    { name: "Title", type: "text" },
    { name: "Time", type: "text"},
    { name: "Uploader", type: "text"},
    { name: "% Watched", type: "numeric"}
  ],
  //rows variable operates off of the document. All future variables work off of "el" from rows
  "rows": {
    //Try statement shows what should be done. Array shows the order in which things should occur
    //Even indices indicate the action with the next element specifying the parameters
    //getProperty is used when there are no parameters
    //So this would call el.querySelector("#contents").children
    "try": [
        "querySelector", "#contents",
        "getProperty", "children"
      ],
  },
  //Can specify additional variables as so, still under each row or "el"
  "overlayChildrenAmount": {
    "try": [
        "querySelector", "#overlays",
        "getProperty", "children",
        "getProperty", "length"
      ]
  },
  "id": {
     "try": [
        "querySelector", "#video-title-link",
        "getAttribute", "href"
       ]
   },
  "values": {
    "Title": {
      "try": [
          "querySelector", "#video-title"
        ]
    }
    "Time": {
      //Replace is an interesting case of this new form.
      //It has two parameters, so they are specified in an array
      "try": [
          "querySelector", "#overlays",
          "children", overlayChildrenAmount - 2,
          "children", 1,
          "getProperty", "textContent",
          "replace", [(/  |\r\n|\n|\r/gm), ""]
        ],
      //Catch statement is what is returned instead if the try fails.
      //If no catch specified, returns null if try fails
      "catch": "N/A"
    },
    "Uploader": {
      "try": [
          "querySelector", "#text",
          "children", 0
        ]
    },
    "% Watched": {
      "try": [
          "querySelector", "#progress",
          "getProperty", "style",
          "getProperty", "width"
        ],
      "catch": 0
    }
  }
}

geoffreylitt / wildcard