gajus / surgeon

Declarative DOM extraction expression evaluator. 👨‍⚕️

Surgeon


Declarative DOM extraction expression evaluator.

Powerful, succinct, composable, extendable, declarative API.

articles:
- select article {0,}
- body:
  - select .body
  - read property innerHTML
  imageUrl:
  - select img
  - read attribute src
  summary:
  - select ".body p:first-child"
  - read property innerHTML
  - format text
  title:
  - select .title
  - read property textContent
pageName:
- select .body
- read property innerHTML

Not succinct enough for you? Use aliases and the pipe operator (|) to shorten and concatenate the commands:

articles:
- sm article
- body: s .body | rp innerHTML
  imageUrl: s img | ra src
  summary: s .body p:first-child | rp innerHTML | f text
  title: s .title | rp textContent
pageName: s .body | rp innerHTML

Have you got suggestions for improvement? I am all ears.


Configuration

| Name | Type | Description | Default value |
| --- | --- | --- | --- |
| evaluator | EvaluatorType | HTML parser and selector engine. See evaluators. | browser evaluator if window and document variables are present, cheerio otherwise. |
| subroutines | $PropertyType<UserConfigurationType, 'subroutines'> | User defined subroutines. See subroutines. | N/A |

Evaluators

Subroutines use an evaluator to parse input (i.e. convert a string into an object) and to select nodes in the resulting document.

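The default evaluator is configured based on the user environment: the browser evaluator is used if the window and document variables are present, and the cheerio evaluator is used otherwise.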
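The environment check described in the configuration table could look roughly like the sketch below. `pickDefaultEvaluatorName` is a hypothetical helper written for illustration, not a Surgeon export, and it returns a name rather than an evaluator instance.

```javascript
// Hypothetical helper, not part of Surgeon's API.
// Decides which evaluator name to use based on the globals that are present.
const pickDefaultEvaluatorName = (globalObject) => {
  const hasWindow = typeof globalObject.window !== 'undefined';
  const hasDocument = typeof globalObject.document !== 'undefined';

  return hasWindow && hasDocument ? 'browser' : 'cheerio';
};

// A browser-like environment exposes both globals.
console.log(pickDefaultEvaluatorName({window: {}, document: {}})); // 'browser'

// A plain Node.js-like environment exposes neither.
console.log(pickDefaultEvaluatorName({})); // 'cheerio'
```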

Have a use case for another evaluator? Raise an issue.

For an example implementation of an evaluator, refer to:

browser evaluator

Uses native browser methods to parse the document and to evaluate CSS selector queries.

Use browser evaluator if you are running Surgeon in a browser or a headless browser (e.g. PhantomJS).

import {
  browserEvaluator
} from './evaluators';

surgeon({
  evaluator: browserEvaluator()
});

cheerio evaluator

Uses cheerio to parse the document and to evaluate CSS selector queries.

Use cheerio evaluator if you are running Surgeon in Node.js.

import {
  cheerioEvaluator
} from './evaluators';

surgeon({
  evaluator: cheerioEvaluator()
});

Subroutines

A subroutine is a function used to advance the DOM extraction expression evaluator, e.g.

x('foo | bar baz', 'qux');

In the above example, the Surgeon expression uses two subroutines: foo and bar.

The foo subroutine is invoked without additional values. The bar subroutine is executed with 1 value ("baz").

Subroutines are executed in the order in which they are defined: the result of each subroutine is passed on to the next one. The first subroutine receives the document input (in this case: the "qux" string).

Multiple subroutines can be written as an array. The following example is equivalent to the earlier example.

x([
  'foo',
  'bar baz'
], 'qux');
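To make the execution order concrete, here is a minimal sketch of a pipeline executor. It is not Surgeon's implementation: `createExecutor` and the toy `foo` and `bar` subroutines are assumptions made for illustration, and parameters are split on whitespace without any quote handling.

```javascript
// Hypothetical executor, written only to illustrate the execution order.
const createExecutor = (subroutines) => {
  return (expression, input) => {
    // Accept either a single expression or an array of expressions.
    const expressions = Array.isArray(expression) ? expression : [expression];

    return expressions
      .flatMap((item) => item.split('|'))
      .map((item) => item.trim().split(/\s+/))
      .reduce((subject, [name, ...values]) => {
        if (!subroutines[name]) {
          throw new Error('Unknown subroutine: ' + name);
        }

        // Each subroutine receives the result of the previous one.
        return subroutines[name](subject, values);
      }, input);
  };
};

const x = createExecutor({
  foo: (subject) => subject.toUpperCase(),
  bar: (subject, [suffix]) => subject + '-' + suffix
});

console.log(x('foo | bar baz', 'qux')); // 'QUX-baz'
console.log(x(['foo', 'bar baz'], 'qux')); // 'QUX-baz'
```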

There are two types of subroutines: built-in subroutines and user-defined subroutines.

Note:

These functions are called subroutines to emphasise the cross-platform nature of the declarative API.

Built-in subroutines

The following subroutines are available out of the box.

append subroutine

append appends a string to the input string.

| Parameter name | Description | Default |
| --- | --- | --- |
| tail | Appends a string to the end of the input string. | N/A |

Examples:

// Assuming an element <a href='http://foo' />,
// then the result is 'http://foo/bar'.
x(`select a | read attribute href | append '/bar'`);

closest subroutine

closest subroutine iterates through all the preceding nodes (including parent nodes) searching for either a preceding node matching the selector expression or a descendant of the preceding node matching the selector.

Note: This is different from the jQuery .closest() in that the latter method does not search for parent descendants matching the selector.

| Parameter name | Description | Default |
| --- | --- | --- |
| CSS selector | CSS selector used to select an element. | N/A |

constant subroutine

constant returns the parameter value regardless of the input.

| Parameter name | Description | Default |
| --- | --- | --- |
| constant | Constant value that will be returned as the result. | N/A |

format subroutine

format is used to format input using sprintf.

| Parameter name | Description | Default |
| --- | --- | --- |
| format | sprintf format used to format the input string. The subroutine input is the first argument, i.e. %1$s. | %1$s |

Examples:

// Reads the href attribute value of the matching element.
// Prefixes the value with 'http://foo.com'.
x(`select a | read attribute href | format 'http://foo.com%1$s'`);
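The positional placeholder %1$s refers to the first argument, which for this subroutine is the input. The sketch below illustrates that substitution with a tiny stand-in formatter. `formatString` is hypothetical and supports only %s and %N$s; it is not the sprintf implementation Surgeon uses.

```javascript
// Hypothetical stand-in for sprintf; handles only %s and positional %N$s.
const formatString = (format, ...values) => {
  let nextIndex = 0;

  return format.replace(/%(?:(\d+)\$)?s/g, (token, position) => {
    // %N$s is 1-based; a bare %s consumes the arguments in order.
    const index = position ? Number(position) - 1 : nextIndex++;

    return String(values[index]);
  });
};

// The subroutine input ('/bar') takes the place of %1$s.
console.log(formatString('http://foo.com%1$s', '/bar')); // 'http://foo.com/bar'

// With the default format the input is returned unchanged.
console.log(formatString('%1$s', '/bar')); // '/bar'
```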

match subroutine

match is used to extract matching capturing groups from the subject input.

| Parameter name | Description | Default |
| --- | --- | --- |
| Regular expression | Regular expression used to match capturing groups in the string. | N/A |
| Sprintf format | sprintf format used to construct a string using the matching capturing groups. | %s |

Examples:

// Extracts 1 matching capturing group from the input string.
// Throws `InvalidDataError` if the value does not pass the test.
x('select .foo | read property textContent | match "/input: (\d+)/"');

// Extracts 2 matching capturing groups from the input string and formats the output using sprintf.
// Throws `InvalidDataError` if the value does not pass the test.
x('select .foo | read property textContent | match "/input: (\d+)-(\d+)/" %2$s-%1$s');
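The way capturing groups feed the sprintf format can be illustrated in plain JavaScript. The sketch below is an approximation: `matchSubject` is a hypothetical function, it takes a RegExp object rather than a string, supports only %s and %N$s, and throws a plain Error where the comments above describe InvalidDataError.

```javascript
// Hypothetical model of the match subroutine, not Surgeon's implementation.
const matchSubject = (subject, pattern, format = '%s') => {
  const result = subject.match(pattern);

  if (!result) {
    throw new Error('Input does not match ' + pattern + '.');
  }

  // Capturing groups become the arguments of the format.
  const groups = result.slice(1);
  let nextIndex = 0;

  return format.replace(/%(?:(\d+)\$)?s/g, (token, position) => {
    const index = position ? Number(position) - 1 : nextIndex++;

    return groups[index];
  });
};

console.log(matchSubject('input: 123', /input: (\d+)/)); // '123'
console.log(matchSubject('input: 12-34', /input: (\d+)-(\d+)/, '%2$s-%1$s')); // '34-12'
```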

nextUntil subroutine

nextUntil subroutine is used to select all following siblings of each element up to but not including the element matched by the selector.

| Parameter name | Description | Default |
| --- | --- | --- |
| selector expression | A string containing a selector expression to indicate where to stop matching following sibling elements. | N/A |
| filter expression | A string containing a selector expression to match elements against. | |
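The roles of the two parameters can be modelled on a plain array of sibling objects. This is only a sketch: `nextUntil` below is a hypothetical function that takes predicate functions in place of selector expressions, and it does not use an evaluator.

```javascript
// Hypothetical model: siblings are plain objects, selectors are predicates.
const nextUntil = (siblings, startIndex, isStop, isIncluded = () => true) => {
  const result = [];

  for (const node of siblings.slice(startIndex + 1)) {
    // Stop at (and exclude) the first sibling matching the stop condition.
    if (isStop(node)) {
      break;
    }

    // The filter narrows the result without stopping the traversal.
    if (isIncluded(node)) {
      result.push(node);
    }
  }

  return result;
};

const siblings = [
  {tag: 'h2', text: 'A'},
  {tag: 'p', text: 'one'},
  {tag: 'div', text: 'two'},
  {tag: 'p', text: 'three'},
  {tag: 'h2', text: 'B'},
  {tag: 'p', text: 'four'}
];

const paragraphs = nextUntil(
  siblings,
  0,
  (node) => node.tag === 'h2',
  (node) => node.tag === 'p'
);

console.log(paragraphs.map((node) => node.text)); // ['one', 'three']
```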

prepend subroutine

prepend prepends a string to the input string.

| Parameter name | Description | Default |
| --- | --- | --- |
| head | Prepends a string to the start of the input string. | N/A |

Examples:

// Assuming an element <a href='//foo' />,
// then the result is 'http://foo'.
x(`select a | read attribute href | prepend 'http:'`);

previous subroutine

previous subroutine selects the preceding sibling.

| Parameter name | Description | Default |
| --- | --- | --- |
| CSS selector | CSS selector used to select an element. | N/A |

Example:

<ul>
  <li>foo</li>
  <li class='bar'></li>
</ul>
x('select .bar | previous | read property textContent');
// 'foo'

read subroutine

read is used to extract a value from the matching element using an evaluator.

| Parameter name | Description | Default |
| --- | --- | --- |
| Target type | Possible values: "attribute" or "property" | N/A |
| Target name | Depending on the target type, name of an attribute or a property. | N/A |

Examples:

// Returns .foo element "href" attribute value.
// Throws error if attribute does not exist.
x('select .foo | read attribute href');

// Returns an array of "href" attribute values of the matching elements.
// Throws error if attribute does not exist on any of the matching elements.
x('select .foo {0,} | read attribute href');

// Returns .foo element "textContent" property value.
// Throws error if property does not exist.
x('select .foo | read property textContent');

remove subroutine

remove subroutine is used to remove elements from the document using an evaluator.

remove subroutine accepts the same parameters as the select subroutine.

The result of remove subroutine is the input of the subroutine, i.e. previous select subroutine result.

| Parameter name | Description | Default |
| --- | --- | --- |
| CSS selector | CSS selector used to select an element. | N/A |
| Quantifier expression | A quantifier expression is used to control the expected result length. See quantifier expression. | |

Examples:

// Returns 'bar'.
x('select .foo | remove span | read property textContent', `<div class='foo'>bar<span>baz</span></div>`);

select subroutine

select subroutine is used to select the elements in the document using an evaluator.

| Parameter name | Description | Default |
| --- | --- | --- |
| CSS selector | CSS selector used to select an element. | N/A |
| Quantifier expression | A quantifier expression is used to control the shape of the results (direct result or array of results) and the expected result length. See quantifier expression. | |

Quantifier expression

A quantifier expression is used to assert that the query matches a set number of nodes. A quantifier expression is a modifier of the select subroutine.

A quantifier expression is defined using the following syntax.

| Name | Syntax |
| --- | --- |
| Fixed quantifier | {n} where n is an integer >= 1 |
| Greedy quantifier | {n,m} where n >= 0 and m >= n |
| Greedy quantifier | {n,} where n >= 0 |
| Greedy quantifier | {,m} where m >= 1 |

A node selector [i] can be appended to a quantifier expression, e.g. {0,}[0]. The node selector returns the node at the zero-based index i of the result set, so [0] returns the first node.

If this looks familiar, it's because I have adopted the syntax from the regular expression language. However, unlike in a regular expression, a quantifier in the context of a Surgeon selector will produce an error (SelectSubroutineUnexpectedResultCountError) if the selector result length is out of the quantifier range.

Examples:

// Selects 0 or more nodes.
// Result is an array.
x('select .foo {0,}');

// Selects 1 or more nodes.
// Throws an error if 0 matches found.
// Result is an array.
x('select .foo {1,}');

// Selects between 0 and 5 nodes.
// Throws an error if more than 5 matches found.
// Result is an array.
x('select .foo {0,5}');

// Selects 0 or more nodes.
// Result is the first match in the result set (or `null`).
x('select .foo {0,}[0]');
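The quantifier syntax and its range check can be sketched in a few lines. This is an illustration under assumptions: `parseQuantifier` and `applyQuantifier` are hypothetical helpers, a plain Error stands in for SelectSubroutineUnexpectedResultCountError, and the sketch always returns an array unless a node selector is given, which may differ from how Surgeon shapes the result of a fixed quantifier.

```javascript
// Hypothetical parser for {n}, {n,m}, {n,}, {,m} with an optional [i] suffix.
const parseQuantifier = (expression) => {
  const match = /^\{(\d*)(,?)(\d*)\}(?:\[(\d+)\])?$/.exec(expression);

  if (!match || (match[1] === '' && match[3] === '')) {
    throw new Error('Invalid quantifier expression: ' + expression);
  }

  const [, min, comma, max, index] = match;

  return {
    min: min === '' ? 0 : Number(min),
    // Without a comma the quantifier is fixed, i.e. max equals min.
    max: comma ? (max === '' ? Infinity : Number(max)) : Number(min),
    index: index === undefined ? null : Number(index)
  };
};

// Asserts the result length, then optionally picks a single node.
const applyQuantifier = (nodes, expression) => {
  const {min, max, index} = parseQuantifier(expression);

  if (nodes.length < min || nodes.length > max) {
    throw new Error('Expected ' + min + ' to ' + max + ' nodes, found ' + nodes.length + '.');
  }

  if (index === null) {
    return nodes;
  }

  return index < nodes.length ? nodes[index] : null;
};

console.log(applyQuantifier(['a', 'b'], '{0,}')); // ['a', 'b']
console.log(applyQuantifier(['a', 'b'], '{0,}[0]')); // 'a'
console.log(applyQuantifier([], '{0,}[0]')); // null
```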

test subroutine

test is used to validate the current value using a regular expression.

| Parameter name | Description | Default |
| --- | --- | --- |
| Regular expression | Regular expression used to test the value. | N/A |

Examples:

// Validates that .foo element textContent property value matches /bar/ regular expression.
// Throws `InvalidDataError` if the value does not pass the test.
x('select .foo | read property textContent | test /bar/');
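Because the regular expression arrives as a string parameter, it has to be converted into a RegExp before the value can be tested. The sketch below shows one way this could work. `parseRegex` and `testValue` are hypothetical names, a plain Error stands in for InvalidDataError, and returning the subject unchanged on success is an assumption.

```javascript
// Hypothetical conversion of a '/pattern/flags' string into a RegExp.
const parseRegex = (literal) => {
  const match = /^\/(.*)\/([a-z]*)$/.exec(literal);

  if (!match) {
    throw new Error('Invalid regular expression: ' + literal);
  }

  return new RegExp(match[1], match[2]);
};

// Hypothetical model of the test subroutine.
const testValue = (subject, [literal]) => {
  if (!parseRegex(literal).test(subject)) {
    throw new Error('Input does not match ' + literal + '.');
  }

  // Assumption: a passing value is handed on to the next subroutine as-is.
  return subject;
};

console.log(testValue('bar', ['/bar/'])); // 'bar'
console.log(testValue('BAR', ['/^[a-z]{3}$/i'])); // 'BAR'
```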

See error handling for more information and usage examples of the test subroutine.

User-defined subroutines

Custom subroutines can be defined using subroutines configuration.

A subroutine is a function. A subroutine function is invoked with the following parameters:

Parameter name
An instance of [Evaluator].
Current value, i.e. value used to query Surgeon or value returned from the previous (or ancestor) subroutine.
An array of values used when referencing the subroutine in an expression.

Example:

const x = surgeon({
  subroutines: {
    mySubroutine: (currentValue, [firstParameterValue, secondParameterValue]) => {
      console.log(currentValue, firstParameterValue, secondParameterValue);

      return parseInt(currentValue, 10) + 1;
    }
  }
});

x('mySubroutine foo bar | mySubroutine baz qux', 0);

The above example prints:

0 "foo" "bar"
1 "baz" "qux"

For more examples of defining subroutines, refer to:

Inline subroutines

Custom subroutines can be inlined into pianola instructions, e.g.

x(
  [
    'foo',
    (subject) => {
      // `subject` is the return value of `foo` subroutine.

      return 'bar';
    },
    'baz',
  ],
  'qux'
);

Built-in subroutine aliases

Surgeon exports an alias preset that is used to reduce the verbosity of queries.

| Name | Description |
| --- | --- |
| ra ... | Reads Element attribute value. Equivalent to read attribute ... |
| rdtc ... | Removes any descending elements and reads the resulting textContent property of an element. Equivalent to remove * {0,} \| read property ... textContent |
| rih ... | Reads innerHTML property of an element. Equivalent to read property ... innerHTML |
| roh ... | Reads outerHTML property of an element. Equivalent to read property ... outerHTML |
| rp ... | Reads Element property value. Equivalent to read property ... |
| rtc ... | Reads textContent property of an element. Equivalent to read property ... textContent |
| sa ... | Select any (sa). Selects multiple elements (0 or more). Returns array. Equivalent to select "..." {0,} |
| saf ... | Select any first (saf). Selects multiple elements (0 or more). Returns single result or null. Equivalent to select "..." {0,}[0] |
| sm ... | Select many (sm). Selects multiple elements (1 or more). Returns array. Equivalent to select "..." {1,} |
| smo ... | Select maybe one (smo). Selects one element. Returns single result or null. Equivalent to select "..." {0,1}[0] |
| so ... | Select one (so). Selects a single element. Returns single result. Equivalent to select "..." {1}[0]. |
| t {name} | Tests value. Equivalent to test ... |

Note regarding s ... alias. The CSS selector value is quoted. Therefore, you can write a CSS selector that includes spaces without putting the value in the quotes, e.g. s .foo .bar is equivalent to select ".foo .bar" {1}.

Other alias values are not quoted. Therefore, if value includes a space it must be quoted, e.g. t "/foo bar/".

Usage:

import surgeon, {
  subroutineAliasPreset
} from 'surgeon';

const x = surgeon({
  subroutines: {
    ...subroutineAliasPreset
  }
});

x('s .foo .bar | t "/foo bar/"');

In addition to the built-in aliases, users can declare their own subroutine aliases.

Expression reference

Surgeon subroutines are referenced using expressions.

An expression is defined using the following pseudo-grammar:

subroutines ->
    subroutines _ "|" _ subroutine
  | subroutine

subroutine ->
    subroutineName " " parameters
  | subroutineName

subroutineName ->
  [a-zA-Z0-9\-_]:+

parameters ->
    parameters " " parameter
  | parameter

Example:

x('foo bar baz', 'qux');

In this example, the Surgeon query executor (x) is invoked with the foo bar baz expression and the qux starting value. The expression tells the query executor to run the foo subroutine with parameter values "bar" and "baz" and subject value "qux".
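The pseudo-grammar above can be approximated with a small hand-written tokenizer. The sketch below is not the parser Surgeon uses: `parseExpression` is a hypothetical function, the quote handling (single or double quotes, no escape sequences) is an assumption, and subroutine names are not validated against the character class in the grammar.

```javascript
// Hypothetical tokenizer that turns an expression into subroutine instructions.
const parseExpression = (expression) => {
  const instructions = [];
  let tokens = [];
  let token = '';
  let hasToken = false;
  let quote = null;

  const pushToken = () => {
    if (hasToken) {
      tokens.push(token);
    }

    token = '';
    hasToken = false;
  };

  const pushInstruction = () => {
    pushToken();

    if (tokens.length === 0) {
      throw new Error('Empty subroutine in expression.');
    }

    const [subroutine, ...values] = tokens;

    instructions.push({subroutine, values});
    tokens = [];
  };

  for (const character of expression) {
    if (quote) {
      // Inside quotes every character, including spaces and pipes, is literal.
      if (character === quote) {
        quote = null;
      } else {
        token += character;
      }
    } else if (character === '"' || character === "'") {
      quote = character;
      hasToken = true;
    } else if (character === ' ') {
      pushToken();
    } else if (character === '|') {
      pushInstruction();
    } else {
      token += character;
      hasToken = true;
    }
  }

  if (quote) {
    throw new Error('Unterminated quote in expression.');
  }

  pushInstruction();

  return instructions;
};

console.log(parseExpression('foo bar baz | corge "grault garply"'));
// [
//   {subroutine: 'foo', values: ['bar', 'baz']},
//   {subroutine: 'corge', values: ['grault garply']}
// ]
```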

Multiple subroutines can be combined using an array:

x([
  'foo bar baz',
  'corge grault garply'
], 'qux');

In this example, the Surgeon query executor (x) is invoked with two expressions (foo bar baz and corge grault garply). The first subroutine is executed with the subject value "qux". The second subroutine is executed with the result of the first subroutine.

The result of the query is the result of the last subroutine.

Read user-defined subroutines documentation for broader explanation of the role of the parameter values and the subject value.

The pipe operator (|)

Multiple subroutines can be combined using the pipe operator.

The following examples are equivalent:

x([
  'foo bar baz',
  'qux quux quuz'
]);

x([
  'foo bar baz | qux quux quuz'
]);

x('foo bar baz | qux quux quuz');

Cookbook

Unless redefined, all examples assume the following initialisation:

import surgeon from 'surgeon';

/**
 * @param configuration {@see https://github.com/gajus/surgeon#configuration}
 */
const x = surgeon();

Extract a single node

Use select subroutine and read subroutine to extract a single value.

const subject = `
  <div class="title">foo</div>
`;

x('select .title | read property textContent', subject);

// 'foo'

Extract multiple nodes

Specify select subroutine quantifier to match multiple results.

const subject = `
  <div class="foo">bar</div>
  <div class="foo">baz</div>
  <div class="foo">qux</div>
`;

x('select .foo {0,} | read property textContent', subject);

// [
//   'bar',
//   'baz',
//   'qux'
// ]

Name results

Use a QueryChildrenType object to name the results of the descending expressions.

const subject = `
  <article>
    <div class='title'>foo title</div>
    <div class='body'>foo body</div>
  </article>
  <article>
    <div class='title'>bar title</div>
    <div class='body'>bar body</div>
  </article>
`;

x([
  'select article {0,}',
  {
    body: 'select .body | read property textContent',
    title: 'select .title | read property textContent'
  }
], subject);

// [
//   {
//     body: 'foo body',
//     title: 'foo title'
//   },
//   {
//     body: 'bar body',
//     title: 'bar title'
//   }
// ]
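The way an object names the results of its child expressions, once per item of the preceding result, can be modelled without a DOM. In the sketch below, `runQuery` and the `pick` and `upper` subroutines are hypothetical, the subject is a plain object instead of an HTML document, and the pipe operator is not supported.

```javascript
// Hypothetical query runner operating on plain objects instead of DOM nodes.
const runQuery = (subroutines, query, subject) => {
  // An array is a sequence of steps; each step receives the previous result.
  if (Array.isArray(query)) {
    return query.reduce((current, step) => runQuery(subroutines, step, current), subject);
  }

  // A string names one subroutine followed by its parameter values.
  if (typeof query === 'string') {
    const [name, ...values] = query.trim().split(/\s+/);

    return subroutines[name](subject, values);
  }

  // An object names the results of its children; a list subject is mapped per item.
  if (Array.isArray(subject)) {
    return subject.map((item) => runQuery(subroutines, query, item));
  }

  return Object.fromEntries(
    Object.entries(query).map(([name, child]) => [name, runQuery(subroutines, child, subject)])
  );
};

const subroutines = {
  pick: (subject, [key]) => subject[key],
  upper: (subject) => subject.toUpperCase()
};

const page = {
  articles: [
    {body: 'foo body', title: 'foo title'},
    {body: 'bar body', title: 'bar title'}
  ]
};

const result = runQuery(subroutines, [
  'pick articles',
  {
    body: 'pick body',
    title: ['pick title', 'upper']
  }
], page);

console.log(result);
// [
//   {body: 'foo body', title: 'FOO TITLE'},
//   {body: 'bar body', title: 'BAR TITLE'}
// ]
```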

Validate the results using RegExp

Use test subroutine to validate the results.

const subject = `
  <div class="foo">bar</div>
  <div class="foo">baz</div>
  <div class="foo">qux</div>
`;

x('select .foo {0,} | read property textContent | test /^[a-z]{3}$/', subject);

See error handling for information on how to handle test subroutine errors.

Validate the results using a user-defined test function

Define a custom subroutine to validate results using arbitrary logic.

Use InvalidValueSentinel to leverage standardised Surgeon error handler (see error handling). Otherwise, simply throw an error.

import surgeon, {
  InvalidValueSentinel
} from 'surgeon';

const x = surgeon({
  subroutines: {
    isRed: (value) => {
      if (value === 'red') {
        return value;
      }

      return new InvalidValueSentinel('Unexpected color.');
    }
  }
});

Declare subroutine aliases

As you become familiar with the query execution mechanism, typing long expressions (such as select, read attribute and read property) becomes a mundane task.

Remember that subroutines are regular functions: you can partially apply and use the partially applied functions to create new subroutines.

Example:

import surgeon, {
  readSubroutine,
  selectSubroutine,
  testSubroutine
} from 'surgeon';

const x = surgeon({
  subroutines: {
    ra: (subject, values, bindle) => {
      return readSubroutine(subject, ['attribute'].concat(values), bindle);
    },
    rp: (subject, values, bindle) => {
      return readSubroutine(subject, ['property'].concat(values), bindle);
    },
    s: (subject, values, bindle) => {
      return selectSubroutine(subject, [values.join(' '), '{1}'], bindle);
    },
    sm: (subject, values, bindle) => {
      return selectSubroutine(subject, [values.join(' '), '{1,}'], bindle);
    },
    t: testSubroutine
  }
});

Now, instead of writing:

articles:
- select article {1,}
- body:
  - select .body
  - read property innerHTML

You can write:

articles:
- sm article
- body:
  - s .body
  - rp innerHTML

The aliases used in this example are available in the aliases preset (read built-in subroutine aliases).

Error handling

Surgeon throws the following errors to indicate a predictable error state. All Surgeon errors can be imported. Use instanceof operator to determine the error type.

Note:

Surgeon errors are non-recoverable, i.e. a selector cannot proceed if it encounters an error. This design ensures that your selectors are capturing the expected data.

| Name | Description |
| --- | --- |
| ReadSubroutineNotFoundError | Thrown when an attempt is made to retrieve a non-existent attribute or property. |
| SelectSubroutineUnexpectedResultCountError | Thrown when a select subroutine result length does not match the quantifier expression. |
| InvalidDataError | Thrown when a subroutine returns an instance of InvalidValueSentinel. |
| SurgeonError | A generic error. All other Surgeon errors extend from SurgeonError. |

Example:

import {
  InvalidDataError
} from 'surgeon';

const subject = `
  <div class="foo">bar</div>
`;

try {
  x('select .foo | read property textContent | test /bar/', subject);
} catch (error) {
  if (error instanceof InvalidDataError) {
    // Handle data validation error.
  } else {
    throw error;
  }
}

Return InvalidValueSentinel from a subroutine to force Surgeon to throw an InvalidDataError.
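The relationship between a returned sentinel and a thrown error can be sketched as follows. The classes and the `runSubroutine` helper below are local stand-ins written for illustration; they are not Surgeon's exports, and the real constructors and error properties may differ.

```javascript
// Local stand-ins; Surgeon's own InvalidValueSentinel and InvalidDataError are not used here.
class LocalInvalidValueSentinel {
  constructor (message) {
    this.message = message;
  }
}

class LocalInvalidDataError extends Error {
  constructor (subject, sentinel) {
    super(sentinel.message);

    this.name = 'LocalInvalidDataError';
    this.subject = subject;
  }
}

// Hypothetical wrapper: converts a returned sentinel into a thrown error.
const runSubroutine = (subroutine, subject, values = []) => {
  const result = subroutine(subject, values);

  if (result instanceof LocalInvalidValueSentinel) {
    throw new LocalInvalidDataError(subject, result);
  }

  return result;
};

const isRed = (value) => {
  if (value === 'red') {
    return value;
  }

  return new LocalInvalidValueSentinel('Unexpected color.');
};

console.log(runSubroutine(isRed, 'red')); // 'red'

try {
  runSubroutine(isRed, 'blue');
} catch (error) {
  if (error instanceof LocalInvalidDataError) {
    console.log(error.message); // 'Unexpected color.'
  } else {
    throw error;
  }
}
```

Returning a sentinel instead of throwing keeps the validation subroutine free of error-construction details, while the caller still sees one consistent error type.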

Debugging

Surgeon uses roarr to log debugging information.

Export the ROARR_LOG=TRUE environment variable to enable the Surgeon debug log.