denkan / cheerio-json-mapper

MIT License

Cheerio JSON Mapper

Extract HTML markup to JSON using Cheerio.



Install

# npm
npm i -S cheerio-json-mapper

# yarn
yarn add cheerio-json-mapper

Usage

import { cheerioJsonMapper } from 'cheerio-json-mapper';

const html = `
    <article>
        <h1>My headline</h1>
        <div class="content">
            <p>My article text.</p>
        </div>
        <div class="author">
            <a href="mailto:john.doe@example.com">John Doe</a>
        </div>
    </article>
`;

const template = {
  headline: 'article > h1',
  articleText: 'article > .content',
  author: {
    $: 'article > .author',
    name: '> a',
    email: '> a | attr:href | substr:7',
  },
};

const result = await cheerioJsonMapper(html, template);
console.log(result);
// output:
// {
//     headline: "My headline",
//     articleText: "My article text.",
//     author: {
//         name: "John Doe",
//         email: "john.doe@example.com"
//     }
// }

More examples are found in the repo's tests/cases folder.

Core concepts

End-Result Structure First

The main approach is to start from what we need to retrieve: define the end structure, then tell each property which selector to use to get its value.

Hard-coded values (literals)

We can put hard-coded values into the structure by wrapping strings in double or single quotes. Numbers and booleans are automatically detected as literals:

{
  "headline": "article > h1",
  "public": true,
  "copyright": "'© Copyright Us Inc. 2023'",
  "version": 1.23
}
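The rule above can be sketched as a small classifier. This is only an illustration of the documented behavior, not the library's actual code; `classifyTemplateValue` and its return shape are hypothetical names introduced here.

```typescript
// Hypothetical helper: decides whether a template value is a literal or a selector.
// Assumption: only real numbers/booleans and quote-wrapped strings count as literals.
type TemplateValue = string | number | boolean;

type Classified =
  | { kind: 'literal'; value: string | number | boolean }
  | { kind: 'selector'; selector: string };

function classifyTemplateValue(value: TemplateValue): Classified {
  // Numbers and booleans are literals as-is.
  if (typeof value === 'number' || typeof value === 'boolean') {
    return { kind: 'literal', value };
  }

  const trimmed = value.trim();
  const first = trimmed.charAt(0);
  const isQuoted =
    trimmed.length >= 2 &&
    (first === '"' || first === "'") &&
    trimmed.endsWith(first);

  // A string wrapped in matching quotes is a literal; strip the quotes.
  if (isQuoted) {
    return { kind: 'literal', value: trimmed.slice(1, -1) };
  }

  // Anything else is treated as a selector.
  return { kind: 'selector', selector: trimmed };
}

const examples: TemplateValue[] = [
  'article > h1',
  true,
  "'© Copyright Us Inc. 2023'",
  1.23,
];
for (const example of examples) {
  console.log(example, '=>', classifyTemplateValue(example));
}
```

Only the first entry is classified as a selector; the other three come back as literals, with the wrapping quotes removed from the copyright string.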

Scoping

Large documents with nested parts tend to require big and ugly selectors. To simplify things, we can scope an object to only care for a certain selected part.

Add a $ property with a selector to narrow down what the rest of the object uses as its base.

Example:

<article>
  <h1>My headline</h1>
  <div class="content">
    <p>My article text.</p>
  </div>
  <div class="author">
    <span class="name">John Doe</span>
    <span class="tel">555-1234</span>
    <a href="mailto:john.doe@example.com">John Doe</a>
  </div>
  <div class="other">
    <span class="name">This won't be selected due to scoping</span>
  </div>
</article>

const template = {
  $: 'article',
  headline: '> h1',
  articleText: '> .content',
  author: {
    $: '> .author',
    name: 'span.name',
    telephone: 'span.tel',
    email: 'a[href^=mailto:] | attr:href | substr:7',
  },
};
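One way to reason about a scoped template is to write out the flat selector each property is conceptually equivalent to. The sketch below does that by joining scope selectors with the property selector. It is a mental model only: how the library resolves scopes internally is not documented here, and `resolveSelector` is a hypothetical helper.

```typescript
// Hypothetical helper: builds the flat selector a scoped property corresponds to.
// `scopes` lists the $ selectors from the outermost object to the innermost one.
function resolveSelector(scopes: string[], selector: string): string {
  // '$' means "reuse the scope itself" (see the self-selector concept).
  if (selector === '$') {
    return scopes.join(' ');
  }
  return [...scopes, selector].join(' ').trim();
}

console.log(resolveSelector(['article'], '> h1'));
// article > h1
console.log(resolveSelector(['article', '> .author'], 'span.name'));
// article > .author span.name
console.log(resolveSelector(['article', '> .author'], '$'));
// article > .author
```

Because author is scoped to `> .author`, its `span.name` can never match the span inside `.other`.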

Self-selector

In some cases we want to reuse the object selector ($) as a property selector. This is especially handy when targeting lists, as in this case:

const html = `
  <ul>
    <li>One</li>
    <li>Two</li>
    <li>Three</li>
  </ul>
`;
const template = [
  {
    $: 'ul > li',
    value: '$', // uses `ul > li` as property selector
  },
];
const result = await cheerioJsonMapper(html, template);
console.log(result);
// Output:
// [
//   { value: 'One' },
//   { value: 'Two' },
//   { value: 'Three' }
// ];

Note: Don't like the $ name for the scope selector? Change it through options: cheerioJsonMapper(html, template, { scopeProp: '__scope' }).

Pipes

Sometimes the text content of a selected node is not what we need, or not enough. Pipes to the rescue!

Pipes are functions applied to a value, either the result of a property selector or a whole object. Use pipes to handle any custom needs.

Multiple pipes are supported (separated by the | character) and run in sequence. The value returned from one pipe is passed to the next, allowing us to chain functionality (much like *nix terminal pipes, which inspired this syntax).

Pipes can take basic arguments: add a colon (:) after the pipe name, followed by semicolon (;) separated values.

Pipes can be asynchronous.
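To make the syntax concrete, here is a minimal sketch of how such an expression could be split into a selector and a list of pipes, and how the pipes could then run in sequence with async support. It is not the library's parser: `parsePipe`, `parseExpression`, `runPipes` and the two demo pipes are hypothetical stand-ins, and the naive split on | would break on selectors that contain that character themselves.

```typescript
// Shape of a pipe function, following the ({ value, args }) style used in this README.
type PipeInput = { value: unknown; args: string[] };
type PipeFn = (input: PipeInput) => unknown | Promise<unknown>;
type ParsedPipe = { name: string; args: string[] };

// 'requiredProps:name;email' -> { name: 'requiredProps', args: ['name', 'email'] }
function parsePipe(part: string): ParsedPipe {
  const colon = part.indexOf(':');
  if (colon === -1) {
    return { name: part.trim(), args: [] };
  }
  const name = part.slice(0, colon).trim();
  const args = part.slice(colon + 1).split(';').map((arg) => arg.trim());
  return { name, args };
}

// 'selector | pipeA:x | pipeB' -> selector plus parsed pipes, in order.
function parseExpression(expression: string): { selector: string; pipes: ParsedPipe[] } {
  const parts = expression.split('|').map((part) => part.trim());
  const selector = parts[0] ?? '';
  return { selector, pipes: parts.slice(1).map(parsePipe) };
}

// Runs pipes one after another; each pipe receives the previous pipe's result.
async function runPipes(
  initial: unknown,
  pipes: ParsedPipe[],
  pipeFns: Record<string, PipeFn>,
): Promise<unknown> {
  let value = initial;
  for (const { name, args } of pipes) {
    const fn = pipeFns[name];
    if (!fn) throw new Error(`Unknown pipe: ${name}`);
    value = await fn({ value, args }); // await covers sync and async pipes alike
  }
  return value;
}

// Demo pipes written for this sketch only (not the library's implementations).
const demoPipes: Record<string, PipeFn> = {
  substr: ({ value, args }) => String(value).substring(Number(args[0] ?? 0)),
  upper: async ({ value }) => String(value).toUpperCase(),
};

console.log(parseExpression('a[href^=mailto:] | attr:href | substr:7'));
runPipes('mailto:john.doe@example.com', [parsePipe('substr:7'), parsePipe('upper')], demoPipes)
  .then((result) => console.log(result)); // JOHN.DOE@EXAMPLE.COM
```

The colon inside `a[href^=mailto:]` is left alone because only the parts after the first | are parsed as pipes.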

Use pipes in selector props:

{
  email: 'a[href^=mailto:] | attr:href | substr:7',
}

Use pipes in objects:

{
    name: 'span.name',
    email: 'a[href^=mailto:] | attr:href | substr:7',
    telephone: 'span.tel',
    '|': 'requiredProps:name;email'
}

Note: Don't like the | name for the pipe property? Change it through options: cheerioJsonMapper(html, template, { pipeProp: '__pipes' }).

Default pipes are included; attr and substr, used in the examples above without any extra setup, are among them.

Custom pipes

Create your own pipes to handle any customization needed.

const customPipes = {
  /** Replace any http:// link into https:// */
  onlyHttps: ({ value }) => value?.toString().replace(/^http:/, 'https:'),

  /** Check that all required props exist; if not, set the object to undefined */
  requiredProps: ({ value, args }) => {
    const obj = value; // meant to run as an object pipe, so value should be an object
    const requiredProps = args; // string array, e.g. ['name', 'email']
    const hasMissingProps = requiredProps.some((prop) => obj[prop] == null);
    return hasMissingProps ? undefined : obj;
  },
};

const template = [
  {
    name: 'span.name',
    telephone: 'span.tel',
    email: 'a[href^=mailto:] | attr:href | substr:7',
    website: 'a[href^=http] | attr:href | onlyHttps',
    '|': 'requiredProps:name;email',
  },
];

const contacts = await cheerioJsonMapper(html, template, { pipeFns: customPipes });
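Since pipes are plain functions, they can be tried out in isolation with hand-made input before being wired into a template. The sketch below repeats the two pipes above with explicit types. The type annotations are assumptions added for this example, and the args array mirrors what 'requiredProps:name;email' is described as producing.

```typescript
// Same logic as the custom pipes above, typed for standalone use (types are assumptions).
const onlyHttps = ({ value }: { value?: string }): string | undefined =>
  value?.toString().replace(/^http:/, 'https:');

const requiredProps = ({
  value,
  args,
}: {
  value: Record<string, unknown>;
  args: string[];
}): Record<string, unknown> | undefined => {
  const hasMissingProps = args.some((prop) => value[prop] == null);
  return hasMissingProps ? undefined : value;
};

// Property pipe: only the protocol prefix is rewritten.
console.log(onlyHttps({ value: 'http://example.com/about' }));
// https://example.com/about

// Object pipe: a complete contact passes through unchanged.
console.log(
  requiredProps({
    value: { name: 'John Doe', email: 'john.doe@example.com' },
    args: ['name', 'email'],
  }),
);

// Object pipe: a contact without an email is replaced by undefined.
console.log(
  requiredProps({
    value: { name: 'Jane Doe', email: undefined },
    args: ['name', 'email'],
  }),
);
// undefined
```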

Examples

More examples are found in the repo's tests/cases folder.

Change Log

v1.0.4 - 2024-10-18

v1.0.3 - 2023-04-11

v1.0.2 - 2023-04-04

v1.0.1 - 2023-03-28

v1.0.0 - 2023-03-28