ThaNarie / html-extract-data

Extract data from the DOM using a JSON config
MIT License

Feature: Custom extractor chaining with valid json. #13

Open pbreah opened 2 years ago

pbreah commented 2 years ago

@ThaNarie

Thank you for sharing this library. A configurable data extractor like this is really helpful.

I would like to propose a feature that would take it to the next level, if you would accept a pull request. The idea is to let users register their own predefined extractors and extend the library, without putting inline functions in the JSON configuration.

The reason is simple: I can register my extractor functions once, then provide a valid JSON structure that refers to them by name.

This is the idea:

import extractFromHTML from 'html-extract-data';
// immediately add my custom extractors
extractFromHTML.addExtractors({
  // p1 and p2 are just my extractor's custom parameters
  'myCustom': (extract, element, p1, p2) => {
    const d = extract({ query: '.js-image', attr: 'alt' });
    if (p1 === 'myTest1') {
      // do logic 1 to d
    }
    if (p2 === 'myTest2') {
      // do logic 2 to d
    }
    // etc.
    return d;
  },
  // in this case my custom extractor has just one parameter (p1)
  'myCustom2': (extract, element, p1) => {
    let d;
    if (p1) {
      // data received from the output of the previous function or an explicit parameter
      d = p1.indexOf('dummy-data') !== -1 ? 'found': 'not-found';
    }
    return d;
  }
});

// at a later stage I pass a pure JSON config
const data = extractFromHTML(
  html,
  {
    query: '.grid-item',
    list: true,
    self: {
      'category': 'data-category',
      'id': { attr: 'data-id', convert: 'number' },
    },
    data: {
      title: 'h2',
      description: { query: 'p', html: true },
      tags: { query: '.tags > .tag', list: true },
      price: { query: '.price', convert: parseFloat },
      date: { query: '.date', convert: 'date' },
      // myCustom runs with true as its 1st parameter and false as its 2nd; its output is then passed to myCustom2
      image: 'myCustom true false | myCustom2'
    }
  });

As you can see, this is a simple feature that combines pure JSON with my own extractors in a chain, where the output of one extractor feeds into the next. The same mechanism could also "pipe" data into functions you already provide (like convert), so users can build string expressions out of existing features.
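To make the chain syntax concrete, here is a minimal sketch of how an expression such as 'myCustom true false | myCustom2' might be split into steps. The names parseChain, parseLiteral, and Step are hypothetical and not part of html-extract-data, and the coercion of true, false, and numeric tokens is an assumption about how parameters could be typed, since the proposal does not specify it.

```typescript
// Hypothetical helpers, not part of html-extract-data.
type Literal = string | number | boolean;
type Step = { name: string; args: Literal[] };

// Assumption: 'true'/'false' and numeric tokens become typed values,
// everything else stays a string.
function parseLiteral(token: string): Literal {
  if (token === 'true') return true;
  if (token === 'false') return false;
  const asNumber = Number(token);
  return isNaN(asNumber) ? token : asNumber;
}

// Split on '|' into steps, then split each step on whitespace:
// the first token is the extractor name, the rest are its parameters.
function parseChain(expression: string): Step[] {
  return expression
    .split('|')
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .map((part) => {
      const [name, ...rest] = part.split(/\s+/);
      return { name, args: rest.map((token) => parseLiteral(token)) };
    });
}

const parsedSteps = parseChain('myCustom true false | myCustom2');
// parsedSteps[0] is { name: 'myCustom', args: [true, false] }
// parsedSteps[1] is { name: 'myCustom2', args: [] }
```

Splitting on whitespace means a parameter cannot itself contain spaces or a pipe character; a real implementation would probably need quoting rules.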

Would you accept a pull request with a feature like this?

Thanks,

ThaNarie commented 2 years ago

Hi @pbreah!

Thanks for your interest and the well-thought-out feature request! Unfortunately, it goes in the opposite direction of where I want this to go.

Where I want to make it more "code", so that TypeScript can better help the developer when configuring, you seem to want to go in the "no-code" direction, even calling it a "json configuration", which it clearly isn't :)

However, there might be a way to make it work as an optional element, something like:

const data = extractFromHTML(
  html,
  configFromJson(config, options),
)

Where config is your proposed JSON config, and options contains at least the custom extractors that the function will use. configFromJson would then convert your JSON-like config into one that does use functions, which the library can already work with.

Doing it like that doesn't even require any change to the library, as it's something you can create yourself and design how you see fit. But since it can be tree-shaken out if it's not used, I'm not opposed to including it here for other people to use once you finish your work :)
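A rough sketch of what such a configFromJson could look like, under several assumptions: the function shape (extract, element) => value for compiled fields, the Options type, and the helpers compileChain and hasExtractor are all hypothetical and not taken from the library. A string is treated as a chain only when its first token is a registered extractor name; every other value is passed through unchanged, and parameters are forwarded as raw strings.

```typescript
// Everything here is a hypothetical sketch; none of it is html-extract-data's API.
type Extract = (config: unknown) => unknown;
type CustomExtractor = (extract: Extract, element: unknown, ...params: unknown[]) => unknown;
type FieldFunction = (extract: Extract, element: unknown) => unknown;
type Options = { extractors: Record<string, CustomExtractor> };

const hasExtractor = (options: Options, name: string): boolean =>
  Object.prototype.hasOwnProperty.call(options.extractors, name);

// Compile 'a x y | b' into one function: the first step gets its own
// parameters, later steps get the previous output as their first parameter.
function compileChain(expression: string, options: Options): FieldFunction {
  const steps = expression.split('|').map((part) => {
    const [name, ...args] = part.trim().split(/\s+/);
    if (!hasExtractor(options, name)) throw new Error(`Unknown extractor: ${name}`);
    return { run: options.extractors[name], args };
  });
  return (extract, element) => {
    let previous: unknown;
    steps.forEach((step, index) => {
      previous = index === 0
        ? step.run(extract, element, ...step.args)
        : step.run(extract, element, previous, ...step.args);
    });
    return previous;
  };
}

// Walk the JSON-like config and replace chain strings with functions.
function configFromJson(config: unknown, options: Options): unknown {
  if (typeof config === 'string') {
    const firstToken = config.trim().split(/[\s|]+/)[0];
    return hasExtractor(options, firstToken) ? compileChain(config, options) : config;
  }
  if (Array.isArray(config)) {
    return config.map((item) => configFromJson(item, options));
  }
  if (config !== null && typeof config === 'object') {
    const source = config as Record<string, unknown>;
    const result: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      result[key] = configFromJson(source[key], options);
    }
    return result;
  }
  return config;
}

// Usage with two toy extractors (also hypothetical).
const demoOptions: Options = {
  extractors: {
    upper: (_extract, _element, value) => String(value).toUpperCase(),
    wrap: (_extract, _element, previous, left, right) => `${left}${previous}${right}`,
  },
};
const compiled = configFromJson(
  { query: '.grid-item', data: { title: 'h2', label: 'upper abc | wrap [ ]' } },
  demoOptions,
) as { query: string; data: { title: string; label: FieldFunction } };
```

One trade-off of detecting chains by their first token is that an extractor name can shadow a selector or option value with the same spelling (for example an extractor called h2), so an explicit marker for chain strings may be the safer design.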

How does this sound to you?