calebsander / html-soup

A Node.js package to do some basic HTML parsing and CSS selection

Support direct attribute access, innerText, and dataset properties #2

Closed Iiridayn closed 5 years ago

Iiridayn commented 5 years ago

I used something like this to get the attributes I needed (getInnerText omitted, as it's currently returning wrong values):

function proxifyNode(node) {
    let wrappedChildren = false;
    return new Proxy(node, {
        get: function(target, property, receiver) {
            // children must be wrapped!
            if (property === 'children' && !wrappedChildren) {
                wrappedChildren = true;
                target.children = target.children.map(child => {
                    if (child.constructor.name === 'HtmlTag')
                        return proxifyNode(child);
                    return child;
                });
            }

            // prefer normal properties first
            if (property in target)
                return target[property];

            if (property === 'innerText')
                return getInnerText(target);

            if (property === 'dataset') {
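                const dataset = new Proxy(target, {
                    get: function(target, property, receiver) {
                        // symbols have no replace(); only string keys map to data-* attributes
                        if (typeof property !== 'string')
                            return undefined;
                        return target.attributes['data-' + property.replace(/([A-Z])/g, (match, p1) => '-' + p1.toLowerCase())];
                    }
                });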
                // set as property so I only build the dataset Proxy once
                target[property] = dataset;
                return dataset;
            }

            // check for attributes after the special properties
            if (property in target.attributes)
                return target.attributes[property];
        },
    });
}

And call it as:

const dom = proxifyNode(htmlSoup.parse(body));

Having support for this at the library level would be great. My use case: I'm sharing scraping code between the server (Node.js) and a browser extension. I've also wrapped htmlSoup.select to behave like document.querySelector and document.querySelectorAll, since my scraper code has been using those two functions.
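
A minimal sketch of what such a wrapper might look like. It assumes a select function of the shape select(dom, selector) that returns an iterable of matching tags; whether htmlSoup.select has exactly that shape is an assumption here, so the function is passed in rather than called directly. The name makeQueryFunctions and its wrap parameter are hypothetical.

```javascript
// Hypothetical helper: builds querySelector-style functions around any
// select(dom, selector) function that returns an iterable of matches.
// Assumption: htmlSoup.select has this shape; check the library's README.
function makeQueryFunctions(select, dom, wrap = node => node) {
    function querySelectorAll(selector) {
        // Copy the matches into an array so callers can index and iterate
        return Array.from(select(dom, selector), node => wrap(node));
    }

    function querySelector(selector) {
        // Mirror the browser: return the first match, or null if none
        for (const match of select(dom, selector))
            return wrap(match);
        return null;
    }

    return { querySelector, querySelectorAll };
}
```

Under those assumptions it could be called as makeQueryFunctions((dom, selector) => htmlSoup.select(dom, selector), dom, proxifyNode), so that every match comes back wrapped in the proxy above. Note that querySelectorAll returns a plain array here, not a NodeList.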

Iiridayn commented 5 years ago

Got getInnerText working well enough for me - see https://html.spec.whatwg.org/multipage/dom.html#the-innertext-idl-attribute if you'd like the gory details of what I skipped.

// doesn't handle table rows or cells, or apply css - should be adequate though
function getInnerText(node) {
    if (node.type === 'br')
        return '\n';

    let text = '';
    node.children.forEach(child => {
        if (child.constructor.name === 'HtmlTag') {
            text += getInnerText(child);
        } else if (child.constructor.name === 'TextNode') {
            text += child.text;
        } else {
            console.warn('Unhandled type', child.constructor.name);
        }
    });

    if (node.type === 'p')
        return '\n' + text + '\n';
    else
        return text;
}

Iiridayn commented 5 years ago

I've already made some updates as I moved on to the next scraper: the dataset property now checks that the key is not a symbol, and since dom is apparently sometimes an array and sometimes not, I handle both cases.
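
The updated code was not posted, so the following is only a sketch of how those two changes might look, not the actual version. The helper names toDataAttribute, makeDataset, and wrapParsed are hypothetical, and makeDataset works on a plain attributes object rather than on an html-soup node.

```javascript
// Convert a camelCase dataset key to its data-* attribute name,
// e.g. "userId" -> "data-user-id".
function toDataAttribute(key) {
    return 'data-' + key.replace(/([A-Z])/g, (match, p1) => '-' + p1.toLowerCase());
}

// Hypothetical helper: a dataset-like, read-only view over an attributes object.
function makeDataset(attributes) {
    return new Proxy({}, {
        get(target, property) {
            // Symbol keys (used by some built-in operations and by console
            // logging) have no replace() method, so skip them instead of throwing
            if (typeof property !== 'string')
                return undefined;
            return attributes[toDataAttribute(property)];
        }
    });
}

// Hypothetical helper: apply a wrapper whether the parse result is a
// single node or an array of nodes.
function wrapParsed(parsed, wrap) {
    return Array.isArray(parsed) ? parsed.map(node => wrap(node)) : wrap(parsed);
}
```

With these, the earlier call could become wrapParsed(htmlSoup.parse(body), proxifyNode), which works for either return shape.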

calebsander commented 5 years ago

I didn't intend for this library to be a drop-in DOM replacement; there are popular libraries for Node.js that do a much better job of that. That is why the DOM representation in html-soup is more of an abstract syntax tree and doesn't directly expose attributes or support computed properties like innerText and innerHTML. With that said, I think dataset seems like a useful property to support, so I'll try to add it. However, the WHATWG standard for innerText depends heavily on how the HTML is rendered, so I don't think it really makes sense to implement it for html-soup. (For example, your code is missing special cases for <ul> and <ol> elements, which have a newline between each list element in Chrome's innerText implementation.)

calebsander commented 5 years ago

Added in 03236c7df74d8fc749ba3e314fa6df755ddd7a65

Iiridayn commented 5 years ago

Makes sense. Might you recommend a drop-in DOM replacement which doesn't use a virtual browser and has minimal dependencies? I admit I only spent about 90 minutes searching before I settled on using your library since it beat out the best alternatives I've been able to find so far.

calebsander commented 5 years ago

jsdom definitely seems like the most full-featured virtual DOM replacement. But you're right that it's not exactly minimal. Other libraries like cheerio are more stripped-down, so they're smaller but don't simulate the DOM quite as well.

Iiridayn commented 5 years ago

Yeah - I looked into both. cheerio seems to be pretty popular, but mimics jQuery instead of querySelector - and while I've only touched on it peripherally and a decade ago, I recall the syntax being slightly different. jsdom on the other hand supports script execution... which is fantastic only for those who need the complexity. Your library on the other hand gets me 95% of the way there, and adding stuff like dataset was the work of only a couple hours. I suspect with jsdom I'd still be reading their documentation... :/.

Iiridayn commented 5 years ago

I may have misremembered though. I might just be acting old and stubborn to resist adding jQuery as a dependency to my browser extension (which I'd think is kinda ridiculous). Still, with it being a 95% solution and me being too stubborn to risk needing a jQuery dependency, html-soup was exactly what I was looking for :).