WebReflection / linkedom

A triple-linked list based DOM implementation.
https://webreflection.medium.com/linkedom-a-jsdom-alternative-53dd8f699311
ISC License

Streams support (pt. 2) #177

Closed: iMrDJAi closed this 1 year ago

iMrDJAi commented 1 year ago

Introducing DOMStream: an efficient way for parsing very large HTML/XML documents!

Usage:

import fs from 'node:fs'
import {join} from 'node:path'
import {DOMStream} from 'linkedom'

const src = fs.createReadStream(join(__dirname, './big.xml')) // ~80 MB!
const domStream = new DOMStream('text/xml', (name, _attributes) => {
  // From htmlparser2's `onopentag`
  return name === 'hotelName'
})

src.pipe(domStream).on('document', doc => {
  // Magic!
  console.log(doc.documentElement.outerHTML)
})
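For context, the shape of such a stream can be sketched as a Node Transform that scans each chunk and carries the unfinished tail into the next one. The sketch below is hypothetical: scanChunk and FilteredTagStream are illustrative names, and a naive regex scanner stands in for htmlparser2, which the actual PR builds on.

```javascript
import {Transform} from 'node:stream';

// Hypothetical sketch of the idea behind DOMStream (not linkedom's code):
// find innermost <tag>…</tag> pairs, keep those whose tag name passes the
// user-supplied filter, and carry the unfinished tail to the next chunk.
// A real implementation would use a proper parser such as htmlparser2.
function scanChunk(buffer, filter) {
  const matches = [];
  // Innermost pairs only: element text may not contain another tag.
  const re = /<([\w-]+)[^>]*>([^<]*)<\/\1>/g;
  let m, lastEnd = 0;
  while ((m = re.exec(buffer)) !== null) {
    if (filter(m[1])) matches.push(m[0]);
    lastEnd = m.index + m[0].length;
  }
  return {matches, rest: buffer.slice(lastEnd)};
}

class FilteredTagStream extends Transform {
  constructor(filter) {
    super();
    this.filter = filter;
    this.tail = '';
  }
  _transform(chunk, _encoding, done) {
    const {matches, rest} = scanChunk(this.tail + chunk.toString(), this.filter);
    for (const fragment of matches) this.emit('document', fragment);
    this.tail = rest; // unfinished markup waits for the next chunk
    done();
  }
}
```

Because only the currently matched element is ever buffered, memory stays proportional to the largest matching subtree rather than the whole file.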

Using DOMParser: Time: ~11 min, Memory: ~3 GB (I got a heap out-of-memory error but was able to resume from the JavaScript debug console).


Using DOMStream 🔥: Time: ~27s, Memory: ~13MB.


iMrDJAi commented 1 year ago

@WebReflection Looks like we have typing issues; maybe we should wait for https://github.com/WebReflection/linkedom/pull/168 before merging this one? (XMLDocument is exported from 'linkedom/types/xml/document').


Edit: Never mind, the correct path was 'linkedom/types/esm/xml/document'.

WebReflection commented 1 year ago

dunno mate, I really don't have any of your (folks) problems with LinkeDOOM because I don't care a bit about TS ... but sure ... as that's meant to solve the problem you mentioned, let's wait.

iMrDJAi commented 1 year ago

dunno mate, I really don't have any of your (folks) problems with LinkeDOOM because I don't care a bit about TS ... but sure ... as that's meant to solve the problem you mentioned, let's wait.

Fair enough. There is not much to do here anyway, tsc doesn't support @property yet (https://github.com/microsoft/TypeScript/issues/28730).

For future reference, I think users will have to explicitly specify the argument type of the 'document' event callback:

const domStream = new DOMStream('text/xml', name => name === 'hotelName')
  .on('document', async (doc: XMLDocument) => {
    //
  })

WebReflection commented 1 year ago

code coverage is not happy ... the goal here is 100% or nothing so far, it'd be great to keep it as such, thanks.

iMrDJAi commented 1 year ago

code coverage is not happy ... the goal here is 100% or nothing so far, it'd be great to keep it as such, thanks.

@WebReflection I wrote more tests to cover the new feature. Is it good enough?

WebReflection commented 1 year ago

@iMrDJAi you can run npm test locally and see the table with the code coverage score ... if it's not 100% across all rows and fields, it's not good enough; if it is, great 🥳

WebReflection commented 1 year ago

@iMrDJAi coverage down to 92% from 100% https://coveralls.io/builds/53677686 ... if there's some obvious case you think should not be covered you can use special comments around but as this is 100% new code I'd like to see it covered.
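The "special comments" referred to here are coverage-ignore hints, e.g. /* c8 ignore next */ for c8 (istanbul/nyc uses /* istanbul ignore next */ instead). A hypothetical illustration; the function is made up purely to show where the hint goes:

```javascript
// Made-up helper, only to demonstrate coverage-ignore placement.
function mimeToMode(mimeType) {
  if (mimeType === 'text/xml') return 'xml';
  /* c8 ignore next 2 */
  if (mimeType === 'application/xhtml+xml')
    return 'xml';
  return 'html';
}
```

The two lines after the hint are excluded from the coverage report, so an uncovered defensive branch no longer drags the score below 100%.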

iMrDJAi commented 1 year ago

P.S. it is very possible that sharing code will also help with code coverage

@WebReflection I don't have to write more tests, I'll do what you've suggested earlier.

WebReflection commented 1 year ago

will try to have a closer look locally soon, thanks for the refactoring though!

iMrDJAi commented 1 year ago

will try to have a closer look locally soon, thanks for the refactoring though!

No problem!

WebReflection commented 1 year ago

a possible stopper https://github.com/WebReflection/linkedom/issues/180

this module works in JS Workers, and if the stream implementation in third-party libraries uses something not Web compatible, I am not sure (actually I am) this should move forward ... it would rather be a dedicated parser usable only in NodeJS envs ... thoughts?

WebReflection commented 1 year ago

the stream uses NodeJS-specific modules, including the emitter, so pushing this change will break everything out there based on Cloudflare Workers or just Web JS Workers ... this is not going to happen, if my understanding is correct ... feel free to fork and make this module a NodeJS-only project, as I have no interest in doing that, because there are already consumers using this module not as a crawler and that's the promise behind it: this module is not a crawler and never wants to be one.

WebReflection commented 1 year ago

besides apologizing for my last comment, this PR made me realize I need to improve testing via puppeteer or similar and be sure the module works in browsers, not just on servers. I really did appreciate all the effort around this PR; it's just sneaky for the whole use case this module serves (not on purpose, but I'm glad I didn't just merge it)

iMrDJAi commented 1 year ago

@WebReflection That's actually a concern I had at the beginning (https://github.com/WebReflection/linkedom/issues/174):

Worried about Node.js streams support for cross environments (browser/deno), including types.

But there is definitely a solution for that. How about an env check? Polyfills? We may always include polyfills for Node's stream (streams are instances of events) and string_decoder.

And regarding Cloudflare Workers, it would work even without including polyfills: https://developers.cloudflare.com/workers/wrangler/configuration/#node-compatibility.

I'll try it; in the end it's all about installing a rollup plugin.
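The env check mentioned above can be a tiny feature test at module load time; a sketch with made-up backend labels (not linkedom API):

```javascript
// Sketch of a runtime environment check: prefer Node streams under
// NodeJS, fall back to a Web-compatible backend elsewhere.
// 'node-streams' / 'web-streams' are illustrative labels only.
const isNode =
  typeof process !== 'undefined' &&
  process.versions != null &&
  typeof process.versions.node === 'string';

const backend = isNode ? 'node-streams' : 'web-streams';
```

The check deliberately probes process.versions.node rather than just process, since some bundlers and runtimes shim a partial process global.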

iMrDJAi commented 1 year ago

It needs to be tested

chr1se commented 1 year ago

polyfills? We may always include polyfills for node

@iMrDJAi have you tried using the Web Streams API to achieve those 🔥 awesome 🔥 benchmarks?

It is supported on all platforms: Node, Deno Deploy, Cloudflare Workers, browsers/web workers.
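For reference, the Web Streams API exposes ReadableStream, WritableStream, and TransformStream as globals in Node 18+, Deno, browsers, and Cloudflare Workers, so a parser could live inside a TransformStream rather than a Node stream.Transform. A toy sketch, not linkedom code:

```javascript
// Toy TransformStream that upper-cases text chunks; a streaming parser
// would do its per-chunk work inside transform() the same way.
const upper = new TransformStream({
  transform(chunk, controller) {
    controller.enqueue(String(chunk).toUpperCase());
  }
});
```

The same object can then be wired up with readable.pipeThrough(upper) in any of those environments, with no Node-only imports involved.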

iMrDJAi commented 1 year ago

@chr1se Is the Web Streams API considered a stable feature? Also, htmlparser2 depends on NodeJS streams, so a new adapter would be needed? Or we may just use both to achieve maximum compatibility (we can do a runtime environment check)?

WebReflection commented 1 year ago

This module works in WebWorkers too … worker.js is meant to provide SSR for Service Worker PWAs … polyfills are OK only if tiny and battle-tested, otherwise they are a burden to maintain that I've no capacity for … is a fork targeting NodeJS only, with easy upstream sync, really off the table? I don't want to ruin this project's scope which is, once again, not crawling. Maybe a plugin system to bring your own parser could be a better solution?

iMrDJAi commented 1 year ago

I don’t want to ruin this project scope which is, once again, not crawling.

@WebReflection Not sure why you assume that I'm interested in crawling; that is by no means my intention. Just because the person who opened the first issue (https://github.com/WebReflection/linkedom/issues/98#issuecomment-1015542142) was trying to crawl doesn't mean that everyone else is doing the same. What I'm doing here is trying to achieve what linkedom has claimed to do:

This structure also allows programs to avoid issues such as "Maximum call stack size exceeded" (basicHTML), or "JavaScript heap out of memory" crashes (JSDOM), thanks to its reduced usage of memory and zero stacks involved, hence scaling better from small to very big documents.

Sure, it provided a huge improvement, but still, it doesn't scale well. ¯\_(ツ)_/¯
The point is, this is within the scope of the library, according to the README.

I'm aware that you're a busy man, and I respect that; however, it's unlikely that anything I've added would force you to take action in the future or cause complexity in any way, and as for polyfills, the build step is automatic, so I'm not sure what you mean by "burden to maintain".

Maybe a plugin system to bring your own parser could be a better solution?

No, that would be overkill; most of the files are already classes, and classes are extendable.

WebReflection commented 1 year ago

like I've said, if the worker.js file works in browsers out of the box and the resulting size is not increased by too many bytes, I'd be OK, but it's a priority of this project to work everywhere and not only in NodeJS ... bun, deno, everywhere, like it does right now already. The polyfill part is in case they patch some breaking change and I get the blame, something unfortunately frequent in the npm world.

P.S. not a crawler means it's not the goal of this project to consume the internet at light speed ... its parser is meant for templates; the fact that it works and doesn't throw or bail out on bigger templates is indeed part of the goal. The heavily benchmarked websites and pages are there to demo that this module won't crash ... parsing huge MBs of sites is not the main goal. If this is not clear in the README I can change the readme and make it clearer.

iMrDJAi commented 1 year ago

@WebReflection Hello, sorry for the delay. Since the polyfills add around 200 kb to the 300 kb worker.js (less if minified), are you okay with implementing the Web Streams API instead? Otherwise, I'm okay with not merging and keeping my changes in my fork (unless you become interested or this feature gains more demand in the future, in which case I'll be happy to reopen the pull request).

WebReflection commented 1 year ago

What are the numbers once minified and gzipped? But yeah, 3/4th of the current size for polyfills doesn’t look too compelling to me … but what if I publish a branch myself with the streaming option as linkedom-stream so that both maintaining it and synchronizing it would be under this repo? I’d be ok with that, or even a dedicated linkedom/stream export so no branch dance is needed, and most code gets shared by default
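A dedicated linkedom/stream export like the one floated here would be a conditional "exports" entry in package.json; a sketch with hypothetical file paths, not the actual manifest:

```json
{
  "exports": {
    ".": {
      "import": "./esm/index.js",
      "default": "./cjs/index.js"
    },
    "./stream": {
      "import": "./esm/stream.js",
      "default": "./cjs/stream.js"
    }
  }
}
```

Consumers who never import 'linkedom/stream' would pull in no streaming code at all, keeping the worker-friendly core untouched.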

iMrDJAi commented 1 year ago

What are the numbers once minified and gzipped? But yeah, 3/4th of the current size for polyfills doesn’t look too compelling to me … but what if I publish a branch myself with the streaming option as linkedom-stream so that both maintaining it and synchronizing it would be under this repo? I’d be ok with that, or even a dedicated linkedom/stream export so no branch dance is needed, and most code gets shared by default

Actually, that's a good idea! A mono repo with multiple npm packages would work. Would you like to proceed?

iMrDJAi commented 1 year ago

Closed since I got no confirmation from the maintainer to proceed.