Closed iMrDJAi closed 1 year ago
@WebReflection Looks like we have typing issues, maybe we should wait for https://github.com/WebReflection/linkedom/pull/168 first before merging this one? (XMLDocument is exported from 'linkedom/types/xml/document').
Edit: Never mind, the correct path was 'linkedom/types/esm/xml/document'.
dunno mate, I really don't have any of your (folks) problems with LinkeDOOM because I don't care a bit about TS ... but sure ... as that's meant to solve the problem you mentioned, let's wait.
> dunno mate, I really don't have any of your (folks) problems with LinkeDOOM because I don't care a bit about TS ... but sure ... as that's meant to solve the problem you mentioned, let's wait.
Fair enough. There is not much to do here anyway; tsc doesn't support `@property` yet (https://github.com/microsoft/TypeScript/issues/28730).
For future reference, I think users will have to explicitly specify the argument type of the 'document' event callback:
```ts
const domStream = new DOMStream('text/xml', name => name === 'hotelName')
  .on('document', async (doc: XMLDocument) => {
    // ...
  });
```
code coverage is not happy ... the goal here is 100% or nothing so far, it'd be great to keep it as such, thanks.
> code coverage is not happy ... the goal here is 100% or nothing so far, it'd be great to keep it as such, thanks.
@WebReflection I wrote more tests to cover the new feature. Is it good enough?
@iMrDJAi you can run `npm test` locally and see the table with the code coverage score ... if it's not 100% across all rows and fields, it's not good enough; if it is, great 🥳
@iMrDJAi coverage is down to 92% from 100% https://coveralls.io/builds/53677686 ... if there's some obvious case you think should not be covered you can use special comments around it, but as this is 100% new code I'd like to see it covered.
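For reference, those "special comments" are coverage ignore hints. A minimal sketch, assuming the istanbul/nyc instrumenter (`c8` uses `/* c8 ignore ... */` instead); `readChunk` is a hypothetical helper, not linkedom code:

```javascript
// Excluding a defensive, untestable branch from coverage reporting.
function readChunk(source) {
  /* istanbul ignore if */ // defensive guard, not reachable in tests
  if (source == null) throw new TypeError('source required');
  return source.slice(0, 16);
}

console.log(readChunk('hello world, streaming!')); // first 16 characters
```

The ignore hint only silences the uncovered branch in the report; the surrounding code is still instrumented normally.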
P.S. it is very possible that sharing code will also help with code coverage
@WebReflection I don't have to write more tests, I'll do what you've suggested earlier.
will try to have a closer look locally soon, thanks for the refactoring though!
> will try to have a closer look locally soon, thanks for the refactoring though!
No problem!
a possible stopper https://github.com/WebReflection/linkedom/issues/180
this module works in JS Workers, and if the stream implementation in third-party libraries uses something not Web compatible, I am not sure (actually I am) this should move forward ... it would rather be a dedicated parser usable only in NodeJS envs ... thoughts?
the stream uses NodeJS specific modules, including the emitter, so pushing this change will break everything out there based on Cloudflare Workers or just Web JS Workers ... this is not going to happen, if my understanding is correct ... feel free to fork and make this module a NodeJS-only project, as I have no interest in doing that, because there are already consumers using this module not as a crawler and that's the promise behind it: this module is not a crawler and never wants to be one.
Besides apologizing for my last comment, this PR made me realize I need to improve testing via puppeteer or similar and be sure the module works in browsers, not just in servers. I really did appreciate all the effort around this PR; it's just sneaky for the whole use case this module serves (not on purpose, but I'm glad I didn't just merge it).
@WebReflection That's actually a concern I had at the beginning (https://github.com/WebReflection/linkedom/issues/174):
> Worried about Node.js streams support for cross environments (browser/deno), including types.
But there is definitely a solution for that. How about an env check? Polyfills? We may always include polyfills for node `stream` (streams are instances of events) and `string_decoder`.
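A minimal sketch of what such an env check could look like (illustrative only; the backend names and the `pickStreamBackend` helper are placeholders, not linkedom's actual code):

```javascript
// Prefer the Web Streams API when available; fall back to Node streams.
const hasWebStreams = typeof TransformStream === 'function';
const isNode =
  typeof process !== 'undefined' &&
  process.versions != null &&
  process.versions.node != null;

function pickStreamBackend() {
  if (hasWebStreams) return 'web';  // browsers, Deno, Workers, Node >= 18
  if (isNode) return 'node';        // older Node: fall back to require('stream')
  throw new Error('no streaming backend available');
}

console.log(pickStreamBackend());
```

The check runs once at module load, so the per-chunk hot path pays no cost for it.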
And regarding Cloudflare Workers, it would work even without including polyfills: https://developers.cloudflare.com/workers/wrangler/configuration/#node-compatibility.
I'll try it; it's all about installing a rollup plugin in the end. It needs to be tested.
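For reference, a sketch of what that rollup setup might look like, assuming the third-party `rollup-plugin-polyfill-node` package (the input/output file names are placeholders):

```javascript
// rollup.config.js
import nodePolyfills from 'rollup-plugin-polyfill-node';

export default {
  input: 'worker.js',
  output: { file: 'dist/worker.js', format: 'esm' },
  plugins: [
    // Shims Node built-ins (stream, events, string_decoder, ...) so the
    // bundle can also run in browsers and Web/Service Workers.
    nodePolyfills()
  ]
};
```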
> polyfills? We may always include polyfills for node
@iMrDJAi have you tried using the Web Streams API to achieve those 🔥 awesome 🔥 benchmarks? It is supported on all platforms: node, deno-deploy, cloudflare, browser/webworkers.
@chr1se Is the Web Streams API considered a stable feature? Also, htmlparser2 depends on NodeJS streams, so a new adapter is needed? Or we may just use both to achieve maximum compatibility (we can do a runtime environment check)?
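For what it's worth, the cross-platform primitives in question are `ReadableStream`/`TransformStream`/`WritableStream`, available as globals in Node 18+, Deno, Cloudflare Workers, and browsers. A minimal sketch of piping chunks through a transform; the tag-counting "parser" is a toy stand-in, not htmlparser2:

```javascript
// A TransformStream that passes text chunks through while counting
// opening tags, then appends a summary comment on flush.
const tagCounter = () => {
  let count = 0;
  return new TransformStream({
    transform(chunk, controller) {
      count += (chunk.match(/<[a-zA-Z]/g) || []).length;
      controller.enqueue(chunk);
    },
    flush(controller) {
      controller.enqueue(`<!-- tags seen: ${count} -->`);
    }
  });
};

async function run() {
  const source = new ReadableStream({
    start(controller) {
      controller.enqueue('<html><body>');
      controller.enqueue('<p>hi</p></body></html>');
      controller.close();
    }
  });

  let out = '';
  await source.pipeThrough(tagCounter()).pipeTo(
    new WritableStream({ write(chunk) { out += chunk; } })
  );
  return out;
}

run().then(console.log);
```

An adapter for htmlparser2 would only need to call its `write()`/`end()` methods from `transform`/`flush`, since the parser itself is push-based and not tied to Node streams for input.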
This module works in WebWorkers too … worker.js is meant to provide SSR for Service Worker PWAs … polyfills are OK only if tiny and battle tested, otherwise they are a burden to maintain and I've no capacity … is a fork targeting NodeJS only, with easy upstream sync, really off the table? I don't want to ruin this project's scope which is, once again, not crawling. Maybe a plugin system to bring your own parser could be a better solution?
> I don’t want to ruin this project scope which is, once again, not crawling.
@WebReflection Not sure why you assume that I'm interested in crawling; this is by no means my intention. Just because the person who opened the first issue (https://github.com/WebReflection/linkedom/issues/98#issuecomment-1015542142) was trying to crawl doesn't mean everyone else is doing the same. What I'm doing here is trying to achieve what linkedom has claimed to do:
> This structure also allows programs to avoid issues such as "Maximum call stack size exceeded" (basicHTML), or "JavaScript heap out of memory" crashes (JSDOM), thanks to its reduced usage of memory and zero stacks involved, hence scaling better from small to very big documents.
Sure it provided a huge improvement, but still, it doesn't scale well. ¯\_(ツ)_/¯
The point is, this is within the scope of the library, according to the README.
I'm aware that you're a busy man, and I respect that; however, it's unlikely that anything I've added would force you to take action in the future or cause complexity in any way, and as for polyfills, the build step is automatic, so I'm not sure what you mean by "burden to maintain".
> Maybe a plugin system to bring your own parser could be a better solution?
No, that would be overkill; most of the files are already classes, and classes are extendable.
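To illustrate the extension-over-plugins point, a toy sketch; `BaseParser` and `StreamingParser` are hypothetical stand-ins here, not linkedom exports:

```javascript
// Overriding one step of a parser class instead of wiring a plugin system.
class BaseParser {
  parse(text) {
    return { kind: this.kind(), length: text.length };
  }
  kind() { return 'generic'; }
}

class StreamingParser extends BaseParser {
  // Only the behavior that differs is overridden; everything else is shared.
  kind() { return 'streaming'; }
}

console.log(new StreamingParser().parse('<root/>'));
```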
like I've said, if the worker.js file works in browsers out of the box and the resulting size is not increased by too many bytes, I'd be OK, but it's a priority of this project to work everywhere and not only in NodeJS ... bun, deno, everywhere, like it does right now already. The polyfill part is in case they patch some breaking change and I get the blame, something unfortunately frequent in the npm world.
P.S. not a crawler means it's not the goal of this project to consume the internet at light speed ... its parser is meant for templates; the fact it works and doesn't throw or bail out on bigger templates is indeed part of the goal. The heavily benchmarked websites and pages are there to demo that this module won't crash ... parsing huge MBs of sites is not the main goal. If this is not clear in the README I can change it and make it more clear.
@WebReflection Hello, sorry for the delay. Since polyfills add around 200 kb to the 300 kb worker.js (can be less if minified), are you okay with implementing the Web Streams API instead? Otherwise, I'm okay with not merging and keeping my changes in my fork (unless you become interested or this feature gains more demand in the future, in which case I'll be happy to reopen the pull request).
What are the numbers once minified and gzipped? But yeah, 3/4th of the current size for polyfills doesn’t look too compelling to me … but what if I publish a branch myself with the streaming option as linkedom-stream so that both maintaining it and synchronizing it would be under this repo? I’d be ok with that, or even a dedicated linkedom/stream export so no branch dance is needed, and most code gets shared by default
> What are the numbers once minified and gzipped? But yeah, 3/4th of the current size for polyfills doesn’t look too compelling to me … but what if I publish a branch myself with the streaming option as linkedom-stream so that both maintaining it and synchronizing it would be under this repo? I’d be ok with that, or even a dedicated linkedom/stream export so no branch dance is needed, and most code gets shared by default
Actually, that's a good idea! A mono repo with multiple npm packages would work. Would you like to proceed?
Closed since I got no confirmation from the maintainer to proceed.
Introducing `DOMStream`: an efficient way for parsing very large HTML/XML documents!

Usage:

Using `DOMParser`: Time: ~11 mins, Memory: ~3 GB (I got a `heap out of memory` error but I was able to resume from the JavaScript debug console).

Using `DOMStream` 🔥: Time: ~27 s, Memory: ~13 MB.