WebReflection / linkedom

A triple-linked list based DOM implementation.
https://webreflection.medium.com/linkedom-a-jsdom-alternative-53dd8f699311
ISC License

Support for streams #174

Closed: iMrDJAi closed this issue 1 year ago

iMrDJAi commented 1 year ago

In order to reduce memory usage and improve performance, it would be a smart move to support streams, allowing very large inputs to be processed chunk by chunk. That way we don't have to load the whole thing into memory, and we may also prevent heap out-of-memory errors.

htmlparser2 supports streams out of the box, so the implementation shouldn't be hard at all, but I have some concerns:

Another thing: outputs as streams (https://github.com/WebReflection/linkedom/issues/98). This one would be trickier! A new method would have to be introduced to serve the purpose. The idea: we treat nodes as individual documents instead of appending them to the DOM, we keep processing their child nodes until we reach the close tag, then we push them into the stream flow (not necessarily a stream; it could be a callback or an EventEmitter). We only do that for nodes that meet criteria specified by the user — for example, we pass an array of tag names or attributes to catch (I can't see it working with selectors, but that would be neat!) — and we ignore everything else. This feels more like watching nodes than streaming them, in my opinion.
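To make the "watching" idea concrete, here is a minimal, dependency-free sketch (all names are hypothetical, not linkedom's API): a watcher buffers incoming chunks, scans for complete elements with a watched tag name, and hands each full segment to a callback. A real implementation would sit on htmlparser2's streaming parser and handle nesting and attributes properly; this toy assumes the watched element never nests inside itself and that no other tag name starts with the same prefix.

```javascript
// Hypothetical sketch of "watching nodes" in a chunked stream: emit
// each complete <tag>...</tag> segment via a callback, keeping only a
// partial tail in memory between chunks.
function createWatcher(tag, onSegment) {
  let buffer = '';
  const open = '<' + tag;          // also matches "<hotel attr=...>"
  const close = '</' + tag + '>';
  return {
    write(chunk) {
      buffer += chunk;
      let start;
      while ((start = buffer.indexOf(open)) !== -1) {
        const end = buffer.indexOf(close, start);
        if (end === -1) {
          // element not complete yet: keep the partial tail, wait for more
          buffer = buffer.slice(start);
          return;
        }
        onSegment(buffer.slice(start, end + close.length));
        buffer = buffer.slice(end + close.length);
      }
      // no watched element in sight; a real implementation would also
      // trim the buffer here to keep memory bounded
    },
  };
}
```

Feeding it chunks split mid-tag still yields whole segments, which is the point: the consumer never sees a partial element.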

Why not just use htmlparser2 directly instead? Well, we need processed output with DOM methods and properties, and the ability to access child nodes, instead of the raw output that htmlparser2 returns; otherwise, what's the point?

I'm actually willing to contribute, but I need your feedback first; you may have a better idea regarding this. Thank you.

WebReflection commented 1 year ago

ehr ... differently from JSDOM, this module parses pages with dozens of MB of HTML without issues (it's part of the test cases), and it needs to link nodes to work for any string representation, as explained here.

my feedback there is the same I'd give here: I don't understand the use case for this or what you are trying to do.

partial content and selectors, methods, or anything else would be broken anyway, and there are ways to increase the heap size, even if I've never needed it.

do you have a real-world case where this module breaks, or a reason to crawl that page with LinkeDOM instead of using a real browser?

iMrDJAi commented 1 year ago

> ehr ... differently from JSDOM, this module parses pages with dozens of MB of HTML without issues (it's part of the test cases), and it needs to link nodes to work for any string representation, as explained here.

@WebReflection At the beginning I used JSDOM, and it struggled to process anything larger than 10~20 MB. I switched to LinkeDOM and immediately noticed a huge improvement; it simply uses less memory than JSDOM.

> there are ways to increase the heap size, even if I've never needed it.

However, that wasn't enough: my Linux server with 8 GB of RAM hit almost 100% memory usage while trying to parse an ~80 MB file with LinkeDOM. I kept hitting heap out-of-memory errors no matter what; increasing the heap size didn't help at all, since Node was already eating up more than 7 GB of RAM! I believe that's the limit and I can't go further than that.

> my feedback there is the same I'd give here: I don't understand the use case for this or what you are trying to do.

Lemme explain: as we all know, Node.js and the JavaScript world aren't well served when it comes to XML, because JSON is simply superior, and I totally agree. However, some APIs still use XML, and in order to write clients for them I need a good, performant parser that outputs something I can work with, like a DOM object.
In my case, I'm working with a hotel reservation API, and in order to download its complete list of hotels I have to deal with a really huge amount of XML; I'm talking about tens of MBs!

> do you have a real-world case where this module breaks, or a reason to crawl that page with LinkeDOM instead of using a real browser?

That was my case. Using an embedded browser like Puppeteer isn't practical at all; it would require even more memory. Also, I don't crawl pages or work with HTML in the first place; I process XML and extract data from it using query selectors, and that worked for smaller files.

> partial content and selectors, methods, or anything else would be broken anyway

Not if you create a new document for each piece of partial content. Remember that we don't need to work from the root; we only need to use selectors within the partial content itself.
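A minimal sketch of that point, with a naive stand-in for the real parser: in actual code each segment would go through linkedom's `parseHTML` and be queried with `document.querySelector`; here a toy regex helper (first match only, no nesting, no CDATA) just shows that every segment is queried independently, with no root document involved. All data and helper names are hypothetical.

```javascript
// Each extracted segment becomes its own tiny "document"; no root
// document is ever needed. In real code this would be:
//   const { document } = parseHTML(segment);  // from linkedom
//   document.querySelector('name').textContent;
// The helper below is a naive stand-in used only for illustration.
function textOf(segment, tag) {
  const m = segment.match(new RegExp('<' + tag + '[^>]*>([^<]*)</' + tag + '>'));
  return m ? m[1] : null;
}

// Hypothetical segments, as an upstream stream watcher might emit them:
const segments = [
  '<hotel id="1"><name>Alpha</name><city>Oran</city></hotel>',
  '<hotel id="2"><name>Beta</name><city>Algiers</city></hotel>',
];

const names = segments.map(s => textOf(s, 'name'));
// names is now ['Alpha', 'Beta']
```

Each segment is self-contained, so selector logic works the same as it would on a full document — just scoped to the fragment.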

WebReflection commented 1 year ago

I'm afraid I don't see this module as a good answer to your use case ... LinkeDOM is for serving and creating pages; it's not a crawler and it doesn't need to optimize for crawling ... streaming might be interesting only for documents, but then again, streaming in that sense would be better off via templates, with no DOM logic around whatsoever, as it would be useless.

If htmlparser2 supports streams out of the box, I suggest you just use that and, eventually, implement querySelectorAll the way it's done in here ... maybe I can factor that out as a third-party module so it can be reused by other projects? Otherwise this looks to me like a won't-fix, both because I have no time for this and because it seems complicated enough to throw away all the performance decisions made around this module.

Closing but happy to hear thoughts about what I've just written.

iMrDJAi commented 1 year ago

> I'm afraid I don't see this module as a good answer to your use case

@WebReflection Indeed, and this is exactly why I'm willing to make changes to it and add streams support.

> streaming in that sense would be better off via templates, with no DOM logic around whatsoever, as it would be useless.

> If htmlparser2 supports streams out of the box, I suggest you just use that and, eventually, implement querySelectorAll the way it's done in here

Well, that's probably right; in fact, it is possible to extract segments from the stream as XML/HTML strings and pass them to LinkeDOM one by one. However, that approach requires htmlparser2 to process the same segment twice, and I can't see why this wouldn't be a built-in feature instead, as that would make the code reusable.

> maybe I can factor that out as a third-party module so it can be reused by other projects? Otherwise this looks to me like a won't-fix, both because I have no time for this and because it seems complicated enough to throw away all the performance decisions made around this module.

Actually, you don't have to do anything; I'll try it myself, then I'll open a pull request once it's done. In my opinion, this wouldn't affect the performance decisions at all; the concept of triple-linked lists is a plus even for partial content.

If you don't think so, it could be a third-party library that extends this one instead, but that wouldn't be ideal, as it would require maintaining a separate library.

WebReflection commented 1 year ago

Happy to take a look at the possible pull request, but if it affects parsing time in some way, please be aware it might not be welcome.

iMrDJAi commented 1 year ago

> Happy to take a look at the possible pull request, but if it affects parsing time in some way, please be aware it might not be welcome.

No problem at all! Actually, markupLanguage as a stream is very minimal, and I think I'll go ahead and implement it inside parse-from-string.js. Streamable outputs are another story, and I'll attempt to create a new function for them.

iMrDJAi commented 1 year ago

@WebReflection 👀

[screenshots attached]

WebReflection commented 1 year ago

it's just Sunday mate ... I'll have a look tomorrow at some point ... bear with me