evanderkoogh / node-sitemap-stream-parser

A streaming parser for sitemap files. Is able to deal with deeply nested sitemaps with 100+ million urls in them.
Apache License 2.0
38 stars 18 forks

High CPU usage with nested sitemaps #9

Open fdammassa opened 6 years ago

fdammassa commented 6 years ago

I'm experiencing very high CPU utilization (100%) with large nested sitemaps.

The url callback is very simple: it only increments a counter.

Could this be related to the "blocking" nature of the url (and sitemap) callbacks? If you point me in the right direction, I can contribute to the project.

As an example, you could try this sitemap: https://www.walmart.com/sitemap_ip.xml

evanderkoogh commented 6 years ago

Hey,

I picked a default of 4 sitemaps being parsed in parallel (https://github.com/evanderkoogh/node-sitemap-stream-parser/blob/33ba4d9d958783e6f4598ab64e6ad0644da3d22f/index.coffee#L64).

I would play around with that setting. If another value improves things for you, it would be great to make it configurable. Let me know if you need any more pointers on that.
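To make the idea of a configurable parallelism limit concrete, here is a minimal sketch in plain Node.js. It is not the library's code: `runWithLimit` and its `limit` parameter are hypothetical names, and the sketch only assumes that each sitemap parse can be wrapped in a function returning a promise.

```javascript
// Hypothetical helper, not part of sitemap-stream-parser.
// Runs the given async tasks with at most `limit` of them in flight at once.
// `tasks` is an array of functions, each returning a value or a promise.
function runWithLimit(tasks, limit) {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError('limit must be a positive integer');
  }
  return new Promise((resolve, reject) => {
    const results = new Array(tasks.length);
    let next = 0;       // index of the next task to start
    let active = 0;     // number of tasks currently in flight
    let failed = false; // set once any task rejects

    function launch() {
      if (failed) return;
      if (next === tasks.length && active === 0) {
        resolve(results);
        return;
      }
      // Start tasks until the limit is reached or none are left.
      while (active < limit && next < tasks.length) {
        const index = next++;
        active++;
        Promise.resolve()
          .then(() => tasks[index]())
          .then((value) => {
            results[index] = value;
            active--;
            launch(); // a slot freed up, try to start the next task
          })
          .catch((err) => {
            failed = true;
            reject(err);
          });
      }
    }

    launch();
  });
}
```

With something like this, the hard-coded 4 would become whatever value the caller passes as `limit`. A lower value should reduce how many XML documents are parsed at the same time, though it will not make parsing any single document cheaper.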

knoxcard commented 6 years ago

How about implementing process.nextTick() inside the loop callback?
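For reference, a rough sketch of what yielding inside a processing loop could look like. Note that it uses `setImmediate` rather than `process.nextTick`: the `nextTick` queue is drained before the event loop continues, so it would not give pending I/O or timers a chance to run. `processInChunks`, `handleItem`, and `chunkSize` are hypothetical names for illustration, not part of the library.

```javascript
// Hypothetical helper: processes `items` in slices of `chunkSize`, yielding
// to the event loop between slices so timers and I/O callbacks can run.
// Resolves with the number of items processed.
function processInChunks(items, handleItem, chunkSize = 1000) {
  return new Promise((resolve, reject) => {
    let index = 0;

    function runChunk() {
      try {
        const end = Math.min(index + chunkSize, items.length);
        for (; index < end; index++) {
          handleItem(items[index]);
        }
      } catch (err) {
        reject(err);
        return;
      }
      if (index < items.length) {
        // Schedule the next slice after pending I/O callbacks have run.
        setImmediate(runChunk);
      } else {
        resolve(index);
      }
    }

    runChunk();
  });
}
```

This kind of yielding keeps the process responsive, but the total amount of CPU work stays the same, so on its own it would not be expected to bring utilization down.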

evanderkoogh commented 6 years ago

Hey @fdammassa, thanks for opening the issue. I finally had some time to investigate, and almost all of the time is spent parsing XML. Unfortunately, parsing XML is extremely CPU-intensive, and these sitemaps are many MBs of XML.

If you can give me a bit more context about what you are trying to do, I might be able to help more.