Leonidas-from-XIV / node-xml2js

XML to JavaScript object converter.
MIT License
4.87k stars 601 forks

0.1 to 0.4 upgrade big slowdown #157

Open Redsandro opened 9 years ago

Redsandro commented 9 years ago

Hi people,

I have been using node-xml2js v0.1.14's xml2json.Parser().parseString without further options on XML files which kept getting bigger. Now these files up to ~2MB take nearly a second to translate to js.

So I thought, let's update and see if there are speed improvements. However, after I updated node-xml2js to v0.4.4, I noticed that instead of a speed increase, I get a big slowdown where it now takes 8 to 10 seconds.

Are there some special options necessary in the 0.4 version? Or is this module just not meant for bigger XML? Something else I'm missing?

For now I'm downgrading to the much faster version 0.1, and I'll continue my search for fast converters.

Leonidas-from-XIV commented 9 years ago

Are you using the 0.1 settings or 0.2 settings in xml2js?

Redsandro commented 9 years ago

I don't set any options/settings, so I am guessing it's the 0.1 settings since I am using 0.1.14. (?)

After using multiple measurements, I have to correct myself and say that 0.4.4 is 'only' about twice as slow as 0.1.14 in the above issue.

Leonidas-from-XIV commented 9 years ago

Then please try with the 0.1 settings in 0.4 (see the README) and let me know what the numbers look like.

Redsandro commented 9 years ago

Ok. I'll get back to you.

Redsandro commented 9 years ago

Trying 0.1.14... Done. 1998 ms
Trying 0.4.4 using 0.1 defaults... Done. 3298 ms
Trying 0.4.4 using 0.2 defaults... Done. 3836 ms

var path        = require('path');
var fs          = require('fs');
var q           = require('q');
    q.longStackSupport = true;

var xml2js01    = require('xml2js0114');
var xml2js04    = require('xml2js');
var p01         = new xml2js01.Parser();
var p04a        = new xml2js04.Parser(xml2js04.defaults["0.1"]);
var p04b        = new xml2js04.Parser(xml2js04.defaults["0.2"]);

var fileName    = 'soccer.xml';
var timeMs;

var pwd         = path.dirname(require.main.filename);
var file        = path.join(pwd, fileName);

var xml         = fs.readFileSync(file, 'utf8');

q(true)
.then(function() {
    console.log('Trying 0.1.14...');
    stopwatch(false);

    return q.nfcall(p01.parseString, xml);
})
.then(function(json) {
    console.log('Done.');
    stopwatch();

    return;
})
.then(function() {
    console.log('Trying 0.4.4 using 0.1 defaults...');
    stopwatch(false);

    return q.nfcall(p04a.parseString, xml);
})
.then(function(json) {
    console.log('Done.');
    stopwatch();

    return;
})
.then(function() {
    console.log('Trying 0.4.4 using 0.2 defaults...');
    stopwatch(false);

    return q.nfcall(p04b.parseString, xml);
})
.then(function(json) {
    console.log('Done.');
    stopwatch();

    return;
})
.fail(function(e){
    console.log(e.stack);
})
.done();

function stopwatch(log) {
    if (timeMs && log !== false)
        console.log((Date.now() - timeMs) + ' ms');

    timeMs = Date.now();
}

These timings are fluctuating +/- 200ms.
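Since single runs fluctuate this much, one way to get steadier numbers is to repeat each measurement and report the median. The following is only a minimal sketch: `medianTimeMs` and `workload` are hypothetical names, the workload is a stand-in rather than a real parser call, and it assumes the function under test finishes synchronously (a callback-based parser would need a wrapper).

```javascript
// Hypothetical helper: time a synchronous function several times and
// return the median in milliseconds, which is less sensitive to outliers
// than a single run.
function medianTimeMs(fn, runs) {
    var samples = [];
    for (var i = 0; i < runs; i++) {
        var start = process.hrtime.bigint();
        fn();
        var end = process.hrtime.bigint();
        samples.push(Number(end - start) / 1e6); // nanoseconds to milliseconds
    }
    samples.sort(function (a, b) { return a - b; });
    var mid = Math.floor(samples.length / 2);
    return samples.length % 2
        ? samples[mid]
        : (samples[mid - 1] + samples[mid]) / 2;
}

// Stand-in workload (an assumption); replace it with a call into the
// parser being measured.
function workload() {
    var sum = 0;
    for (var i = 0; i < 1e5; i++) sum += i;
    return sum;
}

console.log('median: ' + medianTimeMs(workload, 11).toFixed(3) + ' ms');
```

An odd number of runs keeps the median a real sample rather than an average of two.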

Leonidas-from-XIV commented 9 years ago

Hmm, yes, for now I'd recommend a faster XML parser. Maybe once I get to finish the htmlparser2 port it will get faster, but that one is not a big priority right now, sorry.

Redsandro commented 9 years ago

Do you know any other parsers? For now I put node-xml2json and node-xml2object in the same testing setup and they perform equally at best. There was one, simple-xml-to-json (iirc), that performed so badly that I removed it from the tests. I also tried a binary non-node parser called xml-json but it also took twice as long.

One would think that a node module using a binary component could outperform pure javascript modules but the availability is meager at best. So I am hoping I'm missing something. There's a lot of XML out there.

Leonidas-from-XIV commented 9 years ago

You could try node-xml2js-expat which was forked before I took over so it is at the state of xml2js 0.1.x but replaces saxjs with Expat which is written in C.

For me one of the priorities had been to go without native compilation, but if you need the speed, xml2js is admittedly not the best choice.

Redsandro commented 9 years ago

Thank you. I will add this to my tests.

And as general words of praise, especially for smaller XML files, xml2js has always been friendly to me.

Just curious, since expat has SAX bindings, could node-expat be a relatively hassle-free drop-in replacement to sax-js in xml2js? An option or switch in source to use a compiled parser might make some people happy; those working with XML files that started young and light but have grown old and ugly. ;)

Leonidas-from-XIV commented 9 years ago

It might be possible, I haven't given this any thought. You sure the node-expat binding exports the SAX API?

Redsandro commented 9 years ago

Actually I am not sure. There is no wiki and I cannot find any documentation.

But the title of the repo is:

node-xmpp/node-expat

libexpat XML SAX parser binding for node.js

And since SAX implies API (Simple Api for Xml) I blatantly assumed it did. :P

csimi commented 8 years ago

If anyone is still interested, EasySax seems to be a damn fast parser, written in JS. I had to fix some bugs in easysax myself and integrate it into xml2js but it sped up things pretty well.

sax x 57,812 ops/sec ±7.41% (78 runs sampled)
node-xml x 76,807 ops/sec ±1.75% (87 runs sampled)
libxmljs x 163,375 ops/sec ±2.58% (88 runs sampled)
node-expat x 201,663 ops/sec ±0.76% (84 runs sampled)
easysax x 828,169 ops/sec ±2.59% (86 runs sampled)

I feel there has to be a catch somewhere but I don't see it yet, other than the weird characters in the documentation.

Redsandro commented 8 years ago

Seems to be streaming. Doesn't convert to an object itself. (Read: not inline-replaceable with node-xml2js.) Am I wrong?

csimi commented 8 years ago

Yeah, this is regarding the discussion about the sax-js "backend" of xml2js and replacing it with node-expat. I've seen a drop in CPU usage of around 50% after replacing sax-js with easysax. Much of the time is still spent on building an object of the whole XML, so I'm thinking about just straight-up using the SAX parser. We're saturating our 100Mbps proxy VMs with gzipped XML files, so for parsing that much data every little speedup counts.

tflanagan commented 8 years ago

How does easysax (or any of these other libs) work with browserify?

Leonidas-from-XIV commented 8 years ago

@csimi Does your version with the easysax backend pass the unit tests? I am by no means married to sax-js, I just want to avoid a dependency that has to be compiled.

On the other hand, if you saturate your network connection, converting XML into an object is probably going to be inherently expensive. If you have high performance in mind, a streaming solution like raw SAX or similar might indeed be preferable.

tflanagan commented 8 years ago

@Leonidas-from-XIV, if you replace the parser with one that does not work 100% with browserify, then I will be forced to fork it.

Can we introduce an external hook rather than outright replacing it? Side effects could be huge - If one of these replacement libs is actually async, unlike sax-js (because of eventemitter), then a lot of people will come screaming.
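The synchronous-versus-asynchronous concern can be illustrated with a small sketch. The names `parseSync`, `parseAsync`, and `callerAssumingSync` are hypothetical and do not come from xml2js or any of the parsers discussed; the "parsers" here just return the input length so the example stays self-contained.

```javascript
// Hypothetical backend that invokes its callback before returning.
function parseSync(input, cb) {
    cb(null, input.length);
}

// Hypothetical backend that defers its callback to a later tick.
function parseAsync(input, cb) {
    setImmediate(function () {
        cb(null, input.length);
    });
}

// Caller written against the synchronous behaviour: it reads the result
// right after the call instead of continuing inside the callback.
function callerAssumingSync(parse) {
    var result;
    parse('<a/>', function (err, value) {
        result = value;
    });
    return result; // still undefined if the callback has not run yet
}

console.log(callerAssumingSync(parseSync));  // 4
console.log(callerAssumingSync(parseAsync)); // undefined
```

Code that only consumes the result inside the callback is unaffected by the switch; code shaped like `callerAssumingSync` silently breaks.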

Leonidas-from-XIV commented 8 years ago

@tflanagan I understand your issue and will try to take it into account.

Your complaint is actually why I would be against supporting multiple backends, because then the semantics of the library might change in unforeseen ways and some people will be surprised. Trying to have the same behaviour everywhere might end up an uphill battle for little benefit. I'd rather have one solid backend that works for everybody.

kyrylkov commented 8 years ago

@Redsandro Russian intro on easysax page actually says it's not streaming

kyrylkov commented 8 years ago

@csimi can you publish your xml2js + easysax repo?

kyrylkov commented 8 years ago

My run on a 46kB XML file:

sax x 66.98 ops/sec ±2.62% (50 runs sampled)
node-xml x 70.33 ops/sec ±2.56% (54 runs sampled)
libxmljs x 162 ops/sec ±2.46% (62 runs sampled)
node-expat x 116 ops/sec ±3.50% (59 runs sampled)
ltx x 236 ops/sec ±3.67% (60 runs sampled)
EasySax x 1,167 ops/sec ±4.62% (59 runs sampled)
Fastest is EasySax

Redsandro commented 8 years ago

@kyrylkov just for comparison, can you add RapidX2J to this comparison? It will illustrate using RapidXML backend.

kyrylkov commented 8 years ago

@Redsandro It doesn't seem to compile with Node.js 5.8.0? Does it support Node.js 4.x and 5.x?

Redsandro commented 8 years ago

@kyrylkov oops no idea actually. I'm running legacy (pre io.js-post-fork-merge) node.js for production reasons.

Redsandro commented 8 years ago

@kyrylkov Compiles on 4.x according to dev:

https://github.com/damirn/rapidx2j/issues/17

it should work with node 4.4.0:

git clone https://github.com/damirn/rapidx2j.git
npm install rapidx2j/

I just tried it on mac os x w/o issues

csimi commented 8 years ago

I'll try to put something usable together during the weekend. More benchmarks:

xml2js x 2.33 ops/sec ±27.49% (10 runs sampled)
xml2js easysax x 11.31 ops/sec ±19.67% (18 runs sampled)
rapidx2j x 28.87 ops/sec ±16.22% (30 runs sampled)
easysax x 129 ops/sec ±15.94% (30 runs sampled)

The easysax bench doesn't actually build a js object, just runs a SAX pass over my 215KiB XML test file (fastest mode, without parsing attributes, etc).

Rapidx2j is pretty fast but just how much time it takes to actually build a JS object is clearly visible.

I feel like EasySax is too fast (compared to expat for example) to be standards compliant. There has to be a catch somewhere.

Redsandro commented 8 years ago

how much time it takes to actually build a JS object is clearly visible.

How do you mean? I need to create that JS object anyhow, so I prefer the module takes care of that. In the case of rapidx2j, the js object is built in the compiled code. I think that's why it's so much faster. In the case of node-xml2js, it happens in javascript, which is slower. I believe this overhead will be added to EasySax.

Is there a (simple) way to use EasySax for getting the JS object out of xml? I'd like to see the time it takes including building the JS object. But in the docs it seems to just throw events on every element encountered (that's what I assumed was streaming, as SAX works like that too).

So far, rapidx2j seems the fastest but I cannot use it without forking, as the current module implements a custom lossy standard, changing caps and such.
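On the question of getting a JS object out of per-element events: the usual approach is to keep a stack of open elements and attach each finished element to its parent. Below is a rough sketch under loud assumptions. `toyScan` is a made-up, non-compliant event source (no attributes, comments, CDATA, or entities) standing in for a real SAX-style parser; it is not the EasySax or sax-js API. `buildObject` and the handler names are likewise hypothetical, and the resulting object shape is a simple lossy mapping, not the xml2js output format.

```javascript
// Toy event source (an assumption for illustration only): recognises
// plain open tags, close tags, self-closing tags and text.
function toyScan(xml, handlers) {
    var re = /<\/([^\s>]+)\s*>|<([^\s>\/]+)\s*(\/?)>|([^<]+)/g;
    var m;
    while ((m = re.exec(xml)) !== null) {
        if (m[1]) {
            handlers.onClose(m[1]);
        } else if (m[2]) {
            handlers.onOpen(m[2]);
            if (m[3]) handlers.onClose(m[2]); // self-closing tag
        } else if (m[4].trim()) {
            handlers.onText(m[4].trim());
        }
    }
}

// Build a plain object from open/text/close events using a stack.
// Assumes well-formed input; mismatched tags are not detected.
function buildObject(xml, scan) {
    var root = {};
    var stack = [{ node: root, text: '' }];
    var has = Object.prototype.hasOwnProperty;
    scan(xml, {
        onOpen: function () {
            stack.push({ node: {}, text: '' });
        },
        onText: function (text) {
            stack[stack.length - 1].text += text;
        },
        onClose: function (name) {
            var frame = stack.pop();
            var parent = stack[stack.length - 1].node;
            // Leaf elements become strings, elements with children objects.
            var value = Object.keys(frame.node).length ? frame.node : frame.text;
            // Repeated sibling names are collected into an array.
            if (!has.call(parent, name)) parent[name] = value;
            else if (Array.isArray(parent[name])) parent[name].push(value);
            else parent[name] = [parent[name], value];
        }
    });
    return root;
}

var xml = '<team><name>Ajax</name><player>A</player><player>B</player></team>';
console.log(JSON.stringify(buildObject(xml, toyScan)));
// prints {"team":{"name":"Ajax","player":["A","B"]}}
```

Timing `buildObject` separately from the scan alone would show how much of the total cost comes from object construction rather than tokenising, which is the overhead discussed above.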