Leonidas-from-XIV / node-xml2js

XML to JavaScript object converter.
MIT License

tried to load kanjidic2.xml, no errors, no data either #140

Closed by Pomax 6 years ago

Pomax commented 10 years ago

As per the README, I tried to run this:

var fs = require("fs");
var parser = require('xml2js');
fs.readFile('kanjidic2.xml', function(err, data) {
    parser.parseString(data, function (err, result) {
        console.dir(result);
        console.log('Done');
    });
});

on this: ftp://ftp.monash.edu.au/pub/nihongo/kanjidic2.xml.gz

The result was

undefined
Done

That doesn't seem right.

jcsahnwaldt commented 6 years ago

@Pomax You should check err before you access result. In this case, console.log(err) probably would have printed this:

Error: Text data outside of root node.
Line: 327
Column: 1
Char: ]
    at error (/Users/jcsahnwaldt/git/digitalHub/node_modules/sax/lib/sax.js:651:10)
    at strictFail (/Users/jcsahnwaldt/git/digitalHub/node_modules/sax/lib/sax.js:677:7)
    at SAXParser.write (/Users/jcsahnwaldt/git/digitalHub/node_modules/sax/lib/sax.js:1035:15)
    at Parser.exports.Parser.Parser.parseString (/Users/jcsahnwaldt/git/digitalHub/node_modules/xml2js/lib/parser.js:322:31)
    at Parser.parseString (/Users/jcsahnwaldt/git/digitalHub/node_modules/xml2js/lib/parser.js:5:59)
    at Object.<anonymous> (/Users/jcsahnwaldt/git/digitalHub/foo.js:34:8)
    at Module._compile (module.js:569:30)
    at Object.Module._extensions..js (module.js:580:10)
    at Module.load (module.js:503:32)
    at tryModuleLoad (module.js:466:12)
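
For illustration, the advice to check err before touching result could be sketched as below. This is a minimal sketch, not code from the thread or the README: parseChecked is a hypothetical helper name, and the parser function is passed in as an argument on the assumption that it has the same (xml, callback) shape as xml2js's parseString.

```javascript
// Hypothetical helper (not part of xml2js): wraps any callback-style parser
// with the (xml, callback) shape and makes sure the error is looked at
// before the result is used.
function parseChecked(xml, parseString, done) {
    parseString(xml, function (err, result) {
        if (err) {
            // On failure the result is not meaningful, so only the error
            // is passed on.
            return done(err);
        }
        done(null, result);
    });
}

// Usage with xml2js (assumes the package is installed):
//
// var fs = require("fs");
// var xml2js = require("xml2js");
// fs.readFile("kanjidic2.xml", function (readErr, data) {
//     if (readErr) { return console.error(readErr); }
//     parseChecked(data, xml2js.parseString, function (err, result) {
//         if (err) { return console.error(err); }
//         console.dir(result);
//         console.log("Done");
//     });
// });
```

With a check like this in place, the original script would presumably have reported the parse error instead of printing undefined.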

jcsahnwaldt commented 6 years ago

This is a bug in sax-js. See https://github.com/isaacs/sax-js/issues/236

Your XML contains a DTD with comments that contain closing square brackets. For some reason, sax-js gets confused by these closing square brackets.

When I removed these closing square brackets from the comments, I got a different error:

Error: Max buffer length exceeded: doctype
Line: 535159
Column: 0
Char: 
    at error (.../sax.js:651:10)
    at checkBufferLength (.../sax.js:125:13)
    at SAXParser.write (.../sax.js:1505:7)
    ...

When I removed all the comments from the DTD (about 280 comment lines between <!DOCTYPE kanjidic2 [ and ]>), xml2js could parse the file and produced this result:

{ kanjidic2: 
   { header: 
      [ { file_version: [ '4' ],
          database_version: [ '2018-160' ],
          date_of_creation: [ '2018-06-09' ] } ],
     character: 
      [ { literal: [ '亜' ],
          codepoint: 
           [ { cp_value: 
                [ { _: '4e9c', '$': { cp_type: 'ucs' } },
                  { _: '16-01', '$': { cp_type: 'jis208' } } ] } ],
                  ...

(It goes on like this for thousands of lines...)
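
The manual workaround described above (removing the comments between <!DOCTYPE kanjidic2 [ and ]>) could be automated roughly as follows. This is only a sketch under stated assumptions: stripDoctypeComments is a hypothetical helper name, it assumes the internal subset is closed by a contiguous "]>", and it assumes that sequence does not occur inside quoted entity values. It has not been run against kanjidic2.xml itself.

```javascript
// Hypothetical helper: removes <!-- ... --> comments from the internal DTD
// subset (the part between "<!DOCTYPE name [" and "]>") and leaves the rest
// of the document, including comments in the body, untouched.
function stripDoctypeComments(xml) {
    var start = xml.indexOf("<!DOCTYPE");
    if (start === -1) { return xml; }

    var open = xml.indexOf("[", start);
    var firstClose = xml.indexOf(">", start);
    // No internal subset: the DOCTYPE closes before any "[" appears.
    if (open === -1 || (firstClose !== -1 && firstClose < open)) { return xml; }

    var out = xml.slice(0, open + 1);
    var pos = open + 1;
    while (true) {
        var comment = xml.indexOf("<!--", pos);
        var end = xml.indexOf("]>", pos);
        if (end === -1) { return xml; } // malformed input: leave it alone

        if (comment === -1 || comment > end) {
            // No further comments before the subset closes; copy the rest.
            return out + xml.slice(pos);
        }

        // Keep the text before the comment, then skip the comment itself.
        // A "]" or "]>" inside the comment is skipped along with it.
        out += xml.slice(pos, comment);
        var commentEnd = xml.indexOf("-->", comment + 4);
        if (commentEnd === -1) { return xml; } // unterminated comment
        pos = commentEnd + 3;
    }
}
```

The file contents would have to be converted to a string first (for example with data.toString("utf8")) before being passed through a helper like this and then on to parseString.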

Pomax commented 6 years ago

Probably worth mentioning in a gotcha section of the README.md or the like. I haven't needed this for four years now, but maybe someone else has run into it since.