NaturalIntelligence / fast-xml-parser

Validate XML, Parse XML and Build XML rapidly without C/C++ based libraries and no callback.
https://naturalintelligence.github.io/fast-xml-parser/
MIT License

unreliable for parsing html #629

Open hulkish opened 6 months ago

hulkish commented 6 months ago

Description

I'm finding it unsafe to rely on this library for dependable HTML parsing.

Input

create test-fxp.js:

const util = require('node:util');
const {XMLParser, XMLValidator} = require('fast-xml-parser');

(async () => {
  for (const url of [
    'https://nytimes.com',
    'https://cnn.com',
    'https://nypost.com',
    'https://reddit.com',
    'https://github.com'
  ]) {
    const html = await (await fetch(url)).text();
    const parsingOptions = {
      ignoreAttributes: false,
      preserveOrder: true,
      unpairedTags: ['hr', 'br', 'link', 'meta'],
      stopNodes: ['*.pre', '*.script'],
      processEntities: true,
      htmlEntities: true,
    };
    const parser = new XMLParser(parsingOptions);
    try {
      const result = await parser.parse(html);
      console.log(`Success: ${url}:`, util.inspect(result, { depth: 1, colors: true }));
    } catch (err) {
      console.error(`Fail: ${url}:`, err);
    }
  }
})();

Output

run:

node test-fxp.js
Fail: https://nytimes.com: Error: Unexpected end of script
    at OrderedObjParser.parseXml (<pwd>/node_modules/fast-xml-parser/src/xmlparser/OrderedObjParser.js:323:31)
    at XMLParser.parse (<pwd>/node_modules/fast-xml-parser/src/xmlparser/XMLParser.js:35:48)
    at <pwd>/test-fxp.js:23:35
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
Fail: https://cnn.com: Error: StopNode is not closed.
    at findClosingIndex (<pwd>/node_modules/fast-xml-parser/src/xmlparser/OrderedObjParser.js:489:11)
    at OrderedObjParser.readStopNodeData (<pwd>/node_modules/fast-xml-parser/src/xmlparser/OrderedObjParser.js:558:30)
    at OrderedObjParser.parseXml (<pwd>/node_modules/fast-xml-parser/src/xmlparser/OrderedObjParser.js:322:33)
    at XMLParser.parse (<pwd>/node_modules/fast-xml-parser/src/xmlparser/XMLParser.js:35:48)
    at <pwd>/test-fxp.js:23:35
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
Fail: https://nypost.com: Error: Unexpected end of script
    at OrderedObjParser.parseXml (<pwd>/node_modules/fast-xml-parser/src/xmlparser/OrderedObjParser.js:323:31)
    at XMLParser.parse (<pwd>/node_modules/fast-xml-parser/src/xmlparser/XMLParser.js:35:48)
    at <pwd>/test-fxp.js:23:35
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
Success: https://reddit.com: [ { '!doctype': [Array] }, { p: [Array] }, { p: [Array] } ]
Success: https://github.com: [ { html: [Array], ':@': [Object] } ]

Expected data

Some way to handle broken HTML without failing the entire parsing process.
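
As a caller-side stopgap, the request above can be approximated by isolating each document so that one failure is recorded as data rather than aborting the run. The sketch below is only an illustration: `parseAll` and `stubParser` are hypothetical names, not part of fast-xml-parser, and the stub exists solely so the snippet runs without the library installed. It does not recover a partial tree from a broken document; it only keeps one bad page from hiding the results of the others.

```javascript
// Hypothetical helper (not part of fast-xml-parser): parse each document in
// isolation and report failures as data instead of letting them propagate.
function parseAll(parser, documents) {
  const report = { parsed: [], failed: [] };
  for (const [name, markup] of Object.entries(documents)) {
    try {
      // parser.parse() is synchronous, so no await is needed here.
      report.parsed.push({ name, result: parser.parse(markup) });
    } catch (err) {
      // Keep only the message; the full stack is still on err if needed.
      report.failed.push({ name, message: err.message });
    }
  }
  return report;
}

// Stand-in parser so the sketch runs on its own (an assumption for the demo).
// With the real library, pass `new XMLParser(parsingOptions)` instead.
const stubParser = {
  parse(markup) {
    if (!markup.includes('</script>')) {
      throw new Error('Unexpected end of script');
    }
    return [{ html: [] }];
  },
};

const report = parseAll(stubParser, {
  good: '<html><script>1</script></html>',
  bad: '<html><script>1</html>',
});
console.log(report.failed);
```

With the script from the Input section, the keys of `documents` would be the URLs and the values the fetched HTML, so the failing sites end up in `report.failed` alongside their error messages.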


github-actions[bot] commented 6 months ago

We're glad you find this project helpful. We'll try to address this issue ASAP. You can visit https://solothought.com to learn about recent features. Don't forget to star this repo.

amitguptagwl commented 6 months ago

@hulkish This library is not suitable for handling all HTML scenarios. However, we're planning to handle some of them in the next version.