NaturalIntelligence / fast-xml-parser

Validate XML, Parse XML and Build XML rapidly without C/C++ based libraries and no callback.
https://naturalintelligence.github.io/fast-xml-parser/
MIT License
2.49k stars 302 forks source link

Carriage Return characters (\r) are incorrectly replaced with Newline characters (\n) #512

Open ericdrobinson opened 1 year ago

ericdrobinson commented 1 year ago

Description

Carriage Return characters (\r) found within parsed XML are incorrectly converted to newline characters (\n).

I did a quick scan of the source code and found a likely culprit:

const parseXml = function(xmlData) {
  xmlData = xmlData.replace(/\r\n?/g, "\n"); //TODO: remove this line

That pretty clearly replaces any \r character (and possibly \r\n pair) with \n.

The online tool does not exhibit this issue because browser node access APIs encode Carriage Return "strings" (the character \ followed by r) as \\r. The regular expression no longer matches. See:

> "1) \r 2) \\r 3) \n 4) \\n 5) \r\n 6) \\r\\n".replace(/\r\n?/g, "\n")
  '1) \n 2) \\r 3) \n 4) \\n 5) \n 6) \\r\\n'

Input

The XML that exhibits this issue is of the form:

<properties object="" engine="">
    <property type="string" name="x" state="changed">
        <![CDATA[This is a carriage return \r...]]>
    </property>
    <property type="string" name="y" state="changed">
        <![CDATA[\r]]>
    </property>
</properties>

Code

The code is pretty straightforward.

const XML_OPTIONS_NO_TAG_PARSE: fastXMLParser.X2jOptionsOptional = {
    attributeNamePrefix: "@",
    ignoreAttributes: false,
    parseAttributeValue: false,
    parseTagValue: false,
    textNodeName: "#value",
};
const XML_PARSER_NO_TAG_PARSE = new fastXMLParser.XMLParser(XML_OPTIONS_NO_TAG_PARSE);

// ...

const parsed = XML_PARSER_NO_TAG_PARSE.parse(xmlData);

After that code runs, the parsed text node content has \n instead of the expected \r.

Output

Running the above results in the following JSON:

{
    "properties": {
        "property": [
            {
                "#value": "This is a carriage return \n...",
                "@type": "string",
                "@name": "x",
                "@state": "changed"
            },
            {
                "#value": "\n",
                "@type": "string",
                "@name": "y",
                "@state": "changed"
            },
        ],
        "@object": "",
        "@engine": ""
    }
}

Expected Data

I expect the following output:

{
    "properties": {
        "property": [
            {
                "#value": "This is a carriage return \r...",
                "@type": "string",
                "@name": "x",
                "@state": "changed"
            },
            {
                "#value": "\r",
                "@type": "string",
                "@name": "y",
                "@state": "changed"
            },
        ],
        "@object": "",
        "@engine": ""
    }
}

Would you like to work on this issue?

github-actions[bot] commented 1 year ago

I'm glad you find this repository helpful. I'll try to address your issue ASAP. You can watch the repo for new changes or star it.

ericdrobinson commented 1 year ago

I have just confirmed that this line:

const parseXml = function(xmlData) {
  xmlData = xmlData.replace(/\r\n?/g, "\n"); //TODO: remove this line

is responsible for the bug. (Verified by stepping through with a debugger.)

ericdrobinson commented 1 year ago

I have also determined that the reason the online tool doesn't exhibit this issue is because the XML string returned by browser node access APIs (used internally by CodeMirror) returns carriage return characters as \\r instead of \r.

Depending upon the purpose of the problematic line of code that I identified, this may actually cause a bug in the online tool.

Regardless, this is clearly not operating as expected.

nkappler commented 4 months ago

this seems to only happen when trimValues is turned on, at least in my use case, I get a lot more new lines from the text node when I turn it on.

amitguptagwl commented 4 months ago

can you check with v5 if it works?

nkappler commented 4 months ago

is v5 not available as an npm module? How do I install it and where do I get it?

amitguptagwl commented 4 months ago

https://github.com/NaturalIntelligence/fast-xml-parser/blob/master/docs/v5/1.%20Getting%20Started.md

yelly commented 2 months ago

This issue is also affecting us as we need to parse XML files which do not encode carriage returns. We can't really move to v5 because we are also building XML files which isn't supported yet (I guess we could parse with v5 and build with v4 but that seems a bit cumbersome). I know that replacing all CRLFs or CRs with LFs is in the XML spec, but it would be nice to be able to turn this off (at least within values, between tags newlines are ignored anyway). In the meantime as a workaround we are manually encoding all CRs before calling parse and then decoding them in the tagValueProcessor which may lead to bugs in some strange edge cases.

amitguptagwl commented 2 months ago

@yelly , I can understand the trouble you're experiencing. But unfortunately, I'll not be able to do any changes in the code for next 2-3 months. I would take v5 ahead for new changes and build the other components like validator, builder etc for v5. I just wanted to take a feedback on v5 before I do any big change.

livehigh commented 1 month ago

will it be fixed in v4?