CDATA scanner not scanning content properly

akshay-kr commented 3 weeks ago

It seems the CDATA scanning is behaving different after below commit, https://github.com/HtmlUnit/htmlunit-neko/commit/49a31c089482088aca57facb20e1a15792cfc3bd

For example if it has to scan <div></div> it just scans till <div and then append CDATA closing <div]]. Rest of the contents are not scanned.

I guess it is due to addition of below condition,

else if (c == '>') {
    // don't add the ]] to the buffer
    return false;
}

https://github.com/HtmlUnit/htmlunit-neko/commit/49a31c089482088aca57facb20e1a15792cfc3bd#diff-c0ef5d78bafd2e2d032b106fd0ca50086966f97030bd6b6f4403d8f5ebe39f87R2490

I am using version 3.11.1

Any help here will be really appreciated.

rbri commented 3 weeks ago

Hi @akshay-kr,

correct, the cdata handling was changed to be in sync with the spec. You can read a bit about the reasoning for the changes in #102.

If you think there is an error in the parser please provide a small html sample. You can open this samples in your browser and then have a look at the DOM tree with the developer tools. If neko generates a different one i will try to fix it.

And curious to learn more about your use case....

akshay-kr commented 3 weeks ago

Hi @rbri Thanks for replying.

So basically if I have this as an input <![CDATA[<div></div>]]> then after CDATA scanning I get the output as <![CDATA[<div]]>]]>.

The part after the first > inside CDATA section is ignored.

We allow users on our app to enter code snippets and we store these code snippets after sanitising it with Antisamy plugin. Now for example if user enters <div></div> as a code snippet only <div is stored after sanitisation.

It used to work properly before this committ.

Please do let me know if this information works for you else I will try to see if I can provide additional information.

rbri commented 3 weeks ago

So basically if I have this as an input <![CDATA[
]]> then after CDATA scanning I get the output as <![CDATA[<div]]>]]>.

I still think this is correct. Please play with the test-cdata-close-early-empty-tag.html file in your browser. The browser parses the file exactly the same way.

We allow users on our app to enter code snippets and we store these code snippets after sanitising it with Antisamy plugin. Now for example if user enters
as a code snippet only <div is stored after sanitisation.

Please have a look at the paper attached to #102 for more details.

rbri commented 3 weeks ago

@akshay-kr do you still think i have to fix something here?

akshay-kr commented 3 weeks ago

@rbri I am now understanding what you are trying to explain. Actually we are using Antisamy plugin to parse XML with content inside CDATA tag which used to work before. But it seems now this library is moving towards only html parsing so the same won't be supported anymore.

For example like this xml,

<xt:c-code xt:name="code" xt:version="1" xt:id="15ae0cc7-ded7-4a74-97b8-d66238d3c177"><xt:parameter xt:name="language">html</xt:parameter><xt:text-body><![CDATA[<div></div>]]></xt:text-body></xt:c-code>

We are parsing this XML and specifically the content inside CDATA and then storing it. Later when viewing we extract the content inside CDATA and render it on the web page.

rbri commented 3 weeks ago

Actually we are using Antisamy plugin to parse XML with content inside CDATA tag which used to work before

Maybe we have to discuss this with the Antisamy peoples.

akshay-kr commented 2 weeks ago

Maybe we have to discuss this with the Antisamy peoples.

Sure. Raised an issue on Antisamy repo for the same https://github.com/nahsra/antisamy/issues/531

rbri commented 2 weeks ago

@akshay-kr had another look at this (from the HtmlUnit point of view).

HtmlUnit uses neko to parse html and XHtml documents - and in case of XHtml the current behaviour of the parser is wrong.

Current plan: add another flag to disable the behaviour - but i have to think a bit about that.

akshay-kr commented 2 weeks ago

Current plan: add another flag to disable the behaviour - but i have to think a bit about that.

Yup a flag to disable the behaviour will work. Or if there is a way user can pass what sort of document they are trying to parse then we can have separate logic for HTML and XHTML. Though I agree a short solve will be to disable the behaviour by a flag.

rbri commented 2 weeks ago

have added the feature but i fear there is more to do if you like to use it from antisamy

akshay-kr commented 2 weeks ago

@rbri Thanks a lot for adding this feature flag. Is it possible to back port this to 3.11.x branch as well? I am happy to create a PR for that. We just started upgrading to 3.11.1 version and upgrading directly to 4.5.0 will be time consuming I guess since many things have changed and we need to validate if anything is breaking or not.

akshay-kr commented 2 weeks ago

@rbri Created a PR for 3_11_3 branch https://github.com/HtmlUnit/htmlunit-neko/pull/126. Please review.

rbri commented 2 weeks ago

@akshay-kr for my points about such fix releases please have a look at my comment in issur #123

And thanks for that pr.

akshay-kr commented 2 weeks ago

@rbri I checked the comments on https://github.com/HtmlUnit/htmlunit-neko/issues/123. I will check internally regarding sponsoring thing and will inform you.

HtmlUnit / htmlunit-neko

CDATA scanner not scanning content properly #125