HtmlUnit / htmlunit-neko

HtmlUnit adaptation of NekoHtml
Apache License 2.0
17 stars 15 forks source link

Support for missing method in xerces #88

Closed garg23neha closed 7 months ago

garg23neha commented 9 months ago

@rbri In one of my company project, we are using getNonNormalizedValue() method from Xerces. But in the new implementation htmlunit-neko , this method is not present in XMLAttributesImpl.java.

Can you help me in understanding is there any alternate way for supporting getNonNormalizedValue() method?

rbri commented 9 months ago

@garg23neha will have a look...

can you please explain the use case a bit more

garg23neha commented 9 months ago

@rbri we are using it in a function where we want Xerces to not resolve entities when it parses attributes

rbri commented 9 months ago

we are using it in a function where we want Xerces to not resolve entities when it parses attributes

Ok, the getNonNormalizedValue() and most of the implementation was removed, because i have stripped neko down to the needs of HtmlUnit (and some other projects i'm aware of). And during the many performance optimizations we have removed nearly the whole support for this feature.

Have to think about a bit more....

rbri commented 9 months ago

@garg23neha do you need both versions of the attribute value or only the one one without the enriries resolved? And what do you do with the enresolvrd entities from the attributes?

garg23neha commented 9 months ago

@rbri We need both version of attribute value i.e getValue(int) and getNonNormalizedValue(int) both are needed. we don't resolve entities (  and so on) while printing attribute values. Is there a different way to disable this in neko-htmlunit now? We have a writer on top to exists to ensure we preserve CDATA blocks, without replacing them with entity escape sequences instead (simply because the resulting output for a code block is much nicer, and a little shorter, not for any other reason). Is there a way to enable this in neko-htmlunit now?

garg23neha commented 9 months ago

@rbri any inputs would be highly appreciated.

rbri commented 9 months ago

@rschwietzke what do you think about this - any idea how to do this without making the monster again slower?

rschwietzke commented 9 months ago

It depends on how we store that. If we have a second map for that, we certainly create more overhead again (at least 4 bytes per node without the map itself) which comes in at between about 24 to 100 bytes at least. How about making that an optional feature so that we can either use impl A or B of the attribute? But I can also imagine a specialist map/list for that (every second entry is a normalized attribute so to speak), which might not change things a lot or things like the first half is normalized, the second half is the original one.

P.S: The need to have an attribute being a DOMNode is one of the performance oddities we have. If not mistaken, that is due to XPath requirements. But that is rather HtmlUnit's than Neko's fault.

garg23neha commented 8 months ago

@rbri Are there any plans to bring it back?

sumollodha commented 8 months ago

+1 we also do use getNonNormalizedValue in our source code.

garg23neha commented 8 months ago

@rbri are there any plans on getting getNonNormalizedValue back to the code

rbri commented 8 months ago

will try

rbri commented 8 months ago

@garg23neha @sumollodha how should the getNonNormalizedValue() behave if the http://cyberneko.org/html/features/scanner/normalize-attrs is enabled?

in the old version it looks like the blanks are also removed from the getNonNormalizedValues method - but from the name it makes no sesnse for me.

garg23neha commented 7 months ago

@rbri I am not sure because we are not using the feature http://cyberneko.org/html/features/scanner/normalize-attrs

rbri commented 7 months ago

@garg23neha, @sumollodha the features is now really back please test with the 4.0.0-SNAPSHOT build and enable the

public static final String PLAIN_ATTRIBUTE_VALUES = "http://cyberneko.org/html/features/scanner/plain-attr-values";

feature. I also added some chars to the readme - please have a look and provide feedback.

garg23neha commented 7 months ago

@rbri HTMLScannerTest.reader() test is failing for me.

garg23neha commented 7 months ago

@rbri do you know any case where getValue and getNonNormalizedValue can return different result? In my project, we are using getNonNormalizedValue. So our use-case is when we get urls like http://www.example.org?name=bob&hat=bowler, the & will get get replaced otheriwse SAX parser will thing that &hat is a malformed entity definition. But getValue and getNonNormalizedValue are returning same result for "http://www.example.org?name=bob&hat=bowler". So I am not sure when can we see difference in the output of these two methods. Any pointers would be highly appreciated.

rbri commented 7 months ago

@rbri HTMLScannerTest.reader() test is failing for me.

Why not adding any hints about the failure you get - do you really think i can guess that.....

rbri commented 7 months ago

@rbri do you know any case where getValue and getNonNormalizedValue can return different result?

Yes of course, you have to

Or you have a look at the testcases i have added as part of this....

garg23neha commented 7 months ago

@rbri HTMLScannerTest.reader() test is failing for me.

Why not adding any hints about the failure you get - do you really think i can guess that.....

[ERROR] HTMLScannerTest.reader:340 expected: <(html (head )head (body (script Atype text/javascript "// )script )body )html> but was: <(html (head )head (body (script Atype text/javascript "// )script )body )html>

garg23neha commented 7 months ago

@rbri do you know any case where getValue and getNonNormalizedValue can return different result?

Yes of course, you have to

  • enable the feature PLAIN_ATTRIBUTE_VALUES (see above) and
  • place an valid entity reference in one of the attributes

Or you have a look at the testcases i have added as part of this.... @rbri In the previous version of neko-html when we had the support for NonNormalizedValue then also do we need to enable any feature?

rbri commented 7 months ago

@garg23neha test is fixed, thanks

In the previous version of neko-html when we had the support for NonNormalizedValue then also do we need to enable any feature?

I guess not - but for the current version we need this feature. The reason behind is simply performance and memory consumption. We spend a lot of time to optimize the parsing process to be faster and more memory friendly because some projects using this require this kind of optimization. Still not sure if adding the function back for you kills this other efforts in some way.

rbri commented 7 months ago

@sumollodha @garg23neha

because i lost the thread a bit i close this.

Please