Closed garg23neha closed 7 months ago
@garg23neha will have a look...
can you please explain the use case a bit more
@rbri we are using it in a function where we want Xerces to not resolve entities when it parses attributes
we are using it in a function where we want Xerces to not resolve entities when it parses attributes
Ok, the getNonNormalizedValue() and most of the implementation was removed, because i have stripped neko down to the needs of HtmlUnit (and some other projects i'm aware of). And during the many performance optimizations we have removed nearly the whole support for this feature.
Have to think about a bit more....
@garg23neha do you need both versions of the attribute value or only the one one without the enriries resolved? And what do you do with the enresolvrd entities from the attributes?
@rbri We need both version of attribute value i.e getValue(int) and getNonNormalizedValue(int) both are needed. we don't resolve entities ( and so on) while printing attribute values. Is there a different way to disable this in neko-htmlunit now? We have a writer on top to exists to ensure we preserve CDATA blocks, without replacing them with entity escape sequences instead (simply because the resulting output for a code block is much nicer, and a little shorter, not for any other reason). Is there a way to enable this in neko-htmlunit now?
@rbri any inputs would be highly appreciated.
@rschwietzke what do you think about this - any idea how to do this without making the monster again slower?
It depends on how we store that. If we have a second map for that, we certainly create more overhead again (at least 4 bytes per node without the map itself) which comes in at between about 24 to 100 bytes at least. How about making that an optional feature so that we can either use impl A or B of the attribute? But I can also imagine a specialist map/list for that (every second entry is a normalized attribute so to speak), which might not change things a lot or things like the first half is normalized, the second half is the original one.
P.S: The need to have an attribute being a DOMNode is one of the performance oddities we have. If not mistaken, that is due to XPath requirements. But that is rather HtmlUnit's than Neko's fault.
@rbri Are there any plans to bring it back?
+1 we also do use getNonNormalizedValue
in our source code.
@rbri are there any plans on getting getNonNormalizedValue back to the code
will try
@garg23neha @sumollodha how should the getNonNormalizedValue() behave if the http://cyberneko.org/html/features/scanner/normalize-attrs is enabled?
in the old version it looks like the blanks are also removed from the getNonNormalizedValues method - but from the name it makes no sesnse for me.
@rbri I am not sure because we are not using the feature http://cyberneko.org/html/features/scanner/normalize-attrs
@garg23neha, @sumollodha the features is now really back please test with the 4.0.0-SNAPSHOT build and enable the
public static final String PLAIN_ATTRIBUTE_VALUES = "http://cyberneko.org/html/features/scanner/plain-attr-values";
feature. I also added some chars to the readme - please have a look and provide feedback.
@rbri HTMLScannerTest.reader() test is failing for me.
@rbri do you know any case where getValue and getNonNormalizedValue can return different result? In my project, we are using getNonNormalizedValue. So our use-case is when we get urls like http://www.example.org?name=bob&hat=bowler, the & will get get replaced otheriwse SAX parser will thing that &hat is a malformed entity definition. But getValue and getNonNormalizedValue are returning same result for "http://www.example.org?name=bob&hat=bowler". So I am not sure when can we see difference in the output of these two methods. Any pointers would be highly appreciated.
@rbri HTMLScannerTest.reader() test is failing for me.
Why not adding any hints about the failure you get - do you really think i can guess that.....
@rbri do you know any case where getValue and getNonNormalizedValue can return different result?
Yes of course, you have to
Or you have a look at the testcases i have added as part of this....
@rbri HTMLScannerTest.reader() test is failing for me.
Why not adding any hints about the failure you get - do you really think i can guess that.....
[ERROR] HTMLScannerTest.reader:340 expected: <(html (head )head (body (script Atype text/javascript "// )script )body )html> but was: <(html (head )head (body (script Atype text/javascript "// )script )body )html>
@rbri do you know any case where getValue and getNonNormalizedValue can return different result?
Yes of course, you have to
- enable the feature PLAIN_ATTRIBUTE_VALUES (see above) and
- place an valid entity reference in one of the attributes
Or you have a look at the testcases i have added as part of this.... @rbri In the previous version of neko-html when we had the support for NonNormalizedValue then also do we need to enable any feature?
@garg23neha test is fixed, thanks
In the previous version of neko-html when we had the support for NonNormalizedValue then also do we need to enable any feature?
I guess not - but for the current version we need this feature. The reason behind is simply performance and memory consumption. We spend a lot of time to optimize the parsing process to be faster and more memory friendly because some projects using this require this kind of optimization. Still not sure if adding the function back for you kills this other efforts in some way.
@sumollodha @garg23neha
because i lost the thread a bit i close this.
Please
@rbri In one of my company project, we are using getNonNormalizedValue() method from Xerces. But in the new implementation htmlunit-neko , this method is not present in XMLAttributesImpl.java.
Can you help me in understanding is there any alternate way for supporting getNonNormalizedValue() method?