Closed asfimport closed 18 years ago
Hideaki Kimura (migrated from Bugzilla): Created attachment HtmlParserHTMLParser.patch: a patch for HtmlParserHTMLParser to UPDATE the htmlparser
Hideaki Kimura (migrated from Bugzilla): Created attachment HTMLParser.patch: a patch for HTMLParser to ISOLATE the htmlparser
Hideaki Kimura (migrated from Bugzilla): These patches enable JMeter to work with htmlparser 1.6, and JMeter still works even if it cannot find htmlparser.jar.
As for the development environment, src/htmlparser should be deleted and the related entries in build.xml or eclipse.classpath should be removed. Instead, filterbuilder.jar, htmllexer.jar, htmlparser.jar, sax2.jar and thumbelina.jar from the latest htmlparser release http://sourceforge.net/project/showfiles.php?group_id=24399&package_id=47712 should be added to the classpath for compiling.
As for the binary build, htmlparser.jar should no longer be included, so that users can install htmlparser as an option.
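One way the "still works without htmlparser.jar" behaviour could be achieved is a classpath probe with a fallback. The sketch below is only an illustration under assumptions, not the code in the attached patches: ParserChooser, isOnClasspath and chooseParser are hypothetical names, and the fallback name is simply the regex-based parser discussed in this thread.

```java
// Hypothetical sketch: pick a parser depending on whether an optional jar is present.
class ParserChooser {

    /** Returns true if the named class can be located, without initializing it. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ParserChooser.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Missing jar (or a broken one): treat the optional library as absent.
            return false;
        }
    }

    /** Returns the preferred parser name if its marker class is available, else the fallback. */
    static String chooseParser(String markerClass, String preferred, String fallback) {
        return isOnClasspath(markerClass) ? preferred : fallback;
    }

    public static void main(String[] args) {
        // org.htmlparser.Parser is the library's entry class; the result depends on the classpath.
        System.out.println(chooseParser("org.htmlparser.Parser",
                "HtmlParserHTMLParser", "RegexHTMLParser"));
    }
}
```

The probe does not initialize the class, so it stays cheap when the jar is missing.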
Sebb (migrated from Bugzilla): Thanks, I'll take a look at this shortly
peter lin (migrated from Bugzilla): There is a significant downside to asking users to download HTMLParser from sourceforge. Many users complain about this, so it needs to be documented clearly. We've seen this with the Webservice sampler, which requires users to download external jars. I disagree with deleting htmlparser from the src directory. The htmlparser developers were kind enough to donate a snapshot under the Apache license and I still find it valuable. Instead, we should make it configurable, or get rid of JTidy and htmlparser altogether. We currently have JTidy, regexp and htmlparser. The original reason for using htmlparser is that it's easier to use than JTidy and not significantly slower than regexp.
my 2 cents on the issue.
Hideaki Kimura (migrated from Bugzilla): Thanks for your concern, Peter.
Exactly, downloading manually is troublesome. But JMeter still works without htmlparser, apart from the performance drop that comes from the low performance of the RegexHTMLParser you mentioned.
The only users who have to download htmlparser manually are those who turn on "retrieve all" in the HTTP Sampler and also care about the performance of the HTTP Sampler. In most cases, I think, users don't have to do anything more than they do now.
But, anyway, the benefit of using the donated code still exists, as you say. Then... how about asking the htmlparser development team again? Would that be too intrusive?
Sebb (migrated from Bugzilla): The following property is used to define the parser interface class:
htmlParser.className
so one should be able to create a new class to use the new API - instead of replacing the existing class as currently proposed.
If a user wants the new parser, then they just download the new jars, and update the parser property.
OK?
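A minimal sketch of how a class named by the htmlParser.className property could be instantiated by reflection. ParserFactory, LinkParser and DefaultParser are hypothetical stand-ins for illustration only; they are not JMeter's actual classes, and only the property name comes from the comment above.

```java
import java.util.Properties;

// Hypothetical sketch: instantiate whichever parser class a property names.
class ParserFactory {

    /** Stand-in for the parser abstraction; not the real JMeter interface. */
    interface LinkParser {
        String name();
    }

    /** Stand-in default used when the property is not set. */
    static class DefaultParser implements LinkParser {
        public String name() {
            return "default";
        }
    }

    /** Loads the class named by htmlParser.className and creates it via its no-arg constructor. */
    static LinkParser create(Properties props) {
        String className = props.getProperty("htmlParser.className",
                DefaultParser.class.getName());
        try {
            return (LinkParser) Class.forName(className)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            // Class missing, not instantiable, or not a LinkParser.
            throw new IllegalArgumentException("Cannot load parser class: " + className, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(create(new Properties()).name());
    }
}
```

With this kind of indirection, adding a second implementation for the new API only requires putting its jar on the classpath and changing the property value.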
peter lin (migrated from Bugzilla): As usual, you have great ideas, sebb. That sounds like a good solution to me. peter
Hideaki Kimura (migrated from Bugzilla): One thing to note is that compiling the current HtmlParserHTMLParser needs htmlparser 1.3, while compiling the new HtmlParserHTMLParser needs 1.6.
Unfortunately, it's impossible to write a new HtmlParserHTMLParser that can be compiled with both htmlparser 1.6 and 1.3. The two APIs are totally incompatible. Only one of them can be in the JMeter source code.
Sebb (migrated from Bugzilla): I think I can get round the compilation problem.
However, the problem I have at the moment is that the patch does not work for me.
Please can you attach the full new parser file?
Hideaki Kimura (migrated from Bugzilla): Created attachment HtmlParserHTMLParser.java: a whole source code of new HTMLParserHTMLParser
Hideaki Kimura (migrated from Bugzilla): Created attachment HTMLParser.java: a whole source code of new HTMLParser
Hideaki Kimura (migrated from Bugzilla): Was it in an incorrect format? If so, sorry. I used WinMerge to make the patches. Anyway, I uploaded the whole file.
Sebb (migrated from Bugzilla): There's a problem with the parsing code - it does not seem to handle BASEREF tags properly. They are detected, but the new base is not saved for subsequent tags.
Hideaki Kimura (migrated from Bugzilla): !!! So so sorry, silly mistake. It just overwrites the pointer, not pointee. I'll modify it right now.
Hideaki Kimura (migrated from Bugzilla): sorry, what a silly mistake..
Created attachment HtmlParserHTMLParser.java: a modified source code of new HTMLParserHTMLParser
Sebb (migrated from Bugzilla): That works better, though I had to change:
baseUrl.url = new URL(baseUrl.url, baseHref.getBaseUrl() + "/");
to
baseUrl.url = new URL(baseUrl.url, baseHref.getBaseUrl());
to avoid getting // in URLs.
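The effect of the extra "/" can be reproduced with java.net.URL alone. This is a small sketch under assumptions: BaseHrefDemo and resolve are hypothetical names, and the example.com URLs are made up, with the base href assumed to end in a slash already.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch: resolve a link against a page URL and a BASE href.
class BaseHrefDemo {

    /** Resolves baseHref against the page URL, then resolves link against that base. */
    @SuppressWarnings("deprecation") // URL constructors are deprecated in recent JDKs but still work
    static String resolve(String pageUrl, String baseHref, String link) {
        try {
            URL base = new URL(new URL(pageUrl), baseHref);
            return new URL(base, link).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String page = "http://example.com/dir/page.html";
        String href = "http://example.com/base/";
        // Appending "/" to an href that already ends in one leaves an empty path segment.
        System.out.println(resolve(page, href + "/", "img.png"));
        // Using the href as given resolves the link directly under /base/.
        System.out.println(resolve(page, href, "img.png"));
    }
}
```

java.net.URL removes "." and ".." segments when resolving but does not collapse repeated slashes, which is why the doubled slash survives into the resolved link.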
==
By the way, I found I only needed htmlparser.jar for compiling and running.
Hideaki Kimura (migrated from Bugzilla): Thanks for pointing that out.
As you said, it seems to need only htmlparser.jar. It worked in my environment, too.
If you can get round the compilation problem and are going to add a new class instead of replacing the existing one, please discard my changes to HTMLParser.java, remove the isValid() function from the new HtmlParserHTMLParser.java, and give the class a new name, like HtmlParserHTMLParser16(?).
I apologize if you have already done or planned this (most likely so...)
Sebb (migrated from Bugzilla): The attached jar contains my version of the source, and the compiled class.
I've kept the same name as the existing class. To use it: Replace the JMeter htmlparser.jar with the SF one. Put the htmlparserparser.jar in the lib directory. You may also need to delete the htmlparserhtmlparser.class file from the JMeter http jar. I've not decided how best to build/test the new class automatically yet - at present I'm using a separate Eclipse project, which is not ideal. But in the meantime, please try it, and try to break it...
Created attachment htmlparserparser.jar: htmlparser source and class
Hideaki Kimura (migrated from Bugzilla):
> Replace the jmeter htmlparser.jar with the SF one. Put the htmlparserparser.jar in the lib directory. You may also need to delete the htmlparserhtmlparser.class file from the Jmeter http jar.
All right. It worked well, though I needed to delete the htmlparserhtmlparser.class as you said.
And I found one more bug in HtmlParserHTMLParser. Line 163,
if (tag.getAttribute("rel").equalsIgnoreCase("stylesheet")) {
should be
if (tag.getAttribute("rel") != null && tag.getAttribute("rel").equalsIgnoreCase("stylesheet")) {
An NPE happens when retrieving http://db.apache.org/ or other sites which have "</link>" (XHTML).
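The same null guard can be written more compactly by calling equalsIgnoreCase on the constant. The sketch below is an illustration only: RelCheck and isStylesheet are hypothetical names, and a plain Map stands in for the tag's attributes so that the example runs without htmlparser on the classpath.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: null-safe check of a tag's "rel" attribute.
class RelCheck {

    /** Returns true only if a "rel" attribute is present and equals "stylesheet", ignoring case. */
    static boolean isStylesheet(Map<String, String> attributes) {
        // equalsIgnoreCase(null) returns false, so a missing attribute cannot cause an NPE.
        return "stylesheet".equalsIgnoreCase(attributes.get("rel"));
    }

    public static void main(String[] args) {
        Map<String, String> link = new HashMap<>();
        link.put("rel", "StyleSheet");
        System.out.println(isStylesheet(link));            // attribute present
        System.out.println(isStylesheet(new HashMap<>())); // attribute missing, e.g. an end tag
    }
}
```

This also looks the attribute up once instead of twice.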
> I've not decided how best to build/test the new class automatically yet - at present I'm using a separate Eclipse project, which is not ideal. But in the meantime, please try it, and try to break it...
I agree with you. It's better to think about how to provide this, though I feel like asking the htmlparser team again some day.
Thanks, sebb.
Hideaki Kimura (migrated from Bugzilla): Incidentally, I would like to show how valuable the combination of htmlparser 1.6 and the BeanShell Sampler is. The attached jmx file demonstrates it. This jmx accesses www.apache.org and retrieves link tags with this BeanShell code:
// htmlparser 1.6 classes used below
import org.htmlparser.Parser;
import org.htmlparser.filters.*;
parser = new Parser();
parser.setInputHTML(new String(ctx.getPreviousResult().getResponseData(), "iso-8859-1"));
// pick up the Apache sites shown on the left.
aTags = parser.parse(new AndFilter(new TagNameFilter("td"), new HasAttributeFilter("class", "navleft")))
    .extractAllNodesThatMatch(new LinkRegexFilter("http://.*/"), true);
and after that it puts them into variables:
for (i = 0; i < aTags.size() && i < 10; ++i) {
    href = aTags.elementAt(i).getAttribute("href");
    // strip the leading "http://" and the trailing "/"
    server = href.substring("http://".length(), href.length() - 1);
    log.info("The server of found site is : " + server);
    vars.put("SITESSERVER" + (i + 1), server);
}
A ForEach Controller and a parameterized HTTP Sampler then call each retrieved URL.
The above is only an example, but this approach offers unprecedented flexibility, maintainability and ease of correlation. It hasn't been provided by LoadRunner/Rational or any other test tool! I'm so grateful to the developers of htmlparser and the BeanShell Sampler. Really exciting!
Created attachment BSHandHTMLParser.jmx: A jmx file I used for testing the combination of BeanShell Sampler and htmlparser1.6
Sebb (migrated from Bugzilla): (In reply to comment 20)
> And, I found one more bug of HtmlParserHTMLParser, in Line 163
> if (tag.getAttribute("rel").equalsIgnoreCase("stylesheet")) {
> should be
> if (tag.getAttribute("rel") != null && tag.getAttribute("rel").equalsIgnoreCase("stylesheet")) {
> NPE happens during retrieving http://db.apache.org/ or other sites which have "</link>" (xhtml).
OK.
We're not interested in end tags - perhaps they should be filtered out.
BTW, why use (tagname.equalsIgnoreCase("LINK")) rather than (tag instanceof LinkTag?)
>> I've not decided how best to build/test the new class automatically yet - at present I'm using a separate Eclipse project, which is not ideal. But in the meantime, please try it, and try to break it...
> I agree with you. It's better to think about how to provide this, though I feel like asking the htmlparser team again at some day.
If they are willing to additionally licence the binary under an ASF-compatible license, then it would probably solve all the problems...
Hideaki Kimura (migrated from Bugzilla):
> BTW, why use (tagname.equalsIgnoreCase("LINK")) rather than (tag instanceof LinkTag?)
That is because LinkTag represents the "A" tag, not the "LINK" tag. htmlparser 1.3 has "LinkTagTag" for that, but 1.6 doesn't.
> If they are willing to additionally licence the binary under an ASF-compatible license, then it would probably solve all the problems...
Yeah, I hope so. As Peter said, I think they are very kind people.
Hideaki Kimura (migrated from Bugzilla): Good news! Mr. Derrick Oswald, the lead developer of htmlparser, has generously permitted the JMeter project to redistribute htmlparser.jar.
You are welcome to use the HTML Parser, in either binary or source code form, to be included with JMeter.
Sincerely,
Derrick Oswald HTML Parser Lead Programmer
Thanks for the advice, Peter.
Sebb (migrated from Bugzilla): I've added a new parser class: HtmlParserHTMLParser16 to the 2.1 branch
Just set the parser property accordingly, and replace the htmlparser jar with version 1.6 (or later)
Hideaki Kimura (Bug 39092): JMeter uses version 1.3 of htmlparser, not the latest version, 1.6, which has fixed many bugs and has powerful NodeFilters. Simply replacing htmlparser.jar in the distributed JMeter with the latest htmlparser doesn't work, because HtmlParserHTMLParser.java uses incompatible APIs. This makes using htmlparser in the BeanShell Sampler a little difficult. This is why JMeter should UPDATE the htmlparser.
However, as htmlparser is under the LGPL while JMeter is under the Apache License, we have to make JMeter work well without htmlparser in order to update the donated htmlparser code to the latest version, 1.6. This is why JMeter should ISOLATE the htmlparser.
Severity: normal OS: All