asfimport commented 18 years ago

Hideaki Kimura (Bug 39092): JMeter uses htmlparser 1.3, not the latest version, 1.6, which fixes many bugs and has powerful NodeFilters. Simply replacing the htmlparser.jar distributed with JMeter with the latest htmlparser doesn't work, because HtmlParserHTMLParser.java uses APIs that are incompatible with the new version. This makes it a little difficult to use htmlparser in the BeanShell Sampler. This is why JMeter should UPDATE the htmlparser.

However, as htmlparser is under the LGPL while JMeter is under the Apache License, we have to make JMeter work well without htmlparser in order to update the donated htmlparser code to the latest version, 1.6. This is why JMeter should ISOLATE the htmlparser.

Severity: normal OS: All

asfimport commented 18 years ago

Hideaki Kimura (migrated from Bugzilla): Created attachment HtmlParserHTMLParser.patch: a patch for HtmlParserHTMLParser to UPDATE the htmlparser

asfimport commented 18 years ago

Hideaki Kimura (migrated from Bugzilla): Created attachment HTMLParser.patch: a patch for HTMLParser to ISOLATE the htmlparser

HTMLParser.patch

````diff
--- C:\oldHTMLParser.java	Fri Mar 24 10:47:08 2006
+++ C:\newHTMLParser.java	Fri Mar 24 10:46:48 2006
@@ -91,7 +91,12 @@
         } catch (ClassNotFoundException e) {
             throw new HTMLParseError(e);
         }
-        log.info("Created " + htmlParserClassName);
+        if (!pars.isValid()) {
+            log.warn(htmlParserClassName + " can't be used. Instead, RegexpHTMLParser is used.");
+            pars = new RegexpHTMLParser(); // RegexpHTMLParser is always ready to use.
+        } else {
+            log.info("Created " + htmlParserClassName);
+        }
         if (pars.isReusable()) {
             parsers.put(htmlParserClassName, pars);// cache the parser
         }
@@ -218,6 +223,16 @@
      */
     protected boolean isReusable() {
         return false;
+    }
+
+
+    /**
+     * Parsers should over-ride this method if the parser might be
+     * "not ready" to use in some situation.
+     * @return true if the HTMLParser is ready to use.
+     */
+    protected boolean isValid() {
+        return true;
     }
 
     // ////////////////////////// TEST CODE FOLLOWS
````
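For readers skimming the diff: the idea is that a parser can report that it is not usable (for example because an optional jar is missing), and the factory then falls back to a parser that is always available. The following is only a minimal, self-contained sketch of that pattern. Every class name in it (`ParserFallbackSketch`, `SketchParser`, `RegexSketchParser`, `OptionalLibSketchParser`) is a hypothetical stand-in, not a JMeter class, and the boolean flag merely simulates whether the optional library is present.

```java
import java.util.HashMap;
import java.util.Map;

class ParserFallbackSketch {
    /** Hypothetical stand-in for a parser base class. */
    abstract static class SketchParser {
        /** Override when the parser may be unusable, e.g. an optional jar is missing. */
        protected boolean isValid() { return true; }

        /** Override to return true when one instance may be cached and shared. */
        protected boolean isReusable() { return false; }

        abstract String name();
    }

    /** Always available, so it can serve as the fallback. */
    static class RegexSketchParser extends SketchParser {
        @Override String name() { return "regex"; }
    }

    /** Pretends to depend on an optional library; the flag simulates its presence. */
    static class OptionalLibSketchParser extends SketchParser {
        private final boolean libraryPresent;

        OptionalLibSketchParser(boolean libraryPresent) { this.libraryPresent = libraryPresent; }

        @Override protected boolean isValid() { return libraryPresent; }
        @Override protected boolean isReusable() { return true; }
        @Override String name() { return "optional-lib"; }
    }

    private static final Map<String, SketchParser> CACHE = new HashMap<>();

    /** Uses the candidate if it is valid, otherwise falls back; caches reusable parsers. */
    static SketchParser choose(String key, SketchParser candidate) {
        SketchParser pars = candidate;
        if (!pars.isValid()) {
            pars = new RegexSketchParser(); // the fallback is always ready to use
        }
        if (pars.isReusable()) {
            CACHE.put(key, pars);
        }
        return pars;
    }

    public static void main(String[] args) {
        System.out.println(choose("a", new OptionalLibSketchParser(true)).name());  // optional-lib
        System.out.println(choose("b", new OptionalLibSketchParser(false)).name()); // regex
    }
}
```

Note that in this sketch the fallback parser is not cached because it reports `isReusable()` as false, which mirrors the order of the checks in the patch.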

asfimport commented 18 years ago

Hideaki Kimura (migrated from Bugzilla): These patches enable JMeter to work with htmlparser 1.6, and to keep working even if htmlparser.jar cannot be found.

As for the development environment, src/htmlparser should be deleted and the related entries in build.xml or eclipse.classpath should be removed. Instead, the filterbuilder.jar, htmllexer.jar, htmlparser.jar, sax2.jar and thumbelina.jar files included in the latest htmlparser release (http://sourceforge.net/project/showfiles.php?group_id=24399&package_id=47712) should be added to the classpath for compiling.

As for the binary build, htmlparser.jar should no longer be included, so that users can install htmlparser as an option.

asfimport commented 18 years ago

Sebb (migrated from Bugzilla): Thanks, I'll take a look at this shortly

asfimport commented 18 years ago

peter lin (migrated from Bugzilla): There is a significant downside to asking users to download HTMLParser from SourceForge. Many users complain about this, so it needs to be documented clearly. We've seen this with the Webservice sampler, which requires users to download external jars. I disagree with deleting htmlparser from the src directory. The htmlparser developers were kind enough to donate a snapshot under the Apache license and I still find it valuable. Instead, we should make it configurable, or get rid of JTidy and htmlparser altogether. We currently have JTidy, regexp and htmlparser. The original reason for using htmlparser is that it's easier to use than JTidy and not significantly slower than regexp.

my 2 cents on the issue.

asfimport commented 18 years ago

Hideaki Kimura (migrated from Bugzilla): Thanks for your concern, Peter.

Indeed, downloading manually is troublesome. But JMeter still works without htmlparser, apart from the performance drop that comes from the slower RegexpHTMLParser you mentioned.

The only users who have to download htmlparser manually are those who turn on "retrieve all" in the HTTP Sampler and also care about the HTTP Sampler's performance. In most cases, I think, users don't have to do anything more than they do now.

But anyway, the benefit of using the donated code still exists, as you say. Then... how about asking the htmlparser development team again? Is that too intrusive?

(In reply to comment 5)

> There is a significant downside to asking users to download HTMLParser from SourceForge. Many users complain about this, so it needs to be documented clearly. We've seen this with the Webservice sampler, which requires users to download external jars. I disagree with deleting htmlparser from the src directory. The htmlparser developers were kind enough to donate a snapshot under the Apache license and I still find it valuable. Instead, we should make it configurable, or get rid of JTidy and htmlparser altogether. We currently have JTidy, regexp and htmlparser. The original reason for using htmlparser is that it's easier to use than JTidy and not significantly slower than regexp.
>
> my 2 cents on the issue.

asfimport commented 18 years ago

Sebb (migrated from Bugzilla): The following property is used to define the parser interface class:

htmlParser.className

so one should be able to create a new class to use the new API - instead of replacing the existing class as currently proposed.

If a user wants the new parser, then they just download the new jars, and update the parser property.

OK?
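A minimal sketch of what property-driven parser selection could look like, for anyone unfamiliar with the approach: the implementation class name is read from a property and instantiated by reflection, so a user can switch parsers without the existing class being replaced. Only the property name `htmlParser.className` comes from this thread. The `LinkExtractor` interface, the two extractor classes and the `load` helper are hypothetical names made up for the illustration, and this is not how JMeter's own `HTMLParser` is written.

```java
import java.util.Properties;

class ParserSelectionSketch {
    /** Hypothetical parser contract used only for this illustration. */
    interface LinkExtractor {
        String describe();
    }

    /** Stand-in for the parser that ships by default. */
    public static class DefaultExtractor implements LinkExtractor {
        public String describe() { return "default"; }
    }

    /** Stand-in for a parser written against a newer library API. */
    public static class NewApiExtractor implements LinkExtractor {
        public String describe() { return "new-api"; }
    }

    static final String PROP = "htmlParser.className";

    /** Instantiates the class named by the property, or the default if that fails. */
    static LinkExtractor load(Properties props) {
        String className = props.getProperty(PROP, DefaultExtractor.class.getName());
        try {
            return (LinkExtractor) Class.forName(className)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException | LinkageError | ClassCastException e) {
            // Class missing, not loadable, or not a LinkExtractor: use the default.
            return new DefaultExtractor();
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(load(props).describe()); // default

        props.setProperty(PROP, NewApiExtractor.class.getName());
        System.out.println(load(props).describe()); // new-api

        props.setProperty(PROP, "no.such.Parser");
        System.out.println(load(props).describe()); // default
    }
}
```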

asfimport commented 18 years ago

peter lin (migrated from Bugzilla): As usual, you have great ideas, sebb. That sounds like a good solution to me. peter

asfimport commented 18 years ago

Hideaki Kimura (migrated from Bugzilla): One thing: compiling the current HtmlParserHTMLParser needs htmlparser 1.3, while compiling the new HtmlParserHTMLParser needs 1.6.

Unfortunately, it's impossible to write a new HtmlParserHTMLParser that compiles with both htmlparser 1.6 and 1.3. They are totally incompatible. Only one of them can be in the JMeter source code.

asfimport commented 18 years ago

Sebb (migrated from Bugzilla): I think I can get round the compilation problem.

However, the problem I have at the moment is that the patch does not work for me.

Please can you attach the full new parser file?

asfimport commented 18 years ago

Hideaki Kimura (migrated from Bugzilla): Created attachment HtmlParserHTMLParser.java: a whole source code of new HTMLParserHTMLParser

HtmlParserHTMLParser.java

````java
// $Header$
/*
 * Copyright 2003-2004 The Apache Software Foundation.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 */

package org.apache.jmeter.protocol.http.parser;

import java.net.MalformedURLException;
import java.net.URL;
import java.util.Iterator;

import org.apache.jorphan.logging.LoggingManager;
import org.apache.log.Logger;
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.Tag;
import org.htmlparser.tags.AppletTag;
import org.htmlparser.tags.BaseHrefTag;
import org.htmlparser.tags.BodyTag;
import org.htmlparser.tags.CompositeTag;
import org.htmlparser.tags.FrameTag;
import org.htmlparser.tags.ImageTag;
import org.htmlparser.tags.InputTag;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.tags.ScriptTag;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.util.ParserException;

/**
 * HtmlParser implementation using SourceForge's HtmlParser.
 *
 * @version $Revision: 325588 $ updated on $Date: 2005-08-04 10:31:09 +0900 $
 */
class HtmlParserHTMLParser extends HTMLParser {
    /** Used to store the Logger (used for debug and error messages). */
    transient private static Logger log = LoggingManager.getLoggerForClass();

    protected HtmlParserHTMLParser() throws NoClassDefFoundError {
        super();
    }

    /** {@inheritDoc}. **/
    protected boolean isValid() {
        // check whether htmlparser exists.
        try {
            new Parser();
        } catch (NoClassDefFoundError e) {
            return false;
        }
        return true;
    }

    protected boolean isReusable() {
        return true;
    }

    /*
     * (non-Javadoc)
     *
     * @see org.apache.jmeter.protocol.http.parser.HtmlParser#getEmbeddedResourceURLs(byte[],
     *      java.net.URL)
     */
    public Iterator getEmbeddedResourceURLs(byte[] html, URL baseUrl, URLCollection urls) throws HTMLParseException {
        log.debug("Parsing html of: " + baseUrl);

        Parser htmlParser = null;
        try {
            String contents = new String(html);
            htmlParser = new Parser();
            htmlParser.setInputHTML(contents);
        } catch (Exception e) {
            throw new HTMLParseException(e);
        }

        // Now parse the DOM tree
        try {
            // we start to iterate through the elements
            parseNodes(htmlParser.elements(), baseUrl, urls);
            log.debug("End : parseNodes");
        } catch (ParserException e) {
            throw new HTMLParseException(e);
        }

        return urls.iterator();
    }

    /**
     * Recursively parse all nodes to pick up all URL s.
     * @see e the nodes to be parsed
     * @see baseUrl Base URL from which the HTML code was obtained
     * @see urls URLCollection
     */
    private void parseNodes(final NodeIterator e, URL baseUrl, final URLCollection urls)
            throws HTMLParseException, ParserException {
        while (e.hasMoreNodes()) {
            Node node = e.nextNode();
            // a url is always in a Tag.
            if (node instanceof Tag == false) {
                continue;
            }
            Tag tag = (Tag) node;
            String tagname = tag.getTagName();
            String binUrlStr = null;

            // first we check to see if body tag has a
            // background set
            if (tag instanceof BodyTag) {
                binUrlStr = tag.getAttribute("background");
            } else if (tag instanceof BaseHrefTag) {
                BaseHrefTag baseHref = (BaseHrefTag) tag;
                String baseref = baseHref.getBaseUrl().toString();
                try {
                    if (!baseref.equals(""))// Bugzilla 30713
                    {
                        baseUrl = new URL(baseUrl, baseHref.getBaseUrl() + "/");
                    }
                } catch (MalformedURLException e1) {
                    throw new HTMLParseException(e1);
                }
            } else if (tag instanceof ImageTag) {
                ImageTag image = (ImageTag) tag;
                binUrlStr = image.getImageURL();
            } else if (tag instanceof AppletTag) {
                // look for applets

                // This will only work with an Applet .class file.
                // Ideally, this should be upgraded to work with Objects (IE)
                // and archives (.jar and .zip) files as well.
                AppletTag applet = (AppletTag) tag;
                binUrlStr = applet.getAppletClass();
            } else if (tag instanceof InputTag) {
                // we check the input tag type for image
                String strType = tag.getAttribute("type");
                if (strType != null && strType.equalsIgnoreCase("image")) {
                    // then we need to download the binary
                    binUrlStr = tag.getAttribute("src");
                }
            } else if (tag instanceof LinkTag) {
                LinkTag link = (LinkTag) tag;
                if (link.getChild(0) instanceof ImageTag) {
                    ImageTag img = (ImageTag) link.getChild(0);
                    binUrlStr = img.getImageURL();
                }
            } else if (tag instanceof ScriptTag) {
                binUrlStr = tag.getAttribute("src");
            } else if (tag instanceof FrameTag) {
                binUrlStr = tag.getAttribute("src");
            } else if (tagname.equalsIgnoreCase("EMBED")
                    || tagname.equalsIgnoreCase("BGSOUND")) {
                binUrlStr = tag.getAttribute("src");
            } else if (tagname.equalsIgnoreCase("LINK")) {
                if (tag.getAttribute("rel").equalsIgnoreCase("stylesheet")) {
                    binUrlStr = tag.getAttribute("href");
                }
            }

            if (binUrlStr != null) {
                urls.addURL(binUrlStr, baseUrl);
            }

            // second, if the tag was a composite tag,
            // recursively parse its children.
            if (tag instanceof CompositeTag) {
                CompositeTag composite = (CompositeTag) tag;
                parseNodes(composite.elements(), baseUrl, urls);
            }
        }
    }
}
````

asfimport commented 18 years ago

Hideaki Kimura (migrated from Bugzilla): Created attachment HTMLParser.java: a whole source code of new HTMLParser

asfimport commented 18 years ago

Hideaki Kimura (migrated from Bugzilla): Was it in an incorrect format? If so, sorry. I used WinMerge to make the patches. Anyway, I've uploaded the whole file.

(In reply to comment 10)

> I think I can get round the compilation problem.
>
> However, the problem I have at the moment is that the patch does not work for me.
>
> Please can you attach the full new parser file?

asfimport commented 18 years ago

Sebb (migrated from Bugzilla): There's a problem with the parsing code - it does not seem to handle BASEREF tags properly. They are detected, but the new base is not saved for subsequent tags.

asfimport commented 18 years ago

Hideaki Kimura (migrated from Bugzilla): !!! So sorry, silly mistake. It just overwrites the pointer, not the pointee. I'll fix it right now.

(In reply to comment 14)

> There's a problem with the parsing code - it does not seem to handle BASEREF tags properly. They are detected, but the new base is not saved for subsequent tags.
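The "pointer, not pointee" remark refers to the fact that Java passes object references by value: assigning a new object to a method parameter changes only the local copy, so the caller (and, in recursive code, every outer call) keeps seeing the old value. A small illustration of the difference follows. The names `BaseHolderSketch` and `UriHolder` are hypothetical, and `java.net.URI` is used here instead of `java.net.URL` purely to keep the example short.

```java
import java.net.URI;

class BaseHolderSketch {
    /** Mutable wrapper so that a callee can change what the caller sees. */
    static final class UriHolder {
        URI value;

        UriHolder(URI value) { this.value = value; }
    }

    /** Reassigns the parameter itself: the caller's variable is unaffected. */
    static void reassignParameter(URI base) {
        base = URI.create("http://example.org/new/");
    }

    /** Mutates the shared holder: the caller observes the new value. */
    static void updateHolder(UriHolder base) {
        base.value = URI.create("http://example.org/new/");
    }

    public static void main(String[] args) {
        URI plain = URI.create("http://example.org/old/");
        reassignParameter(plain);
        System.out.println(plain); // http://example.org/old/

        UriHolder held = new UriHolder(URI.create("http://example.org/old/"));
        updateHolder(held);
        System.out.println(held.value); // http://example.org/new/
    }
}
```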

asfimport commented 18 years ago

Hideaki Kimura (migrated from Bugzilla): sorry, what a silly mistake..

Created attachment HtmlParserHTMLParser.java: a modified source code of new HTMLParserHTMLParser

HtmlParserHTMLParser.java

````java
// $Header$
/*
 * Copyright 2003-2004 The Apache Software Foundation.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 */

package org.apache.jmeter.protocol.http.parser;

import java.net.MalformedURLException;
import java.net.URL;
import java.util.Iterator;

import org.apache.jorphan.logging.LoggingManager;
import org.apache.log.Logger;
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.Tag;
import org.htmlparser.tags.AppletTag;
import org.htmlparser.tags.BaseHrefTag;
import org.htmlparser.tags.BodyTag;
import org.htmlparser.tags.CompositeTag;
import org.htmlparser.tags.FrameTag;
import org.htmlparser.tags.ImageTag;
import org.htmlparser.tags.InputTag;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.tags.ScriptTag;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.util.ParserException;

/**
 * HtmlParser implementation using SourceForge's HtmlParser.
 *
 * @version $Revision: 325588 $ updated on $Date: 2005-08-04 10:31:09 +0900 $
 */
class HtmlParserHTMLParser extends HTMLParser {
    /** Used to store the Logger (used for debug and error messages). */
    transient private static Logger log = LoggingManager.getLoggerForClass();

    protected HtmlParserHTMLParser() throws NoClassDefFoundError {
        super();
    }

    /** {@inheritDoc}. **/
    protected boolean isValid() {
        // check whether htmlparser exists.
        try {
            new Parser();
        } catch (NoClassDefFoundError e) {
            return false;
        }
        return true;
    }

    protected boolean isReusable() {
        return true;
    }

    /*
     * (non-Javadoc)
     *
     * @see org.apache.jmeter.protocol.http.parser.HtmlParser#getEmbeddedResourceURLs(byte[],
     *      java.net.URL)
     */
    public Iterator getEmbeddedResourceURLs(byte[] html, URL baseUrl, URLCollection urls) throws HTMLParseException {
        log.debug("Parsing html of: " + baseUrl);

        Parser htmlParser = null;
        try {
            String contents = new String(html);
            htmlParser = new Parser();
            htmlParser.setInputHTML(contents);
        } catch (Exception e) {
            throw new HTMLParseException(e);
        }

        // Now parse the DOM tree
        try {
            // we start to iterate through the elements
            parseNodes(htmlParser.elements(), new URLPointer(baseUrl), urls);
            log.debug("End : parseNodes");
        } catch (ParserException e) {
            throw new HTMLParseException(e);
        }

        return urls.iterator();
    }

    /**
     * A dummy class to pass the pointer of URL.
     */
    private static class URLPointer {
        private URLPointer(URL newUrl) {
            url = newUrl;
        }
        private URL url;
    }

    /**
     * Recursively parse all nodes to pick up all URL s.
     * @see e the nodes to be parsed
     * @see baseUrl Base URL from which the HTML code was obtained
     * @see urls URLCollection
     */
    private void parseNodes(final NodeIterator e, final URLPointer baseUrl, final URLCollection urls)
            throws HTMLParseException, ParserException {
        while (e.hasMoreNodes()) {
            Node node = e.nextNode();
            // a url is always in a Tag.
            if (node instanceof Tag == false) {
                continue;
            }
            Tag tag = (Tag) node;
            String tagname = tag.getTagName();
            String binUrlStr = null;

            // first we check to see if body tag has a
            // background set
            if (tag instanceof BodyTag) {
                binUrlStr = tag.getAttribute("background");
            } else if (tag instanceof BaseHrefTag) {
                BaseHrefTag baseHref = (BaseHrefTag) tag;
                String baseref = baseHref.getBaseUrl().toString();
                try {
                    if (!baseref.equals(""))// Bugzilla 30713
                    {
                        baseUrl.url = new URL(baseUrl.url, baseHref.getBaseUrl() + "/");
                    }
                } catch (MalformedURLException e1) {
                    throw new HTMLParseException(e1);
                }
            } else if (tag instanceof ImageTag) {
                ImageTag image = (ImageTag) tag;
                binUrlStr = image.getImageURL();
            } else if (tag instanceof AppletTag) {
                // look for applets

                // This will only work with an Applet .class file.
                // Ideally, this should be upgraded to work with Objects (IE)
                // and archives (.jar and .zip) files as well.
                AppletTag applet = (AppletTag) tag;
                binUrlStr = applet.getAppletClass();
            } else if (tag instanceof InputTag) {
                // we check the input tag type for image
                String strType = tag.getAttribute("type");
                if (strType != null && strType.equalsIgnoreCase("image")) {
                    // then we need to download the binary
                    binUrlStr = tag.getAttribute("src");
                }
            } else if (tag instanceof LinkTag) {
                LinkTag link = (LinkTag) tag;
                if (link.getChild(0) instanceof ImageTag) {
                    ImageTag img = (ImageTag) link.getChild(0);
                    binUrlStr = img.getImageURL();
                }
            } else if (tag instanceof ScriptTag) {
                binUrlStr = tag.getAttribute("src");
            } else if (tag instanceof FrameTag) {
                binUrlStr = tag.getAttribute("src");
            } else if (tagname.equalsIgnoreCase("EMBED")
                    || tagname.equalsIgnoreCase("BGSOUND")) {
                binUrlStr = tag.getAttribute("src");
            } else if (tagname.equalsIgnoreCase("LINK")) {
                if (tag.getAttribute("rel").equalsIgnoreCase("stylesheet")) {
                    binUrlStr = tag.getAttribute("href");
                }
            }

            if (binUrlStr != null) {
                urls.addURL(binUrlStr, baseUrl.url);
            }

            // second, if the tag was a composite tag,
            // recursively parse its children.
            if (tag instanceof CompositeTag) {
                CompositeTag composite = (CompositeTag) tag;
                parseNodes(composite.elements(), baseUrl, urls);
            }
        }
    }
}
````

asfimport commented 18 years ago

Sebb (migrated from Bugzilla): That works better, though I had to change:

baseUrl.url = new URL(baseUrl.url, baseHref.getBaseUrl() + "/");

to

baseUrl.url = new URL(baseUrl.url, baseHref.getBaseUrl());

to avoid getting // in URLs.
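To see where the doubled slash comes from, here is a small sketch using plain `java.net.URL` resolution. It assumes the base href already ends in `/`, which is the common case; the host and paths are made-up examples and `BaseHrefSlashSketch` is a hypothetical name. `java.net.URL` does not collapse empty path segments, so an extra `/` appended to the base survives into every resolved resource URL.

```java
import java.net.MalformedURLException;
import java.net.URL;

@SuppressWarnings("deprecation") // the URL constructors are deprecated on recent JDKs
class BaseHrefSlashSketch {
    /** Resolves baseHref against the page URL, then the resource against that base. */
    static String resolve(String pageUrl, String baseHref, String resource) {
        try {
            URL page = new URL(pageUrl);
            URL base = new URL(page, baseHref);
            return new URL(base, resource).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String page = "http://example.org/docs/index.html";
        String href = "http://example.org/static/";

        // Appending "/" to a base that already ends in "/" leaves an empty segment.
        System.out.println(resolve(page, href + "/", "logo.gif")); // http://example.org/static//logo.gif

        // Using the base href as given resolves cleanly.
        System.out.println(resolve(page, href, "logo.gif"));       // http://example.org/static/logo.gif
    }
}
```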

==

By the way, I found I only needed htmlparser.jar for compiling and running.

asfimport commented 18 years ago

Hideaki Kimura (migrated from Bugzilla): Thanks for pointing that out.

As you said, only htmlparser.jar seems to be needed. It worked in my environment, too.

If you can get round the compilation problem and are going to add a new class instead of replacing the existing one, please discard my changes to HTMLParser.java, remove the isValid() method from the new HtmlParserHTMLParser.java, and give it a new name, like HtmlParserHTMLParser16(?).

I apologize if you have already done or planned this (most likely so...)

(In reply to comment 17)

> That works better, though I had to change:
>
> baseUrl.url = new URL(baseUrl.url, baseHref.getBaseUrl() + "/");
>
> to
>
> baseUrl.url = new URL(baseUrl.url, baseHref.getBaseUrl());
>
> to avoid getting // in URLs.
>
> ==
>
> By the way, I found I only needed htmlparser.jar for compiling and running.

asfimport commented 18 years ago

Sebb (migrated from Bugzilla): The attached jar contains my version of the source, and the compiled class.

I've kept the same name as the existing class. To use it: Replace the JMeter htmlparser.jar with the SF one. Put htmlparserparser.jar in the lib directory. You may also need to delete the HtmlParserHTMLParser.class file from the JMeter http jar.

I've not decided how best to build/test the new class automatically yet - at present I'm using a separate Eclipse project, which is not ideal. But in the meantime, please try it, and try to break it...

Created attachment htmlparserparser.jar: htmlparser source and class

asfimport commented 18 years ago

Hideaki Kimura (migrated from Bugzilla):

> Replace the jmeter htmlparser.jar with the SF one. Put the htmlparserparser.jar in the lib directory. You may also need to delete the htmlparserhtmlparser.class file from the Jmeter http jar.

All right. It worked well, though I did need to delete htmlparserhtmlparser.class, as you said.

And I found one more bug in HtmlParserHTMLParser. Line 163, `if (tag.getAttribute("rel").equalsIgnoreCase("stylesheet")) {`, should be `if (tag.getAttribute("rel") != null && tag.getAttribute("rel").equalsIgnoreCase("stylesheet")) {`. An NPE happens when retrieving http://db.apache.org/ or other sites that have "</link>" (XHTML).
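As a side note, a common Java idiom for this kind of check is to call `equalsIgnoreCase` on the string literal rather than on the possibly-null value, which avoids both the NPE and the second attribute lookup. A tiny sketch of that idiom follows. `NullSafeRelSketch` and `isStylesheet` are hypothetical names, and the `rel` argument stands in for whatever `tag.getAttribute("rel")` returns.

```java
class NullSafeRelSketch {
    /** True only when rel is non-null and equals "stylesheet", ignoring case. */
    static boolean isStylesheet(String rel) {
        // Calling the method on the literal means a null rel simply yields false.
        return "stylesheet".equalsIgnoreCase(rel);
    }

    public static void main(String[] args) {
        System.out.println(isStylesheet("StyleSheet")); // true
        System.out.println(isStylesheet("icon"));       // false
        System.out.println(isStylesheet(null));         // false, no NullPointerException
    }
}
```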

> I've not decided how best to build/test the new class automatically yet - at present I'm using a separate Eclipse project, which is not ideal. But in the meantime, please try it, and try to break it...

I agree with you. It would be better to think about how to provide this, though I feel like asking the htmlparser team again some day.

Thanks, sebb.

asfimport commented 18 years ago

Hideaki Kimura (migrated from Bugzilla): Incidentally, I would like to show how valuable the combination of htmlparser 1.6 and the BeanShell Sampler is. The attached jmx file demonstrates it. This jmx accesses www.apache.org and retrieves link tags with this BeanShell code:

    parser = new Parser();
    parser.setInputHTML(new String(ctx.getPreviousResult().getResponseData(), "iso-8859-1"));
    // pickup apache sites shown on the left.
    aTags = parser.parse(new AndFilter(new TagNameFilter("td"), new HasAttributeFilter("class", "navleft")))
        .extractAllNodesThatMatch(new LinkRegexFilter("http://.*/"), true);

htmlparser's NodeFilters are very cool!

After that, it puts them into variables:

    for(i = 0; i < aTags.size() && i < 10; ++i) {
        href = aTags.elementAt(i).getAttribute("href");
        server = href.substring("http://".length(), href.length() - 1);
        log.info("The server of found site is : " + server);
        vars.put("SITESSERVER" + (i + 1), server);
    }

Then a ForEach Controller and a parameterized HTTP Sampler call each retrieved URL.

The above is only an example; this process has unprecedented flexibility, maintainability and ease of correlation. It hasn't been provided by LoadRunner/Rational or any other test tool! I'm so grateful to the developers of htmlparser and the BeanShell Sampler. Really exciting!

Created attachment BSHandHTMLParser.jmx: A jmx file I used for testing the combination of BeanShell Sampler and htmlparser1.6

asfimport commented 18 years ago

Sebb (migrated from Bugzilla): (In reply to comment 20)

> And, I found one more bug of HtmlParserHTMLParser, in Line 163 if (tag.getAttribute("rel").equalsIgnoreCase("stylesheet")) { should be if (tag.getAttribute("rel") != null && tag.getAttribute("rel").equalsIgnoreCase("stylesheet")) { NPE happens during retrieving http://db.apache.org/ or other sites which have "</link>" (xhtml).

OK.

We're not interested in end tags - perhaps they should be filtered out.

BTW, why use (tagname.equalsIgnoreCase("LINK")) rather than (tag instanceof LinkTag)?

> > I've not decided how best to build/test the new class automatically yet - at present I'm using a separate Eclipse project, which is not ideal. But in the meantime, please try it, and try to break it...
>
> I agree with you. It's better to think about how to provide this, though I feel like asking the htmlparser team again at some day.

If they are willing to additionally licence the binary under an ASF-compatible license, then it would probably solve all the problems...

asfimport commented 18 years ago

Hideaki Kimura (migrated from Bugzilla):

> BTW, why use (tagname.equalsIgnoreCase("LINK")) rather than (tag instanceof LinkTag?)

That is because LinkTag represents the "A" tag, not the "LINK" tag. htmlparser 1.3 has "LinkTagTag" for that, but 1.6 doesn't.

> If they are willing to additionally licence the binary under an ASF-compatible license, then it would probably solve all the problems...

Yeah, I hope so. As Peter said, I think they are very kind guys.

asfimport commented 18 years ago

Hideaki Kimura (migrated from Bugzilla): Good news! Mr. Derrick Oswald, the lead developer of htmlparser, has generously permitted the JMeter project to redistribute htmlparser.jar.

> You are welcome to use the HTML Parser, in either binary or source code form, to be included with JMeter.
>
> Sincerely,
>
> Derrick Oswald
> HTML Parser Lead Programmer

Thanks for the advice, Peter.

asfimport commented 18 years ago

Sebb (migrated from Bugzilla): I've added a new parser class: HtmlParserHTMLParser16 to the 2.1 branch

Just set the parser property accordingly, and replace the htmlparser jar with version 1.6 (or later)
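For illustration, the property setting might look like the line below. This is an assumption rather than something confirmed in this thread: it supposes that the new class lives in the same package as the existing parser (`org.apache.jmeter.protocol.http.parser`, as seen in the attached source) and that the property is set in a JMeter properties file.

```properties
# Assumed fully qualified name; check the package of the class in your build.
htmlParser.className=org.apache.jmeter.protocol.http.parser.HtmlParserHTMLParser16
```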

apache / jmeter

htmlparser should be updated and isolated #1704
