HtmlUnit / htmlunit-neko

HtmlUnit adaptation of NekoHtml
Apache License 2.0

Htmlunit-NekoHtml Parser

The Htmlunit-NekoHtml Parser is an HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces.
The parser can scan HTML files and "fix up" many common mistakes that human (and computer) authors make in writing HTML documents.
NekoHTML adds missing parent elements; automatically closes elements with optional end tags; and can handle mismatched inline element tags.

The Htmlunit-NekoHtml Parser has no external dependencies, requires Java 8, and also works on Android.
The Htmlunit-NekoHtml Parser is used by HtmlUnit.

:heart: Sponsor

Project News

Developer Blog

HtmlUnit@mastodon | HtmlUnit@Twitter

Latest release Version 4.6.0 / November 05, 2024

CVE-2022-29546

Htmlunit-NekoHtml Parser suffers from a denial of service vulnerability in versions 2.60.0 and below. Specially crafted input in processing instructions leads to excessive heap memory consumption.

CVE-2022-28366

Htmlunit-NekoHtml Parser is vulnerable to a denial of service via crafted processing instructions in versions 2.26 and below.

Get it!

Maven

Add to your pom.xml:

<dependency>
    <groupId>org.htmlunit</groupId>
    <artifactId>neko-htmlunit</artifactId>
    <version>4.6.0</version>
</dependency>

Gradle

Add to your build.gradle:

implementation group: 'org.htmlunit', name: 'neko-htmlunit', version: '4.6.0'

HowTo use

DOMParser

The DOMParser can be used together with the simple built-in DOM implementation or with your own.

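import java.io.StringReader;

import org.htmlunit.cyberneko.html.dom.HTMLDocumentImpl;
import org.htmlunit.cyberneko.parsers.DOMParser;
import org.htmlunit.cyberneko.xerces.xni.parser.XMLInputSource;
import org.w3c.dom.NodeList;

// sample document to parse
final String html =
        " <!DOCTYPE html>\n"
        + "<html>\n"
        + "<body>\n"
        + "<h1>NekoHtml</h1>\n"
        + "</body>\n"
        + "</html>";

// wrap the string in an input source; "foo" is the system id
final StringReader sr = new StringReader(html);
final XMLInputSource in = new XMLInputSource(null, "foo", null, sr, null);

// use the provided simple DocumentImpl
final DOMParser parser = new DOMParser(HTMLDocumentImpl.class);
parser.parse(in);

// query the resulting document through the standard DOM API
HTMLDocumentImpl doc = (HTMLDocumentImpl) parser.getDocument();
NodeList headings = doc.getElementsByTagName("h1");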
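
Because the result is exposed through the standard org.w3c.dom interfaces, the headings list can be processed with plain JDK code. The following is a minimal sketch, not part of the library: NodeTexts and textOf are hypothetical names, and it assumes the DOM implementation in use supports the DOM Level 3 getTextContent() method.

```java
import java.util.ArrayList;
import java.util.List;

import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical helper (not part of neko-htmlunit) for reading text out of a NodeList. */
final class NodeTexts {

    private NodeTexts() {
        // static helper, no instances
    }

    /** Returns the trimmed text content of every node in the list, in document order. */
    static List<String> textOf(final NodeList nodes) {
        final List<String> result = new ArrayList<>(nodes.getLength());
        for (int i = 0; i < nodes.getLength(); i++) {
            final Node node = nodes.item(i);
            // getTextContent() may return null for some node types
            final String text = node.getTextContent();
            result.add(text == null ? "" : text.trim());
        }
        return result;
    }
}
```

With the example above, NodeTexts.textOf(headings) would be called on the list returned by getElementsByTagName("h1").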

SAXParser

Using the SAXParser is straightforward: simply provide your own org.xml.sax.ContentHandler implementation.

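import java.io.StringReader;

import org.htmlunit.cyberneko.parsers.SAXParser;
import org.htmlunit.cyberneko.xerces.xni.parser.XMLInputSource;
import org.xml.sax.ContentHandler;

// sample document to parse
final String html =
        " <!DOCTYPE html>\n"
        + "<html>\n"
        + "<body>\n"
        + "<h1>NekoHtml</h1>\n"
        + "</body>\n"
        + "</html>";

// wrap the string in an input source; "foo" is the system id
final StringReader sr = new StringReader(html);
final XMLInputSource in = new XMLInputSource(null, "foo", null, sr, null);

final SAXParser parser = new SAXParser();

// MyContentHandler is your own ContentHandler implementation
ContentHandler myContentHandler = new MyContentHandler();
parser.setContentHandler(myContentHandler);

// the handler receives the SAX events while the document is parsed
parser.parse(in);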
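
MyContentHandler is not defined in the snippet above. One possible shape for it is sketched below, using only the JDK's org.xml.sax classes; the class body and the getHeadings() method are illustrative assumptions, not library code. The handler collects the text of every h1 element and compares element names case-insensitively, since the case of the reported names may depend on the parser configuration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler (not part of neko-htmlunit): collects the text of all h1 elements. */
class MyContentHandler extends DefaultHandler {

    private final List<String> headings = new ArrayList<>();
    // non-null only while the parser is inside an h1 element
    private StringBuilder current;

    @Override
    public void startElement(final String uri, final String localName,
            final String qName, final Attributes atts) {
        if (isH1(localName, qName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(final char[] ch, final int start, final int length) {
        // text may arrive in several chunks, so append instead of assigning
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(final String uri, final String localName, final String qName) {
        if (current != null && isH1(localName, qName)) {
            headings.add(current.toString().trim());
            current = null;
        }
    }

    /** Returns a read-only view of the headings collected so far. */
    public List<String> getHeadings() {
        return Collections.unmodifiableList(headings);
    }

    private static boolean isH1(final String localName, final String qName) {
        // fall back to the qualified name if no local name is reported
        final String name = (localName == null || localName.isEmpty()) ? qName : localName;
        return "h1".equalsIgnoreCase(name);
    }
}
```

Extending DefaultHandler keeps the sketch short, because only the three callbacks of interest need to be overridden. After parser.parse(in) returns, the collected headings can be read from the handler instance.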

Features

The behavior of the scanner/parser can be influenced via a series of switches.

parser.setFeature(HTMLScanner.PLAIN_ATTRIBUTE_VALUES, true);

Supported features:

Properties

The behavior of the scanner/parser can also be configured via a series of properties.

parser.setProperty(HTMLScanner.ENCODING_TRANSLATOR, EncodingMap.INSTANCE);

Supported properties:

Last CI build

The latest builds are available from our Jenkins CI build server.

If you use Maven, please add:

<dependency>
    <groupId>org.htmlunit</groupId>
    <artifactId>neko-htmlunit</artifactId>
    <version>4.7.0-SNAPSHOT</version>
</dependency>

You also have to add the Sonatype snapshot repository to the repositories section of your pom.xml:

<repository>
    <id>OSS Sonatype snapshots</id>
    <url>https://s01.oss.sonatype.org/content/repositories/snapshots/</url>
    <snapshots>
        <enabled>true</enabled>
        <updatePolicy>always</updatePolicy>
    </snapshots>
    <releases>
        <enabled>false</enabled>
    </releases>
</repository>

Porting from 3.x to 4.x

Version 4.x introduces a major change in the handling of encodings: the mapping from the encoding label found in the meta tag to the encoding used for parsing the document has changed significantly. Starting with version 4.0, the mapping is in sync with the spec.

For this also

Porting from 2.x to 3.x

Usually the upgrade should be simple:

However, version 3 removed some features and classes. If you run into problems or miss something important for your project, please open an issue.

Start NekoHtml Development

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.

Prerequisites

You only need a local Maven installation.

Building

Create a local clone of the repository and you are ready to start.

Open a command line window in the root folder of the project and run:

mvn compile

Running the tests

mvn test

Contributing

Pull Requests and all other Community Contributions are essential for open source software. Every contribution, from bug reports to feature requests and from typos to full new features, is greatly appreciated.

Deployment and Versioning

This part is intended for committers who are packaging a release.

   mvn versions:display-plugin-updates
   mvn versions:display-dependency-updates
   mvn -U clean test
   mvn -up clean deploy

History

HtmlUnit has been using the CyberNeko HTML parser (http://nekohtml.sourceforge.net/) for a long time. Since its development was discontinued around 2014, we started our own fork, which now has many improvements.

As of version 2.68.0, neko-htmlunit also uses its own fork of Xerces (https://github.com/apache/xerces2-j). This forked code is integrated into the code base to further reduce the external dependencies.
This made it possible to remove many unneeded parts and dependencies to ensure e.g. compatibility with Android.

Authors

License

This project is licensed under the Apache 2.0 License.

Acknowledgments

Many thanks to all of you contributing to HtmlUnit/CSSParser/Rhino/NekoHtml in the past.