cdimascio / essence

Automatically extract the main text content (and more) from an HTML document
Apache License 2.0

essence

An automatic web page content extractor for Kotlin and Java.

Given an HTML document, essence automatically extracts the main text content (and much more).

Try out the demo, a simple web app that demonstrates essence.

This library is inspired by node-unfluff and its lineage.

Usage

Java

import io.github.cdimascio.essence.Essence;

EssenceResult data = Essence.extract(html);
System.out.println(data.getText());

Kotlin

val data = Essence.extract(html)
println(data.text)

See Extracted data elements for additional extracted metadata.

Install

Maven

<dependency>
  <groupId>io.github.cdimascio</groupId>
  <artifactId>essence</artifactId>
  <version>0.13.0</version>
</dependency>

Gradle

implementation 'io.github.cdimascio:essence:0.13.0'

Try the Essence web demo

Essence web is a simple web page that fetches the content at a given URL and passes the HTML to the essence library.

The essence web project lives here
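The fetch-then-extract flow that the demo describes can be sketched in plain Java. This is a minimal illustration, not the demo's actual code: the class PageTextPipeline and its methods httpFetch and run are hypothetical names, and the sketch assumes Java 11 or later for java.net.http. The extractor is passed in as a function so the sketch compiles without essence on the classpath.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.function.Function;

// Hypothetical helper class; not part of essence or the essence web demo.
public class PageTextPipeline {

    // Downloads the page at `url` and returns the response body as a String.
    static String httpFetch(String url) {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        try {
            return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            // Restore the interrupt flag before surfacing the failure.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while fetching " + url, e);
        }
    }

    // Fetches the HTML for `url`, then hands it to the supplied extractor.
    static String run(String url,
                      Function<String, String> fetcher,
                      Function<String, String> extractor) {
        String html = fetcher.apply(url);
        return extractor.apply(html);
    }
}
```

With essence on the classpath, the extractor argument would presumably be html -> Essence.extract(html).getText(), using the calls shown in the Usage section, and the fetcher would be PageTextPipeline::httpFetch.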

Extracted data elements

essence attempts to extract the following content:

Credits

License

Apache 2.0

Contributors ✨

Thanks go to these wonderful people (emoji key):

Clément P. 💻

This project follows the all-contributors specification. Contributions of any kind welcome!