An automatic web page content extractor for Kotlin and Java.
Given an HTML document, essence automatically extracts the main text content (and much more).
Try out the demo - a simple webapp to demonstrate essence.
This library is inspired by node-unfluff and its lineage
Java
import io.github.cdimascio.essence.Essence;
EssenceResult data = Essence.extract(html);
System.out.println(data.getText());
Kotlin
val data = Essence.extract(html)
println(data.text)
See Extracted data elements for additional extracted metadata.
Maven
<dependency>
<groupId>io.github.cdimascio</groupId>
<artifactId>essence</artifactId>
<version>0.13.0</version>
<type>pom</type>
</dependency>
Gradle
compile 'io.github.cdimascio:essence:0.13.0'
Essence web is a simple web page that fetches content at a given url and passes the HTML to this essence library.
The essence web project lives here
essence attempts to extract the following content:
title
- The document's titlesoftTitle
- A version of title
with less truncationdate
- The document's publication datecopyright
- The document's copyright line, if presentauthor
- The document's authorpublisher
- The document's publisher (website name)text
- The main text of the document with all the junk thrown awayimage
- The main image for the document (what's used by facebook, etc.)videos
- An array of videos that were embedded in the article. Each video has src, width and height.tags
- Any tags or keywords that could be found by checking <rel> tags or by looking at href urls.canonicalLink
- The canonical url of the document, if given.lang
- The language of the document, either detected or supplied by you.description
- The description of the document, from <meta> tagsfavicon
- The url of the document's favicon.links
- An array of links embedded within the article text. (text and href for each)Thanks goes to these wonderful people (emoji key):
Clément P. 💻 |
This project follows the all-contributors specification. Contributions of any kind welcome!