adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

Extract inline structured data from page <body> #173

Open Seirdy opened 2 years ago

Seirdy commented 2 years ago

It seems that trafilatura attempts to parse schema.org-compliant structured data such as microdata and some RDFa, which is great; however, this is limited to data present in the document <head>.

The main reason to use microdata, RDFa, and microformats instead of just meta tags, Open Graph (an optionally-compliant subset of RDFa), etc. is to add semantic meaning to visible page content. That means marking up <body> content. A byline could have an "author" property to mark up the author's name, for instance.

I suggest expanding RDFa and microdata extraction to the full document, the <body>, or a subset of the <body> that's marked as significant such as <main> or mainEntity. Alternatively, Trafilatura could make use of an external library that specializes in extracting all structured data across a document such as extruct.
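As a rough illustration of what full-document scope could look like, here is a minimal sketch that collects every microdata `itemprop` in a page, in the `<head>` and the `<body>` alike. It is not Trafilatura's implementation and it does not use extruct's API; it relies only on the standard library's `html.parser`. The names `ItempropCollector`, `collect_itemprops`, and `SAMPLE` are hypothetical, and the value rules are a simplified subset of the microdata specification (nested items are flattened to their text content).

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they are never pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
# Simplified subset of the microdata value rules: for these elements the
# property value comes from an attribute instead of the text content.
VALUE_ATTR = {"meta": "content", "a": "href", "link": "href",
              "img": "src", "time": "datetime"}


class ItempropCollector(HTMLParser):
    """Collect itemprop values from the whole document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.props = {}   # property name -> list of values
        self._stack = []  # open elements as (tag, property names, text chunks)

    def _add(self, names, value):
        for name in names:
            self.props.setdefault(name, []).append(value)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        names = (attrs.get("itemprop") or "").split()
        attr = VALUE_ATTR.get(tag)
        if names and attr and attrs.get(attr) is not None:
            # The value is carried by an attribute; record it right away.
            self._add(names, attrs[attr])
            names = []
        if tag not in VOID:
            self._stack.append((tag, names, []))

    def handle_data(self, data):
        # Text belongs to every open element that is waiting for a text value.
        for _, names, chunks in self._stack:
            if names:
                chunks.append(data)

    def handle_endtag(self, tag):
        # Find the nearest open element with this tag; ignore stray end tags.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                break
        else:
            return
        # Close it together with any unclosed descendants.
        while len(self._stack) > i:
            _, names, chunks = self._stack.pop()
            self._add(names, " ".join("".join(chunks).split()))


def collect_itemprops(html):
    parser = ItempropCollector()
    parser.feed(html)
    parser.close()
    return parser.props


# Made-up sample page: one property in the <head>, the rest in the <body>.
SAMPLE = """<html><head>
<meta itemprop="description" content="A short summary">
</head><body>
<article itemscope itemtype="https://schema.org/BlogPosting">
  <h1 itemprop="headline">Example post</h1>
  <p>By <span itemprop="author" itemscope itemtype="https://schema.org/Person">
    <a itemprop="url" href="https://example.org/jane"><span itemprop="name">Jane Doe</span></a></span></p>
  <a itemprop="license" href="https://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA 4.0</a>
</article></body></html>"""

if __name__ == "__main__":
    for name, values in collect_itemprops(SAMPLE).items():
        print(name, values)
```

A head-only scan of this sample would find just the description; the headline, author, and license only exist as markup on visible body content.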

Seirdy commented 2 years ago

A closer look reveals that the separate metaxpaths.py does scan the body for the author and headline microdata properties, but not the description property; metadata.py scans the <head> for all three. Moving all microdata extraction to the same document scope would make this simpler, more consistent, and more thorough.

adbar commented 2 years ago

Hi @Seirdy, it seems like an interesting idea but I don't quite see what is currently lacking in the software.

Could you please provide a concrete example of what you would like to achieve? Is the description property you mention frequent?

Seirdy commented 2 years ago

On Tue, Feb 15, 2022 at 11:17:56AM -0800, Adrien Barbaresi wrote:

> Could you please provide a concrete example of what you would like to achieve? Is the description property you mention frequent?

Sorry, description was a bad example. Publishers, tags, Dublin Core metadata, and license info are better examples.

Currently, Trafilatura extracts metadata from JSON-LD including the author, headline, category, etc. This is typically included in the <head> of the page. It also has a similar metadata.py for microdata and RDFa vocabs (inc. Dublin Core!). But microdata is a way of adding semantic information to the *body* of a page. It's really not meant to be used in the <head>. RDFa can be used in either the body or the <head>. For example: if we were to ignore the fact that seirdy.one uses Open Graph metadata in the document <head>, posts on https://seirdy.one/ have microdata (and microformats) that mark up the article, headline, author, license, etc. Trafilatura's microdata parsing would ignore most of this.

adbar commented 2 years ago

Thanks for the info, I get your point. I don't know how rare it is but I assume it is uncommon for web pages to convey information in the HTML body which is not present in the header.