Open Seirdy opened 2 years ago
A closer look reveals that the separate metaxpaths.py does scan the body for the author and headline microdata properties, but not the description property; metadata.py scans the <head>
for all three. Moving all microdata extraction to the same document scope would make this simpler, more consistent, and more thorough.
Hi @Seirdy, it seems like an interesting idea but I don't quite see what is currently lacking in the software.
Could you please provide a concrete example of what you would like to achieve? Is the description property you mention frequent?
On Tue, Feb 15, 2022 at 11:17:56AM -0800, Adrien Barbaresi wrote:
> Could you please provide a concrete example of what you would like to achieve? Is the description property you mention frequent?
Sorry, description was a bad example. Publishers, tags, Dublin Core metadata, and license info are better examples.
Currently, Trafilatura extracts metadata from JSON-LD including the author, headline, category, etc. This is typically included in the <head> of the page. It also has a similar metadata.py for microdata and RDFa vocabs (inc. Dublin Core!). But microdata is a way of adding semantic information to the *body* of a page. It's really not meant to be used in the <head>. RDFa can be used in either the body or the <head>.
For example: if we were to ignore the fact that seirdy.one uses Open Graph metadata in the document <head>, posts on https://seirdy.one/ have microdata (and microformats) that mark up the article, headline, author, license, etc. Trafilatura's microdata parsing would ignore most of this.
Thanks for the info, I get your point. I don't know how rare it is, but I assume it is uncommon for web pages to convey information in the HTML body which is not present in the header.
It seems that trafilatura attempts to parse schema.org-compliant structured data such as microdata and some RDFa, which is great; however, this is limited to data present in the document <head>.
The main reason to use microdata, RDFa, and microformats instead of just meta tags, Open Graph (an optionally-compliant subset of RDFa), etc. is to add semantic meaning to visible page content. That means marking up <body> content. A byline could have an "author" property to mark up the author's name, for instance.
I suggest expanding RDFa and microdata extraction to the full document, the <body>, or a subset of the <body> that's marked as significant such as <main> or mainEntity. Alternatively, Trafilatura could make use of an external library that specializes in extracting all structured data across a document such as extruct.
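To make the suggestion concrete, here is a minimal sketch of whole-document microdata scanning using only the Python standard library. It is not Trafilatura's code and not how extruct works internally; `ItempropCollector`, `collect_itemprops`, and the sample markup are hypothetical names and data made up for illustration. It also simplifies the microdata rules a lot: it handles only a few attribute-valued elements and does not build nested items.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they are not tracked as "open".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

# Simplified subset: for these elements the value is an attribute, not text.
VALUE_ATTRS = {"meta": "content", "link": "href", "a": "href", "img": "src"}


class ItempropCollector(HTMLParser):
    """Hypothetical helper: collect itemprop values from <head> and <body>."""

    def __init__(self):
        super().__init__()
        self.found = {}    # property name -> list of values
        self._stack = []   # open elements: (tag, itemprop names, text parts)

    def _record(self, names, value):
        value = " ".join(value.split())  # normalize whitespace
        if value:
            for name in names:
                self.found.setdefault(name, []).append(value)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        names = (attrs.get("itemprop") or "").split()
        if names and tag in VALUE_ATTRS:
            # Value comes from an attribute; no need to wait for text.
            self._record(names, attrs.get(VALUE_ATTRS[tag]) or "")
            names = []
        if tag not in VOID_TAGS:
            self._stack.append((tag, names, []))

    def handle_data(self, data):
        # Text belongs to every open element that carries an itemprop.
        for _tag, names, parts in self._stack:
            if names:
                parts.append(data)

    def handle_endtag(self, tag):
        if not any(entry[0] == tag for entry in self._stack):
            return  # stray end tag, ignore
        while self._stack:
            open_tag, names, parts = self._stack.pop()
            self._record(names, "".join(parts))
            if open_tag == tag:
                break


def collect_itemprops(html):
    parser = ItempropCollector()
    parser.feed(html)
    parser.close()
    return parser.found


SAMPLE = """
<html><head><meta itemprop="description" content="A short summary"></head>
<body><article itemscope itemtype="https://schema.org/BlogPosting">
<h1 itemprop="headline">Example post</h1>
<p>By <span itemprop="author">Jane Doe</span></p>
<a itemprop="license" href="https://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA 4.0</a>
</article></body></html>
"""

props = collect_itemprops(SAMPLE)
print(props)
```

In this sample, the headline, author, and license appear only in the <body>, so a scan limited to the <head> would find just the description. A real implementation would more likely reuse the existing lxml tree or delegate to extruct rather than re-parse the page like this.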