RDFLib / rdflib

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
https://rdflib.readthedocs.org
BSD 3-Clause "New" or "Revised" License

Draft: Add JSON-LD extraction from HTML #2804

Open wallberg opened 1 week ago

wallberg commented 1 week ago

Draft implementation of issue #2692.

See also https://w3c.github.io/json-ld-syntax/#embedding-json-ld-in-html-documents and https://www.w3.org/TR/json-ld11-api/#html-content-algorithms .

Summary of changes

If source.content_type is "text/html" or "application/xhtml+xml" then parse the document as HTML and extract script elements of type="application/ld+json" as JSON-LD.
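A minimal sketch of that extraction step, using only the standard library's html.parser (the PR itself may use a different HTML parser). The names JsonLdScriptCollector, extract_jsonld, and extract_all are hypothetical and are not part of rdflib:

```python
import json
from html.parser import HTMLParser


class JsonLdScriptCollector(HTMLParser):
    """Hypothetical helper: collects the raw text of JSON-LD script elements."""

    def __init__(self):
        super().__init__()
        self.scripts = []    # raw text of each matching script element
        self._buffer = None  # a list while inside a matching script, else None

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        # attrs is a list of (name, value) pairs; value may be None.
        raw_type = dict(attrs).get("type") or ""
        # Ignore media type parameters such as ";profile=...".
        media_type = raw_type.split(";")[0].strip().lower()
        if media_type == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.scripts.append("".join(self._buffer))
            self._buffer = None


def extract_jsonld(html_text, extract_all=False):
    """Return the first JSON-LD document, or all of them if extract_all is set."""
    collector = JsonLdScriptCollector()
    collector.feed(html_text)
    collector.close()
    if extract_all:
        return [json.loads(text) for text in collector.scripts]
    # Only the first matching script element is parsed by default.
    return json.loads(collector.scripts[0]) if collector.scripts else None
```

Scripts of any other type are skipped, and in the default mode later JSON-LD scripts are never parsed, so a malformed second script does not affect the result.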

The default behavior is to extract only the first matching script element. These overrides are available:

Detailed changes

rdflib.plugins.parsers.jsonld.JsonLDParser.parse

rdflib.plugins.shared.jsonld.util.source_to_json

test/jsonld/test_onedotone.py

test/jsonld/runner.py

Breaking Changes

When rdflib.plugins.shared.jsonld.util.source_to_json extracts JSON-LD from HTML, it needs to return the value of the HTML base element in addition to the JSON. I took the simplest path and returned a tuple containing the JSON and the base.
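To illustrate the shape of that return value, here is a hedged sketch of a function that returns both the parsed JSON and the base as a tuple. It is not the PR's code: html_to_json_and_base and its helper class are hypothetical names, and the standard library's html.parser stands in for whatever parser the PR uses.

```python
import json
from html.parser import HTMLParser
from typing import Any, Optional, Tuple


class _BaseAndJsonLd(HTMLParser):
    """Hypothetical helper: records the first <base href> and first JSON-LD script."""

    def __init__(self):
        super().__init__()
        self.base = None          # href of the first base element, if any
        self.first_script = None  # raw text of the first JSON-LD script
        self._buffer = None       # a list while inside that script, else None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "base" and self.base is None:
            self.base = attributes.get("href")
        elif tag == "script" and self.first_script is None:
            media_type = (attributes.get("type") or "").split(";")[0]
            if media_type.strip().lower() == "application/ld+json":
                self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.first_script = "".join(self._buffer)
            self._buffer = None


def html_to_json_and_base(html_text: str) -> Tuple[Optional[Any], Optional[str]]:
    """Return (parsed JSON-LD, base href); either item may be None."""
    parser = _BaseAndJsonLd()
    parser.feed(html_text)
    parser.close()
    data = json.loads(parser.first_script) if parser.first_script else None
    return data, parser.base
```

A caller that previously bound a single value would now need to unpack two, for example `data, base = html_to_json_and_base(html_text)`, which is why the change is listed as breaking.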

I can think of other ways to return the base without breaking the current return value:

Checklist

coveralls commented 1 week ago

Coverage Status

coverage: 91.036% (+0.006%) from 91.03% when pulling 53b353fbbf5147b9d2b6654532fbcc553b6881c7 on wallberg:issue-2692-embedded-jsonld into 0ecc40009ae397c2798c0c08a2d751a1a9d2f8a7 on RDFLib:main.