Closed wallberg closed 3 months ago
@wallberg you have prefixed this PR with "Draft" but it's not actually a draft PR. Do you consider it ready for review?
Since @ashleysommer fixed the GitHub validation pipeline, it appears to be passing all tests.
@nicholascar yes, ready for review.
I'm happy to see this is using the built-in html.parser library, because we will soon be removing the old html5lib dependency.
Implementation of issue #2692.
See also https://w3c.github.io/json-ld-syntax/#embedding-json-ld-in-html-documents and https://www.w3.org/TR/json-ld11-api/#html-content-algorithms .
Summary of changes
If source.content_type is "text/html" or "application/xhtml+xml" then parse the document as HTML and extract script elements of type="application/ld+json" as JSON-LD.
The default behavior is to extract only the first matching script element. These overrides are available:
extract_all_scripts=True parameter to JsonLDParser.parse()
source.getSystemId()
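To make the extraction behavior described above concrete, here is a minimal standalone sketch using only the standard library's html.parser. It is not the code from this PR: the names JsonLdScriptCollector, extract_jsonld, fragment_id and extract_all are hypothetical. Selecting a script by its id via a fragment identifier follows the HTML content algorithm in the linked JSON-LD 1.1 API specification; that this is the override associated with source.getSystemId() is an assumption.

```python
import json
from html.parser import HTMLParser
from typing import Any, Dict, List, Optional


class JsonLdScriptCollector(HTMLParser):
    """Collect the id and text of every <script type="application/ld+json">."""

    def __init__(self) -> None:
        super().__init__()
        self.scripts: List[Dict[str, Any]] = []
        self._in_jsonld = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # Ignore any media type parameters such as ";profile=..."
        media_type = (attr.get("type") or "").split(";")[0].strip().lower()
        if tag == "script" and media_type == "application/ld+json":
            self._in_jsonld = True
            self.scripts.append({"id": attr.get("id"), "text": ""})

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it.
        if self._in_jsonld:
            self.scripts[-1]["text"] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False


def extract_jsonld(
    html: str,
    fragment_id: Optional[str] = None,  # hypothetical: id taken from a URL fragment
    extract_all: bool = False,          # hypothetical analogue of extract_all_scripts
) -> List[Any]:
    """Return the parsed JSON-LD documents selected from an HTML string."""
    collector = JsonLdScriptCollector()
    collector.feed(html)
    collector.close()

    scripts = collector.scripts
    if fragment_id is not None:
        # Only the script element whose id matches the fragment.
        scripts = [s for s in scripts if s["id"] == fragment_id]
    elif not extract_all:
        # Default: only the first matching script element.
        scripts = scripts[:1]
    return [json.loads(s["text"]) for s in scripts]


SAMPLE_HTML = """<html><head>
<script type="application/ld+json" id="a">{"@id": "http://example.org/a"}</script>
<script type="text/javascript">var x = 1;</script>
<script type="application/ld+json" id="b">{"@id": "http://example.org/b"}</script>
</head></html>"""

first_only = extract_jsonld(SAMPLE_HTML)
everything = extract_jsonld(SAMPLE_HTML, extract_all=True)
by_id = extract_jsonld(SAMPLE_HTML, fragment_id="b")
```

In this sketch the result is always a list, so the three selection modes share one return type.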
Detailed changes
rdflib.plugins.parsers.jsonld.JsonLDParser.parse
rdflib.plugins.shared.jsonld.util.source_to_json
test/jsonld/test_onedotone.py
test/jsonld/runner.py
do_test_html function (Note: the HTML test cases from the JSON-LD Test Suite combine testing of JSON-LD extraction from HTML with testing of other algorithms (e.g. compact/flatten), which rdflib does not currently support. To test extraction only and ignore the compact/flatten algorithms, do_test_html performs a graph comparison using rdflib.compare.isomorphic, without serializing back to JSON.)
Breaking Changes
When rdflib.plugins.shared.jsonld.util.source_to_json extracts JSON-LD from HTML, it needs to return the value of the HTML base element in addition to the JSON. I took the simplest path and returned a tuple containing the JSON and the base.
I can think of other ways to return the base without breaking the current return value:
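As a rough illustration of the tuple-returning shape described above, and of why it breaks existing callers, here is a hypothetical standalone sketch. It is not the actual source_to_json code; the helper html_to_json_and_base and the class _FirstJsonLdAndBase are invented for this example, and it handles only the first matching script element.

```python
import json
from html.parser import HTMLParser
from typing import Any, Optional, Tuple


class _FirstJsonLdAndBase(HTMLParser):
    """Capture the first <base href> and the first JSON-LD script body."""

    def __init__(self) -> None:
        super().__init__()
        self.base_href: Optional[str] = None
        self.json_text: Optional[str] = None
        self._capturing = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "base" and self.base_href is None:
            self.base_href = attr.get("href")
        elif (
            tag == "script"
            and attr.get("type") == "application/ld+json"
            and self.json_text is None
        ):
            self._capturing = True
            self.json_text = ""

    def handle_data(self, data):
        if self._capturing:
            self.json_text += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._capturing = False


def html_to_json_and_base(html: str) -> Tuple[Optional[Any], Optional[str]]:
    """Hypothetical helper: return (parsed JSON, base href) as a tuple."""
    parser = _FirstJsonLdAndBase()
    parser.feed(html)
    parser.close()
    text = (parser.json_text or "").strip()
    data = json.loads(text) if text else None
    return data, parser.base_href


DOC = """<html><head>
<base href="http://example.org/base/">
<script type="application/ld+json">{"@id": "thing"}</script>
</head></html>"""

# Callers that previously wrote `data = ...` now have to unpack two values.
data, base = html_to_json_and_base(DOC)
```

The unpacking on the last line is the breaking part: any caller that treats the return value as the JSON itself would receive a tuple instead.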