RDFLib / rdflib

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
https://rdflib.readthedocs.org
BSD 3-Clause "New" or "Revised" License

Add JSON-LD extraction from HTML #2804

Closed: wallberg closed this 3 months ago

wallberg commented 5 months ago

Implementation of issue #2692.

See also https://w3c.github.io/json-ld-syntax/#embedding-json-ld-in-html-documents and https://www.w3.org/TR/json-ld11-api/#html-content-algorithms .

Summary of changes

If source.content_type is "text/html" or "application/xhtml+xml" then parse the document as HTML and extract script elements of type="application/ld+json" as JSON-LD.
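For readers unfamiliar with the approach, here is a minimal sketch of pulling JSON-LD out of an HTML document with the standard-library html.parser module. It is not the code from this PR: the class JsonLdScriptCollector, the function extract_json_ld, and the extract_all flag are hypothetical names chosen for illustration, and the sketch skips details the W3C algorithms cover, such as fragment identifiers and the base element.

```python
import json
from html.parser import HTMLParser


class JsonLdScriptCollector(HTMLParser):
    """Hypothetical helper: collects the text of JSON-LD script elements."""

    def __init__(self):
        super().__init__()
        self.scripts = []    # raw text of each matching script element
        self._chunks = None  # list of text chunks while inside a match

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        # Ignore media type parameters and letter case when comparing.
        raw_type = dict(attrs).get("type") or ""
        media_type = raw_type.split(";")[0].strip().lower()
        if media_type == "application/ld+json":
            self._chunks = []

    def handle_data(self, data):
        # Script content may arrive in several pieces, so accumulate it.
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._chunks is not None:
            self.scripts.append("".join(self._chunks))
            self._chunks = None


def extract_json_ld(html, extract_all=False):
    """Return the first JSON-LD document found, or all of them as a list.

    The extract_all flag is an assumption made for this sketch only.
    """
    collector = JsonLdScriptCollector()
    collector.feed(html)
    collector.close()
    if extract_all:
        return [json.loads(text) for text in collector.scripts]
    if not collector.scripts:
        return None
    return json.loads(collector.scripts[0])
```

Calling extract_json_ld(html) on a page with several matching script elements returns only the first parsed document, which mirrors the default behavior described here; passing extract_all=True returns every match in document order.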

The default behavior is to extract only the first matching script element. These overrides are available:

Detailed changes

rdflib.plugins.parsers.jsonld.JsonLDParser.parse

rdflib.plugins.shared.jsonld.util.source_to_json

test/jsonld/test_onedotone.py

test/jsonld/runner.py

Breaking Changes

When rdflib.plugins.shared.jsonld.util.source_to_json extracts JSON-LD from HTML, it needs to return the value of the HTML base element in addition to the JSON. I took the simplest path and returned a tuple containing the JSON and the base.
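To illustrate why the tuple return is a breaking change, here is a rough sketch of a function that returns both the parsed JSON and the href of the HTML base element. The names html_to_json_and_base and _BaseAndJsonLdParser are hypothetical and this is not the actual source_to_json implementation; it only shows the shape of the return value under the assumption that the first base element and the first JSON-LD script element are the ones used.

```python
import json
from html.parser import HTMLParser


class _BaseAndJsonLdParser(HTMLParser):
    """Hypothetical parser: records the first base href and JSON-LD script."""

    def __init__(self):
        super().__init__()
        self.base = None         # href of the first base element, if any
        self.script_text = None  # text of the first JSON-LD script element
        self._in_script = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "base" and self.base is None:
            self.base = attributes.get("href")
        elif (
            tag == "script"
            and self.script_text is None
            and attributes.get("type") == "application/ld+json"
        ):
            self._in_script = True

    def handle_data(self, data):
        if self._in_script:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            self.script_text = "".join(self._chunks)
            self._in_script = False


def html_to_json_and_base(html):
    """Return a (json_data, base_href) tuple; either item may be None."""
    parser = _BaseAndJsonLdParser()
    parser.feed(html)
    parser.close()
    data = None
    if parser.script_text is not None:
        data = json.loads(parser.script_text)
    return data, parser.base
```

A caller written against a single return value, such as data = html_to_json_and_base(html), would now receive the whole tuple and has to be updated to data, base = html_to_json_and_base(html).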

I can think of other ways to return the base without breaking the current return value:

Checklist

coveralls commented 5 months ago

coverage: 91.036% (+0.006%) from 91.03% when pulling 53b353fbbf5147b9d2b6654532fbcc553b6881c7 on wallberg:issue-2692-embedded-jsonld into 0ecc40009ae397c2798c0c08a2d751a1a9d2f8a7 on RDFLib:main.

coveralls commented 3 months ago

coverage: 91.067% (+0.02%) from 91.047% when pulling 13166ecea8fa8696895fcd83f148a46005b965a9 on wallberg:issue-2692-embedded-jsonld into bb170723b21c1cfb5b90f05b541d02be53867574 on RDFLib:main.

nicholascar commented 3 months ago

@wallberg you have prefixed this PR with "Draft" but it's not actually a draft PR. Do you consider it ready for review?

Since @ashleysommer fixed the GitHub validation pipeline, it appears to be passing all tests.

wallberg commented 3 months ago

@nicholascar yes, ready for review.

ashleysommer commented 3 months ago

I'm happy to see this is using the built-in html.parser library, because we will soon be removing the old html5lib dependency.