Summary of changes

If source.content_type is "text/html" or "application/xhtml+xml" then parse the document as HTML and extract script elements of type="application/ld+json" as JSON-LD.

The default behavior is to extract only the first matching script element. These overrides are available:

To extract all script elements: supply an optional extract_all_scripts=True parameter to JsonLDParser.parse()
To extract one script element with a specific id attribute value: add the id value as a fragment identifier in the IRI available from source.getSystemId()
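
A rough sketch of how these three cases relate, not the code in this PR. The helper name select_scripts and the (id, text) pair layout are hypothetical, and the precedence (fragment identifier first, then extract-all, then first match) is an assumption borrowed from the ordering in the JSON-LD 1.1 API rather than something stated above.

```python
import json
from typing import Any, List, Optional, Tuple


def select_scripts(
    scripts: List[Tuple[Optional[str], str]],
    fragment_id: Optional[str] = None,
    extract_all: bool = False,
) -> Any:
    """Pick JSON-LD from already-collected script elements.

    `scripts` holds (id attribute or None, script text) pairs in document
    order. Hypothetical helper, for illustration only.
    """
    if fragment_id is not None:
        # A fragment identifier selects exactly one script element by id.
        for script_id, text in scripts:
            if script_id == fragment_id:
                return json.loads(text)
        raise ValueError(f"no JSON-LD script element with id {fragment_id!r}")

    if not scripts:
        raise ValueError("no JSON-LD script element found")

    if extract_all:
        # Merge every script element into one list of documents.
        documents: List[Any] = []
        for _, text in scripts:
            doc = json.loads(text)
            documents.extend(doc if isinstance(doc, list) else [doc])
        return documents

    # Default: only the first matching script element.
    return json.loads(scripts[0][1])
```

With two script elements, the default call returns only the first document, extract_all=True returns both in a list, and fragment_id="second" returns the one whose id is "second".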

Detailed changes

rdflib.plugins.parsers.jsonld.JsonLDParser.parse

add docstring
change parameter list from **kwargs to explicit list
add optional extract_all_scripts parameter
get the fragment identifier from source.getSystemId()
add fragment_id and extract_all_scripts parameters to the call to source_to_json
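
For illustration, one way the fragment identifier could be separated from the system id. This is a minimal sketch that assumes the standard library's urllib.parse.urldefrag; the helper name split_fragment is hypothetical and the PR may do this differently.

```python
from typing import Optional, Tuple
from urllib.parse import urldefrag


def split_fragment(system_id: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Return (IRI without fragment, fragment or None). Hypothetical helper."""
    if not system_id:
        return system_id, None
    url, fragment = urldefrag(system_id)
    # urldefrag yields "" when there is no fragment; normalise that to None.
    return url, (fragment or None)
```

The second value is what would be handed on as fragment_id; it stays None for an IRI without a fragment, so the default first-script behavior applies.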

rdflib.plugins.shared.jsonld.util.source_to_json

add docstring
add optional fragment_id and extract_all_scripts parameters
change the return value to a tuple with the extracted JSON document and value of the HTML base element
if source.content_type is "text/html" or "application/xhtml+xml" then parse source as HTML and extract the appropriate script element(s) and the HTML base element
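
The extraction step can be sketched with the standard library's html.parser. This is a simplified illustration, not the implementation in this PR: the class and function names are hypothetical, it matches the type attribute exactly (no media type parameters), and it returns only the first script element.

```python
import json
from html.parser import HTMLParser
from typing import Any, List, Optional, Tuple


class JsonLdScriptCollector(HTMLParser):
    """Collect JSON-LD script elements and the first base href (sketch)."""

    def __init__(self) -> None:
        super().__init__()
        self.base: Optional[str] = None
        self.scripts: List[Tuple[Optional[str], str]] = []
        self._current_id: Optional[str] = None
        self._buffer: Optional[List[str]] = None  # non-None while inside a JSON-LD script

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "base" and self.base is None:
            self.base = attributes.get("href")
        elif tag == "script" and attributes.get("type") == "application/ld+json":
            self._current_id = attributes.get("id")
            self._buffer = []

    def handle_data(self, data):
        # Script text may arrive in several chunks, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.scripts.append((self._current_id, "".join(self._buffer)))
            self._buffer = None


def extract_json_and_base(html_text: str) -> Tuple[Any, Optional[str]]:
    """Return (first JSON-LD document, base href or None)."""
    collector = JsonLdScriptCollector()
    collector.feed(html_text)
    collector.close()
    if not collector.scripts:
        raise ValueError("no JSON-LD script element found")
    _, text = collector.scripts[0]
    return json.loads(text), collector.base
```

Returning the base alongside the JSON lets the caller resolve relative IRIs in the extracted document against the HTML base element.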

test/jsonld/test_onedotone.py

enable all existing html tests (except html/f004-in). (Note: for more information on the failing html/f004-in test, see https://lists.w3.org/Archives/Public/public-json-ld-wg/2024May/0000.html)
if inputpath ends with ".html" (with optional fragment identifier) then invoke runner.do_test_html

test/jsonld/runner.py

add new do_test_html function (Note: the html test cases in the JSON-LD Test Suite combine JSON-LD extraction from HTML with other algorithms (e.g. compact/flatten) that rdflib does not currently support. To test extraction alone and ignore the compact/flatten algorithms, do_test_html compares the resulting graphs using rdflib.compare.isomorphic instead of serializing back to JSON)

Breaking Changes

When rdflib.plugins.shared.jsonld.util.source_to_json extracts JSON-LD from HTML, it needs to return the value of the HTML base element in addition to the JSON. I took the simplest path and returned a tuple containing the JSON and the base.

I can think of other ways to return the base without breaking the current return value:

Return json when processing a json document and tuple (json, base) when processing an html document.
Add an optional parameter to return tuple (json, base) instead of json.
Continue returning only json, but add an optional parameter which will receive the value of base.
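
To make the last two alternatives concrete, here is a small sketch. All names are hypothetical, and _extract is only a stand-in for the real extraction, taking a pre-extracted (json_text, base) pair.

```python
import json
from typing import Any, Dict, Optional, Tuple


def _extract(raw: Tuple[str, Optional[str]]) -> Tuple[Any, Optional[str]]:
    # Stand-in for the real extraction: raw is (json_text, base).
    json_text, base = raw
    return json.loads(json_text), base


def to_json_with_flag(raw, return_base: bool = False):
    """Alternative: opt in to the (json, base) tuple with a parameter."""
    doc, base = _extract(raw)
    return (doc, base) if return_base else doc


def to_json_with_out_param(raw, base_out: Optional[Dict[str, Any]] = None):
    """Alternative: always return json; write base into a caller-supplied dict."""
    doc, base = _extract(raw)
    if base_out is not None:
        base_out["base"] = base
    return doc
```

Both keep the existing return value for callers that pass no extra argument; the cost is a return type that depends on a flag in the first case and an out-parameter in the second.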

Checklist

- [x] Checked that there aren't other open pull requests for the same change.
- [x] Checked that all tests and type checking pass.
- If the change adds new features or changes the RDFLib public API:
  - [x] Created an issue to discuss the change and get in-principle agreement. #2692
  - [ ] Considered adding an example in ./examples.
- If the change has a potential impact on users of this project:
  - [x] Added or updated tests that fail without the change.
  - [ ] Updated relevant documentation to avoid inaccuracies.
  - [ ] Considered adding additional documentation.
- [x] Considered granting push permissions to the PR branch, so maintainers can fix minor issues and keep your PR up to date.

coveralls commented 5 months ago

coverage: 91.036% (+0.006%) from 91.03% when pulling 53b353fbbf5147b9d2b6654532fbcc553b6881c7 on wallberg:issue-2692-embedded-jsonld into 0ecc40009ae397c2798c0c08a2d751a1a9d2f8a7 on RDFLib:main.

coveralls commented 3 months ago

coverage: 91.067% (+0.02%) from 91.047% when pulling 13166ecea8fa8696895fcd83f148a46005b965a9 on wallberg:issue-2692-embedded-jsonld into bb170723b21c1cfb5b90f05b541d02be53867574 on RDFLib:main.

nicholascar commented 3 months ago

@wallberg you have prefixed this PR with "Draft" but it's not actually a draft PR. Do you consider it ready for review?

Since @ashleysommer fixed the GitHub validation pipeline, it appears to be passing all tests.

wallberg commented 3 months ago

@nicholascar yes, ready for review.

ashleysommer commented 3 months ago

I'm happy to see this is using the built-in html.parser library, because we will soon be removing the old html5lib dependency from our dependencies.

RDFLib / rdflib

Add JSON-LD extraction from HTML #2804
