RDFLib / rdflib

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
https://rdflib.readthedocs.org
BSD 3-Clause "New" or "Revised" License

Implement RDF Patch serializer #2877

Closed recalcitrantsupplant closed 1 month ago

recalcitrantsupplant commented 1 month ago

Supports serialization from Dataset instances only; triples and quads within a Dataset are supported.

Summary of changes

Three methods to create RDF Patches from RDFLib Datasets:

  1. Serialize a Dataset as an addition patch
  2. Serialize a Dataset as a delete patch
  3. Create a patch representing the difference between a Dataset instance and a target Dataset instance

Basic usage:

  1. ds.serialize(format="patch", operation="add")
  2. ds.serialize(format="patch", operation="remove")
  3. ds1.serialize(format="patch", target=ds2)

Complete examples are provided in an example script.
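
The three modes above can be illustrated without RDFLib at all. The following is a minimal sketch, not the serializer from this PR: it models a Dataset as a plain set of tuples, omits the patch header block, and assumes that the diff mode emits the rows needed to turn the source into the target. The names `Quad`, `patch_row`, and `make_patch` are hypothetical.

```python
from typing import List, Optional, Set, Tuple

# A quad is (subject, predicate, object, graph); graph is None for a plain triple.
Quad = Tuple[str, str, str, Optional[str]]


def patch_row(op: str, quad: Quad) -> str:
    """Format one patch row: 'A' adds a triple or quad, 'D' deletes it."""
    s, p, o, g = quad
    terms = [s, p, o] + ([g] if g is not None else [])
    return f"{op} {' '.join(terms)} ."


def _ordered(quads: Set[Quad]) -> List[Quad]:
    # Sort for stable output; None graphs are compared as empty strings.
    return sorted(quads, key=lambda q: tuple(t or "" for t in q))


def make_patch(source: Set[Quad], operation: Optional[str] = None,
               target: Optional[Set[Quad]] = None) -> str:
    """Build patch text for the add, remove, or diff mode (header omitted)."""
    if target is not None:
        # Diff mode (assumed direction): rows that turn `source` into `target`.
        rows = [patch_row("D", q) for q in _ordered(source - target)]
        rows += [patch_row("A", q) for q in _ordered(target - source)]
    elif operation == "add":
        rows = [patch_row("A", q) for q in _ordered(source)]
    elif operation == "remove":
        rows = [patch_row("D", q) for q in _ordered(source)]
    else:
        raise ValueError("pass operation='add', operation='remove', or a target")
    # Wrap the rows in a single transaction: TX begins it, TC commits it.
    return "\n".join(["TX ."] + rows + ["TC ."])


ds1 = {("<http://ex/a>", "<http://ex/p>", "<http://ex/b>", None)}
ds2 = {("<http://ex/a>", "<http://ex/p>", "<http://ex/c>", "<http://ex/g>")}
print(make_patch(ds1, operation="add"))
print(make_patch(ds1, target=ds2))
```

With RDFLib itself, the equivalent calls are the three `ds.serialize(format="patch", ...)` forms listed under basic usage.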

ashleysommer commented 1 month ago

The mypy issues in main should be resolved now. I've resynced this PR; hopefully it passes now.

ashleysommer commented 1 month ago

@recalcitrantsupplant Still 4 mypy errors related to this PR:

poetry run python -m mypy --show-error-context --show-error-codes --junit-xml=test_reports/3.9-macos-latest-mypy-junit.xml
  rdflib/plugins/serializers/patch.py: note: In member "serialize" of class "PatchSerializer":
  rdflib/plugins/serializers/patch.py:30: error: Signature of "serialize" incompatible with supertype "Serializer"  [override]
  rdflib/plugins/serializers/patch.py:30: note:      Superclass:
  rdflib/plugins/serializers/patch.py:30: note:          def serialize(self, stream: IO[bytes], base: Optional[str] = ..., encoding: Optional[str] = ..., **args: Any) -> None
  rdflib/plugins/serializers/patch.py:30: note:      Subclass:
  rdflib/plugins/serializers/patch.py:30: note:          def serialize(self, stream: IO[bytes], base: Optional[str] = ..., encoding: Optional[str] = ..., operation: Optional[str] = ..., target: Optional[Graph] = ..., header_id: Optional[str] = ..., header_prev: Optional[str] = ...) -> Any
  rdflib/plugins/serializers/patch.py: note: In function "serialize":
  rdflib/plugins/serializers/patch.py:60: error: "Graph" has no attribute "get_context"  [attr-defined]
  rdflib/plugins/serializers/patch.py: note: In member "serialize" of class "PatchSerializer":
  rdflib/plugins/serializers/patch.py:74: error: "Graph" has no attribute "contexts"  [attr-defined]
  rdflib/plugins/serializers/patch.py: note: In member "_patch_row" of class "PatchSerializer":
  rdflib/plugins/serializers/patch.py:88: error: "Graph" has no attribute "default_context"  [attr-defined]
  Found 4 errors in 1 file (checked 401 source files)
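
Both kinds of error above have a common remedy, sketched below with stand-in classes rather than RDFLib's real ones. For the `[override]` error, the subclass keeps the superclass signature and reads its format-specific options from the keyword arguments. For the `[attr-defined]` errors, an `isinstance` check narrows the store to the subclass that has the attribute. `Graph`, `Dataset`, `Serializer`, and `PatchSerializer` here are simplified, hypothetical versions, and this is not necessarily how the PR resolved the errors.

```python
from io import BytesIO
from typing import IO, Any, List, Optional


class Graph:
    """Stand-in for a single graph; it has no contexts() method."""


class Dataset(Graph):
    """Stand-in for a multi-graph container."""

    def contexts(self) -> List[Graph]:
        return []


class Serializer:
    def __init__(self, store: Graph) -> None:
        self.store = store

    def serialize(self, stream: IO[bytes], base: Optional[str] = None,
                  encoding: Optional[str] = None, **args: Any) -> None:
        raise NotImplementedError


class PatchSerializer(Serializer):
    # Same parameters as the superclass, so the override is compatible.
    def serialize(self, stream: IO[bytes], base: Optional[str] = None,
                  encoding: Optional[str] = None, **kwargs: Any) -> None:
        # Format-specific options come out of **kwargs instead of new parameters.
        operation: Optional[str] = kwargs.get("operation")
        # Narrow Graph to Dataset before touching Dataset-only attributes.
        if not isinstance(self.store, Dataset):
            raise TypeError("patch serialization requires a Dataset")
        graph_count = len(self.store.contexts())
        line = f"operation={operation} graphs={graph_count}\n"
        stream.write(line.encode(encoding or "utf-8"))


buf = BytesIO()
PatchSerializer(Dataset()).serialize(buf, operation="add")
print(buf.getvalue())
```

The trade-off is that callers no longer see `operation` or `target` in the signature, so those options need to be documented in the docstring instead.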

coveralls commented 1 month ago

Coverage Status

coverage: 90.611% (+0.02%) from 90.595% when pulling e58efff8808c4ed1029ec4c264558443f2d724ba on recalcitrantsupplant:david/rdf-patch-serialiser into 0c11debb5178157baeac27b735e49a757916d2a6 on RDFLib:main.

ashleysommer commented 1 month ago

@recalcitrantsupplant I thought this was complete, passing tests, and ready to merge. However, after merging it, some test failures are now appearing in main, and I've verified locally that they are related to the roundtrip tests. Because this .patch parser is a registered RDFLib parser plugin, the test suite includes it in the roundtrip tests. A number of .nt->.patch conversion tests and .n3->.patch roundtrip tests are failing with odd errors. I don't have time to troubleshoot it now.

As @edmondchuc mentioned above, I don't know what use this is as a regular RDFLib serializer, because you really need to associate it with an action such as add or remove, or with a target, and none of these are defined by default in operations like roundtrip. It makes more sense as a helper function that lets you script the creation of a .patch file.

To allow main to pass tests again, I'm going to revert this merge. We can revisit it again soon.

ashleysommer commented 1 month ago

@recalcitrantsupplant Another update: I haven't reverted this yet. I found that if I add a fallback for the case where neither operation nor target is passed, defaulting to the "add" operation, then all roundtrip tests pass. Does that sound like a sensible change to you? I would assume that anyone naively serializing a graph or Dataset to .patch format without specifying an operation wants the "add" patch operation by default.

This prevents unexpectedly getting a .patch file that contains only the header block and no content (that's why the test cases were failing).
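
The fallback described above amounts to a single check. A minimal sketch, assuming the options arrive as optional values; `effective_operation` is a hypothetical helper name, not a function in RDFLib:

```python
from typing import Optional


def effective_operation(operation: Optional[str],
                        target: Optional[object]) -> Optional[str]:
    """Default to "add" only when neither operation nor target was passed."""
    if operation is None and target is None:
        return "add"
    # Otherwise leave the caller's choice untouched; None here means diff mode.
    return operation


print(effective_operation(None, None))      # naive serialize call
print(effective_operation("remove", None))  # explicit operation is kept
```

A naive call such as a roundtrip test passes neither option, so under this rule it produces add rows instead of a header-only file.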

ashleysommer commented 1 month ago

The issue described in the previous two comments is now fixed by #2898, so tests in main are passing again.