apache / jena

Apache Jena
https://jena.apache.org/
Apache License 2.0
1.08k stars 642 forks source link

Adds support for SPARQL CDTs (lists and maps as literals) #2501

Open hartig opened 1 month ago

hartig commented 1 month ago

This closes #2518

The lack of built-in support for generic types of composite values such as lists and maps has been a long-standing issue for RDF and SPARQL. Together with a few other colleagues at the Amazon Neptune team we have developed an approach to represent lists and maps as literals in RDF data, and to extend SPARQL with features related to such literals. These extensions of SPARQL include:

We have created a complete formal specification of the approach and a comprehensive test suite for implementers, which can be found in our Github repo: https://github.com/awslabs/SPARQL-CDTs

... and I have implemented a complete integration of this approach into Jena, as can be found in this PR. We would like to contribute this implementation to Jena if you are interested, and I will be more than happy to assist you with getting the PR ready to be merged.

Perhaps before you dive into the aforementioned specification, you may take a look at our short paper, in which we provide a slightly more extensive motivation for this work and a (very!) brief summary of the approach. After that, Section 2 of the specification provides a more detailed informal description of the different parts of the approach.

I am happy to answer any questions that you may have, both about the approach in general and about the implementation in this PR. Also, if you have issues with some parts of the specification, feel free to create an issue in the aforementioned Github repo. (And in case you are wondering, yes we are planning to file the approach as a SPARQL Enhancement Proposal (SEP) for the SPARQL-dev Community Group).


By submitting this pull request, I acknowledge that I am making a contribution to the Apache Software Foundation under the terms and conditions of the Contributor's Agreement.


See the Apache Jena "Contributing" guide.

afs commented 1 month ago

This looks very interesting - I'm (non-chair opinion) keen to see pathways from new ideas to validation by real use.

This is a large PR. There'll need to be some process for contributing something of this significance. We'll need some paperwork (an ICLA from @hartig) and we need a clear position of the ownership of the code and IP. None of these things are particularly unique to this PR, its just that most PRs some

Jena 5.1.0 is approaching (mid-June ish) has a major new item - jena-ontapi (which incidentally we did some acceptance process for). This PR will take some time so 5.1.0 is unlikely but we can start on the work.

Thank you for the contribution!

SimonBin commented 1 month ago

just as a quick remark, our JenaX extension has contained RDF Array and Set functions very much similar to those suggested here, as well as functions for working with JSON and XML data, for a long time already. They do not require any changes to the core SPARQL syntax and you can already use them today by simply dropping the jenax-arq-plugins-bundle JAR file into your classpath.

Very short function summary: http://jsa.aksw.org/fn/ Source code: https://github.com/Scaseco/jenax

hartig commented 1 month ago

Thanks @afs !

Regarding the ICLA, I should mention that I have done this work in the context of my role at Amazon Neptune. So, you can understand this implementation as a contribution from Amazon. We have internal approval to contribute this code to Jena.

I see here that there is also a CCLA, for which I am not sure whether it applies in this case. I will get in touch with the open source coordinator of our team to see what he says. In any case (CCLA or not), I understand that I have to provide you with an ICLA regardless. I will take care of that once I am back in the office (I am traveling at the moment and will continue to do so for most of the next two weeks).

Also, I understand that this is a large PR. If it would be easier, I can also try to break it into a sequence of smaller PRs, one part at a time; for instance, a first part about reading and writing the literals, a second part that adds the functions (cdt:get, cdt:size, etc), a third one that adds the FOLD aggregate, etc. Let me know if you would prefer such a step-wise process.

namedgraph commented 1 month ago

Is new SPARQL syntax really required for this? IMO that is a major drawback. Couldn't extension functions suffice? As a GeoSPARQL-like approach? I think @SimonBin's work shows that it can be done.

hartig commented 1 month ago

@namedgraph Since your questions are about the approach in general, I think it is better to move this discussion somewhere else and keep the comments here specific to the Jena implementation of the approach as provided in this PR. Therefore, I have created an issue in our repo with the artifacts related to the approach--see: https://github.com/awslabs/SPARQL-CDTs/issues/7

hartig commented 1 month ago

@afs We have discussed internally whether we need an CCLA and came to the decision that this is not needed in this case. It is sufficient that there are commits that have me as author with my Amazon email address. So, an ICLA from me should be fine, which I will prepare when I am back in the office.

rvesse commented 1 month ago

Also, I understand that this is a large PR. If it would be easier, I can also try to break it into a sequence of smaller PRs, one part at a time; for instance, a first part about reading and writing the literals, a second part that adds the functions (cdt:get, cdt:size, etc), a third one that adds the FOLD aggregate, etc. Let me know if you would prefer such a step-wise process.

Breaking into smaller PRs might be useful and likely make it easier to review. However large features can arrive in a single PR (like the recent jena-ontapi PR) if that is the most appropriate way to contribute them. It doesn't seem like the individual pieces of functionality necessarily are valuable on their own without the other pieces which would suggest keeping this as a single PR (unless other committers disagree) for the time being.

In more practical terms generally we try and keep the commit history relatively succinct given the 20+ year history of the project so squashing the commits down in some form would be useful. Obviously with a contribution of this size squashing to a single commit would be excessive, but the current 180 commits is the opposite extreme. So squashing into some sensible number of commits (and I'll leave that to your judgement) where each commit represents some coherent chunk of development of this feature would be helpful.

hartig commented 1 month ago

[@rvesse] So squashing into some sensible number of commits (and I'll leave that to your judgement) where each commit represents some coherent chunk of development of this feature would be helpful.

I have never done such squashing before, so I may need a bit of help in terms of pointers to documentation of how I can do this. A complication may be that the commits are not in an order in which their sequence can be broken into subsequences that each focus on a specific part of the overall approach, because the different parts of the approach evolved simultaneously and I adapted the implementation as we went along with developing the approach.

Isn't it also an option, when merging this PR eventually, to merge it as a single commit?

afs commented 1 month ago

@afs We have discussed internally whether we need an CCLA and came to the decision that this is not needed in this case. It is sufficient that there are commits that have me as author with my Amazon email address. So, an ICLA from me should be fine, which I will prepare when I am back in the office.

A CCLA is not necessary.

This is a specific contribution so a Software Grant (SG) will cover the copyright and IP for this contribution. The SG covers only this specific contribution from AWS.

Any future changes, improvements and maintenance don't need a Software Grant every time. An SG is for substantive works.

With an SG and ICLA, how the contribution was created with AWS and the IP that resides in the contribution are covered.

ICLA:

Software Grant:

afs commented 1 month ago

While there are a lot of files, it looks to me like the majority are consequences of adding FOLD and UNFOLD as SPARQL extensions.

I've been though the description and the code (to some extent) and I don't a partition into natural commits (as @rvesse says things that operate on their own). So one commit.

We can squash as it is merged but then the commit is co-owned. It would be cleaner if the commit has your name on it.

I've pulled and built the PR so that looks good. The acceptance gateway includes whether "mvn clean install" completes. That builds Jena - it doen't take too long on a laptop.

OR There are also github action in the jena codebase and in the clone you have: https://github.com/hartig/jena/actions/workflows/maven.yml

I had to fix one thing the build: jena-arq/src/main/java/org/apache/jena/riot/system/CDTAwareParserProfile.java

which does not have the license header. The build runs the "RAT" - Release Audit Tool - catches this in the maven build.

"This branch cannot be rebased due to conflicts" - the PR at the moment isn't in line with the jena project main branch.

If you pull the Apache main branch into the main of your repo, then do git rebase main it will stop for conflicts to be resolved. I tried and the conflicts I looked at were mostly formatting when Jena main had been tidied up a bit.

afs commented 1 month ago

Is new SPARQL syntax really required for this? IMO that is a major drawback. Couldn't extension functions suffice? As a GeoSPARQL-like approach? I think @SimonBin's work shows that it can be done.

While it may be possible within existing syntax, this area is fundamental and IMO would benefit with syntax. It is not a call-out extension like text query, geo or LLM's, though extensions call-out needs something more agreed than property functions or use of SERVICE. The SPARQL standard needs a multi-bind. At the moment, the optimizer needs to make them special cases.

We do have to be clear that this syntax proposal, and the future SEP, are evolving (as is LATERAL) and subject to change.

afs commented 1 month ago

@hartig - Could you please create an issue at https://github.com/apache/jena/issues for this.

I'd prefer that it is the contributor who creates the issue. We can edit descriptions afterwards. Either minimal content or the content can be taken from the description above mins the "..." and the "---" and after.

Thanks.

SimonBin commented 1 month ago

It is not an a call-out extension like text query, geo

Precisely because the GeoSPARQL 1.1/1.2 aggregate functions would also greatly benefit from being able to ORDER BY and UNFOLD in certain ways, I believe we should take some extra thought while designing the final version of this SEP

afs commented 1 month ago

It is not an a call-out extension like text query, geo

Precisely because the GeoSPARQL 1.1/1.2 aggregate functions would also greatly benefit from being able to ORDER BY and UNFOLD in certain ways, I believe we should take some extra thought while designing the final version of this SEP

Yes - the general case of SPARQL would benefit.

That was a response to "Is new SPARQL syntax really required for this? IMO that is a major drawback. Couldn't extension functions suffice?".

Property functions aren't very intuitive as the arguments become more complex (c.f. the alternative of use of SERVICE)

I would suggest that a proposal be developed - which can be a Jena change if having a concrete implementation that community can engage with. We just have to be clear what is experimental/subject to change and what is stable.

hartig commented 1 month ago

[@SimonBin] Precisely because the GeoSPARQL 1.1/1.2 aggregate functions would also greatly benefit from being able to ORDER BY and UNFOLD in certain ways, I believe we should take some extra thought while designing the final version of this SEP

[@afs] I would suggest that a proposal be developed - which can be a Jena change if having a concrete implementation that community can engage with. We just have to be clear what is experimental/subject to change and what is stable.

Regarding ORDER BY for aggregate functions, I think everything needed for such a proposal is already covered by our SPARQL CDTs approach:

hartig commented 1 month ago

[@afs] Could you please create an issue at https://github.com/apache/jena/issues for this.

Done: #2518

[@afs] While there are a lot of files, it looks to me like the majority are consequences of adding FOLD and UNFOLD as SPARQL extensions.

Yes indeed, and then there are also several new files for the functions (cdt:get, cdt:size, etc) as well as for the new datatypes and literals itself (incl. the parser of their lexical forms).

[@afs] The acceptance gateway includes whether "mvn clean install" completes. [...] I had to fix one thing the build: jena-arq/src/main/java/org/apache/jena/riot/system/CDTAwareParserProfile.java

which does not have the license header. The build runs the "RAT" - Release Audit Tool - catches this in the maven build.

I have added the license header (commit https://github.com/apache/jena/pull/2501/commits/6de9f470bdf29708085f42452a8e6b01752190fa).

With this change, mvn clean install completes on my laptop without problems.

[@afs] "This branch cannot be rebased due to conflicts" - the PR at the moment isn't in line with the jena project main branch.

I have synced the main branch of my fork, did git rebase master in my feature branch, and going through the conflicts now -- commit by commit. I have been on it for two hours now. It seems there are conflicts for almost all of my commits. I will let you know when I am done.

afs commented 1 month ago

Done: https://github.com/apache/jena/issues/2518

Thanks - I added an automatic close-issue into the description of this PR.

hartig commented 1 month ago

@afs I am done with rebasing now.

mvn clean install still runs without problems, and I can also run our SPARQL CDTs test suite.

afs commented 1 month ago

@afs I am done with rebasing now.

Excellent.

If I set to "Rebase and Merge" there is something showing. 148 commits in (commit 9cc5647c3c).

Part of the conflict is

both added:      jena-arq/src/main/java/org/apache/jena/sparql/syntax/ElementUnfold.java

which suggests both parts are for this PR. Several I looked at are the same update, very similar via two different routes.

The other github merge strategies do allow merge although it does not guarantee git will pick the right option.

It doesn't like a "semantic" (!!) problem -- I'll investigate further.

afs commented 1 month ago

The rebasing has made it a clear which files have changed and that makes reviewing easier - thanks.

I haven't managed to do a clean "git rebase" against the current main (which is the "rebase and merge" github option) which isn't that surprising. The rebased current PR seem to have conflicts all seem to be not between main and the PR but within the PR's history - which I can't explain.

However, the rebasing on the PR has also made that unnecessary.

When we're ready to merge, simply take diff of the current the PR (the URL is https://github.com/apache/jena/pull/2501.diff), start a new branch in your repo git apply the patch and make a new PR - single commit, with you as author.

Now, thinking about the contribution --

There are two considerations:

There is a realistic chance that SPARQL-CDTs will evolve and that may be in incompatible ways. By being clear this is "experimental", I don't think that is a problem if not using the features means no impact. I (non-PMC chair opinion) am keen to see new tech being available to users, LATERAL being a recent example. (Disclosure: Olaf and I are both involved in RDF 1.2 where issues of evolution are important.)

Performance:

There is one place that needs investigation that I have found so far which is the mixes this - CDTAwareParserProfile is on the critical path for all parsing whether CDT's are used or not.

The subclass quickly tests whether to invoke the superclass but parsing is sensitive to the JIT ptimizer. Parsing runs at about one/two microseconds per triple which isn't that many instructions and the new may change what optimizations the JIT performs.

It may well make no measurable difference because the optimizer nowadays has ways to implement virtual method calls as direct function calls. The only way to find out is to try.

Functionality:

While CDTs are experimental, and likely to change, Jena needs a way to test teh functionality it ships and it does that by running test suites every build. An Apache project is more than open source - the code needs to be able to build. Jena releases the repository source tree - the binary artifacts ("convenience binaries") are secondary.

For the stability, PRs needs tests and there aren't any in this PR. I see that there are AWSlabs SPARQL-CDTs manifest-driven tests (Apache Licensed). One possibility is adding the current state of them to the PR (if they pass). To be clear -- it's a copy, nothing more; AWSlabs retains the defining material and manages its evolution. The W3C RDF/SPARQL tests in th ejena repo are the same status - local copies.

Technical points: (this is for later - not required for accepting this PR).

1: It may be better to do CDT's as StreamRDF processing rather than as a parse profile, if the blank node mapping, and while we're at it, the prefix mapping, can be made available to the stream which they aren't at the moment.

2: When thinking about persistent storage, the bnode node datastructures may need to a have a longer lifetime that the parser run including persistent storage. "Test to create a cdt:List with blank nodes via STRDT" for example.

SimonBin commented 3 weeks ago

I have some high-level comments

I am particularly curious about the changes to the SPARQL standard without providing a generic solution. Broadly speaking, the CDT proposal suggests two new keyword FOLD and UNFOLD, whereas the first is an aggregate function operator with the additional support for ORDER BY, and the latter is a custom operator which seems to me like a special case implementation of multi-binding BIND (https://github.com/w3c/sparql-dev/issues/6) for cdt:Lists.

However, we could use both kinds of types for many purposes. I want to list some

ORDER BY in aggregate functions:

generic UNFOLD:

further links:

Appendix

comparison of JenaX syntax vs. CDT

CDT JenaX
SELECT ?name (FOLD(?person) AS ?list) SELECT ?name (array:collect(?person ) AS ?list)
UNFOLD (?list AS ?element) ?list array:unnest ?element
BIND(cdt:get(?map, "address") AS ?addressSubMap) BIND(json:get(?map, "address") AS ?addressSubMap)
BIND(cdt:put(?addressSubMap, "number", 42) AS ?newAddressSubMap) BIND(json:set(?addressSubMap, "number", 42) AS ?newAddressSubMap)
"[ '1999-08-16'^^<http://www.w3.org/2001/XMLSchema#date>, 42 ]"^^cdt:List "'1999-08-16'^^<http://www.w3.org/2001/XMLSchema#date> 42"^^norse:array

Further ideas present in JenaX:

hartig commented 3 weeks ago

[@SimonBin] I have some high-level comments

Thanks for the pointer. I knew that Stardog had something similar, but I wasn't aware of this blog post. Looking at it now, I see some relevant differences: The Stardog approach changes the notion of a solution mapping by extending the co-domain of these mappings with two new kinds of elements (namely, solution mappings themselves, which makes the notion recursive, and some kind of array). While I cannot find any formal specification of this approach, I can see that extending the definition of solution mapping in the SPARQL spec in this way would have a wide-ranging impact on most of the formal parts of the spec (the definition of every operator that processes solution mappings would need to be adapted). In contrast, adding a new operator (UNFOLD) and a new aggregation function (FOLD), as done by our approach, would be a comparably modest change/extension to the SPARQL spec (and we already provide all relevant parts of this extension explicitly in our SPARQL CDTs spec, ready to be copied directly into the SPARQL spec; rather than having "only" an implementation of an idea described informally in a blog post). Moreover, the relationship between Stardog's approach (which seems to be only about SPARQL) and the RDF data model is also not clear to me from the blog post (e.g., what happens if a solution mapping that contains such an array or a nested solution mapping is passed to a CONSTRUCT pattern?), whereas our approach is simply based on RDF literals and, thus, is clearly within the realm of RDF as is.

  • why do we need cdt:List and cdt:Map when the syntax already decides the object type (cf. in JenaX we have a single JSON data type)

Some vendors may choose to support only one or the other. Moreover, a potential future addition may be a cdt:SortedMap datatype which would likely have the same lexical space as cdt:Map, or a cdt:Set (as you mention JenaX has).

I am particularly curious about the changes to the SPARQL standard without providing a generic solution. Broadly speaking, the CDT proposal suggests two new keyword FOLD and UNFOLD, whereas the first is an aggregate function operator with the additional support for ORDER BY, and the latter is a custom operator which seems to me like a special case implementation of multi-binding BIND (https://github.com/w3c/sparql-dev/issues/6) for cdt:Lists.

However, we could use both kinds of types for many purposes. I want to list some

ORDER BY in aggregate functions: [...]

Our spec contains all the relevant parts to define ORDER BY as a generic addition to be used also by other aggregation functions, and the code in this PR has everything needed to implement such a generic ORDER BY for arbitrary aggregation functions. For details, see my earlier comment above.

generic UNFOLD: [...]

While, also for this one, our spec contains some spec text relevant for defining a generic "multi-BIND" (à la https://github.com/w3c/sparql-dev/issues/6), in contrast to the ORDER BY case, doing so is a bit more complicated than directly copying over this spec text into the SPARQL spec. Yet, in the long run, I can certainly see the benefit of extending SPARQL with such a generic "multi-BIND" and, then, defining our UNFOLD as an application of the generic definition.

hartig commented 3 weeks ago

@afs apologies for the delay; after I came back from traveling, my university responsibilities required all my time.

I haven't managed to do a clean "git rebase" against the current main (which is the "rebase and merge" github option) which isn't that surprising. The rebased current PR seem to have conflicts all seem to be not between main and the PR but within the PR's history - which I can't explain.

The reason might be that some of the commits within the PR may be conflicting with one another. Right before creating the PR, I already rebased to the then-latest main and realized that some parts of the internal API of Jena had changed quite a bit (e.g., LiteralLabel is a final class now but was an interface in the version that I forked and extended), which meant that I had to change parts of my implementation.

When we're ready to merge, simply take diff of the current the PR (the URL is https://github.com/apache/jena/pull/2501.diff), start a new branch in your repo git apply the patch and make a new PR - single commit, with you as author.

Perfect.

Performance:

There is one place that needs investigation that I have found so far which is the mixes this - CDTAwareParserProfile is on the critical path for all parsing whether CDT's are used or not.

The subclass quickly tests whether to invoke the superclass but parsing is sensitive to the JIT ptimizer. Parsing runs at about one/two microseconds per triple which isn't that many instructions and the new may change what optimizations the JIT performs.

It may well make no measurable difference because the optimizer nowadays has ways to implement virtual method calls as direct function calls. The only way to find out is to try.

Would it be an option to move the code of CDTAwareParserProfile into ParserProfileStd and, then, remove CDTAwareParserProfile altogether?

Another option may be to enable users to switch off support for CDT literals (e.g., via an argument in ModLangParse, similar to --strict). If switched off, RiotLib would create a ParserProfileStd instance instead of a CDTAwareParserProfile instance.

Yet another option may also be to combine both of these ideas. What do you think?

Functionality:

[...] Jena needs a way to test teh functionality it ships and it does that by running test suites every build. [...]

For the stability, PRs needs tests and there aren't any in this PR. I see that there are AWSlabs SPARQL-CDTs manifest-driven tests (Apache Licensed). One possibility is adding the current state of them to the PR (if they pass). To be clear -- it's a copy, nothing more; AWSlabs retains the defining material and manages its evolution. [...]

Of course. I can copy these tests.

I assume they would go into a new subdirectory under ./jena-arq/testing/, right?

After I have created the copy, where do I have to add a pointer to them such that they are run automatically during a build?

Currently, I run these tests manually using arq.rdftests. They all pass, except for four tests.

These four tests are list-functions/sameterm-03.rq, list-functions/sameterm-04.rq, map-functions/sameterm-03.rq, and map-functions/sameterm-04.rq. All of them check the behavior of SAMETERM. They did pass in my original implementation, but after I had to reimplement some things because the internal API of Jena had changed (the aforementioned change of LiteralLabel, in particular), they fail. I think it would be possible to make them pass again, but that would require adding a new constructor to LiteralLabel---namely, a constructor via which it is possible to provide both the lexical form and the value. Let me know if you want me to push a corresponding commit to this PR in order to demonstrate what exactly I mean, and if you don't like it, I can revert this commit. Alternatively, I may simply comment out the four failing tests for now.

hartig commented 1 week ago

@afs I have now copied the manifest-driven tests from the SPARQL CDTs repo into this PR and integrated them to run automatically as unit tests (see commit https://github.com/apache/jena/pull/2501/commits/25494354c61e2999747b297b99d86e8bfca1d5cd). Of course, that increases the number of files in the PR even more ;-)

Additionally, related to the very last paragraph of my previous comment, I have changed the implementation such that the SAMETERM tests don't fail (see commit https://github.com/apache/jena/pull/2501/commits/9cb08f1f4529c170680d5235c35ac8bf442f498e). To this end, I had to add a new constructor to LiteralLabel and a corresponding create-function to LiteralLabelFactory. Please take a look and let me know whether you consider these changes appropriate.

(I am still waiting for our legal folks to advise me on how to proceed with creating the Software Grant Agreement.)