SolidLabResearch / Challenges


Decentralised Reasoning with large knowledge bases #14

Open pbonte opened 2 years ago

pbonte commented 2 years ago

Pitch

Users can have links to large knowledge bases within their pod. However, when querying under a certain entailment regime, i.e. with reasoning enabled, the whole large knowledge base might need to be inspected.

For example:

Desired solution

To enable querying under entailment when data links to a large knowledge base, such as DBpedia, the query engine and reasoning module would need to be capable of:

Acceptance criteria

To be accepted, a demo that demonstrates a solution to this problem would require the following:

Pointers

Insights on the differences between forward and backward chaining: https://github.com/pbonte/ReasoningAssignment/blob/master/reasoning_assignment.pdf
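For a concrete feel for the difference the linked assignment discusses, here is a minimal, self-contained sketch (toy facts and RDFS-style rules, not actual Comunica code): forward chaining materialises every consequence up front, while backward chaining proves only the requested goal on demand.

```python
# Toy triples illustrating RDFS subclass reasoning. All names are hypothetical.
facts = {
    ("alice", "type", "Student"),
    ("Student", "subClassOf", "Person"),
    ("Person", "subClassOf", "Agent"),
}

def forward_chain(facts):
    """Materialise all consequences up front (compute the fixpoint)."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        new = set()
        for (s, p, o) in derived:
            for (c, p2, d) in derived:
                if p2 != "subClassOf" or c != o:
                    continue
                if p == "type":
                    # rdfs9: (s type C), (C subClassOf D) => (s type D)
                    new.add((s, "type", d))
                elif p == "subClassOf":
                    # rdfs11: subClassOf is transitive
                    new.add((s, "subClassOf", d))
        if not new <= derived:
            derived |= new
            changed = True
    return derived

def backward_prove(goal, facts):
    """Prove a single (s, type, C) goal on demand, touching only the
    subclass chain relevant to that goal. Assumes no subClassOf cycles."""
    s, p, c = goal
    if goal in facts:
        return True
    if p == "type":
        for (sub, p2, sup) in facts:
            if p2 == "subClassOf" and sup == c:
                if backward_prove((s, "type", sub), facts):
                    return True
    return False

print(("alice", "type", "Agent") in forward_chain(facts))      # -> True
print(backward_prove(("alice", "type", "Agent"), facts))       # -> True
```

Forward chaining pays the full materialisation cost once, which is exactly what becomes problematic at DBpedia scale; backward chaining only inspects the triples relevant to the goal, at the cost of repeated work per query.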

rubensworks commented 2 years ago

Also cc'ing @jeswr so he's aware of this.

RubenVerborgh commented 2 years ago

@pbonte Thanks for submitting this one!

Quick question: the intro says:

Users can have links to large knowledge bases within their pod. However, when querying under a certain entailment regime, i.e. with reasoning enabled, the whole large knowledge base might need to be inspected.

However, from the way it is phrased, it is not clear to me whether the inspection of the whole knowledge base is a given or an issue to be fixed, and if so, whether the amount of data being transferred (or the total time?) is a quality attribute. Could you adjust the description to clarify this? Thanks!

RubenVerborgh commented 2 years ago

@pbonte Also, can we apply this (currently generic) challenge to a specific use case, such that the demo becomes a very specific and concrete target to reach? We could either adjust the current challenge text, or create a new issue that is basically an applied version of this bigger challenge, and link back to here?

jeswr commented 2 years ago

I'm planning on creating a small (perhaps dummy) demo app to show the principle of my honours work as part of a mid-term presentation I have to give next week (Tuesday 1/3/22) - so happy to try and align this with a relevant use case here if I can.

A tentative plan (off the top of my head) was to do some kind of inference to materialize facts about diseases individuals may be predisposed to, with reasoning used to:

And federating this with shared information about your grandfather to establish whether or not you are likely to be bald.

Of course, this could then be similarly applied to many other conditions.

PS. I'm far from a medical domain expert and I'm absolutely butchering the data modelling here - this is just designed to be a proof of concept :).

jeswr commented 2 years ago

Something else that could be exploited when reasoning against such databases is that they usually already have some form of reasoning applied to them. @pbonte, do you know if there is any research so far into making use of this fact to reduce the amount of reasoning that needs to be done on the application side?

In the context of the work I am doing with Comunica for my honours, I was thinking of having a context annotation for each data source indicating which types of reasoning have already been applied to it, with the long-term thought being that data sources should provide metadata about any form of 'pre-reasoning' that has been done on them.
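As a rough illustration of that idea, here is a hypothetical sketch of a client deciding whether it still needs to reason over a source, given advertised 'pre-reasoning' metadata. The regime names, the subsumption table, and the metadata shape are all assumptions for illustration, not an existing Comunica API or spec:

```python
# Hypothetical: which entailment regimes are fully covered by a
# materialisation under a given (stronger or equal) regime.
SUBSUMES = {
    "owl2rl": {"rdfs", "owl2rl"},  # an OWL 2 RL materialisation covers RDFS
    "rdfs": {"rdfs"},
}

# Hypothetical per-source metadata: which regimes were pre-applied.
sources = [
    {"url": "https://dbpedia.org/sparql", "materialised": ["rdfs"]},
    {"url": "https://example.org/pod/data", "materialised": []},
]

def needs_client_reasoning(source, client_regime):
    """True if the client must still apply its own regime to this source,
    i.e. no advertised pre-applied regime subsumes the client's regime."""
    return not any(
        client_regime in SUBSUMES.get(applied, set())
        for applied in source["materialised"]
    )

print(needs_client_reasoning(sources[0], "rdfs"))    # -> False (covered)
print(needs_client_reasoning(sources[0], "owl2rl"))  # -> True
print(needs_client_reasoning(sources[1], "rdfs"))    # -> True (no metadata)
```

The interesting design question is exactly the one raised below: the check only works if the relationship between the source's logic and the client's logic is known and the metadata can be trusted to be up to date.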

Edit: some vaguely related works

pbonte commented 2 years ago

This would indeed be very interesting and would allow one to save a lot of computational effort. However, I think it's risky to assume that (decentralised) knowledge bases would maintain up-to-date metadata about their materialisation status. I can imagine that not all knowledge bases would incrementally update their materialisation when new facts are added, thus requiring the metadata to also maintain a list of unprocessed facts. Furthermore, this would only make sense when the same logic is used. Let's say the remote knowledge base uses RDFS, while the client uses a more expressive rule-based language (N3/SWRL/OWL2 RL); then it's not completely clear how we can reuse the RDFS materialisation as intermediate results. Perhaps future research can show us.

If, however, knowledge bases do maintain this metadata, and the logics can somehow be aligned, then yes, this would be very interesting! Off the top of my head, I think it would mean the following:

RubenVerborgh commented 2 years ago

I think it's risky to assume that (decentralised) knowledge bases would maintain up-to-date metadata about their materialisation status

Yes, but also see it the other way: we are in the position to make recommendations and specs.

So your IF can be enforced if that's a recommendation we can argue 😃

jeswr commented 2 years ago

Furthermore, this would only make sense when the same logic is used. Let's say the remote knowledge base uses RDFS, while the client uses a more expressive rule-based language (N3/SWRL/OWL2 RL); then it's not completely clear how we can reuse the RDFS materialisation as intermediate results. Perhaps future research can show us.

One step may be to create a document that defines the relationships between various rule languages, and likewise the relationships between rule sets that can be defined within those languages. The former is more useful from a technical perspective to achieve things like https://github.com/comunica/comunica-feature-reasoning/issues/22, whilst the latter can be used to determine whether the remote source has materialised all of the implicit facts that would have been materialised by the rule set being used for the federated reasoning.

In the case you mention, where only a subset of the implicit data has been produced by the remote source, we could perhaps extend this idea to reuse the intermediate results by identifying the diff of which rules haven't been applied to the remote source, and then only apply those in step 1 of the naive algorithm mentioned in https://github.com/comunica/comunica-feature-reasoning/issues/23.
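The rule-diff step could be sketched as a plain set difference over rule identifiers. The identifiers and the metadata shape below are made up for illustration, not taken from the issues linked above:

```python
# Hypothetical sketch of the "rule diff": run against the remote source only
# those rules it has not already materialised.

client_rules = {"rdfs2", "rdfs3", "rdfs7", "rdfs9", "rdfs11"}

def rules_to_apply(client_rules, remote_metadata):
    """Rules that step 1 of the naive algorithm still has to run remotely."""
    return client_rules - set(remote_metadata.get("materialisedRules", []))

remote_metadata = {"materialisedRules": ["rdfs9", "rdfs11"]}
print(sorted(rules_to_apply(client_rules, remote_metadata)))
# -> ['rdfs2', 'rdfs3', 'rdfs7']

# Note: this only handles missing rules; facts the remote derived with rules
# *outside* client_rules would not be detected by a set difference like this.
```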

Of course, this still raises the question of how to handle implicit data within the remote source that would not have been produced by the rule set you are using; i.e. do we need to remove anything from our results?

pheyvaer commented 2 years ago

@pbonte Why do the acceptance criteria mention the need for a UI to show that there is a link with, for example, DBpedia? Isn't it enough to show that a resource in the pod refers to DBpedia, for example by inspecting that resource?

pbonte commented 2 years ago

@pheyvaer Because the goal of the challenges is to have a number of demonstrators. It's just a way to show the content, nothing fancy.

pheyvaer commented 2 years ago

Makes sense! Maybe we could put that as a separate challenge? Have a UI that shows data with links to external data sources? It could then maybe be reused by others for their demos as well.

pbonte commented 2 years ago

Good idea! Maybe even something more generic: a visualisation of the pod content (both internal and external data) that can be reused across demos? I think it might also be useful to have some tooling to show the partitioning of the data across pods, so we can show that each user keeps ownership of their own data.

pheyvaer commented 2 years ago

Sounds good! Do you want to put it in a separate issue/challenge?

pbonte commented 2 years ago

Not sure if it counts as a challenge (as these are described as: A concrete technical problem applied to a specific use case)?

pheyvaer commented 2 years ago

Yeah, it counts as a challenge 😉

pheyvaer commented 2 years ago

@pbonte Can you create the extra challenges as mentioned earlier?