Decentralised Reasoning with large knowledge bases

pbonte commented 2 years ago

Pitch

Users can have links to large knowledge bases within their pod. However, when performing querying under a certain entailment regime, i.e. with reasoning enabled, the whole large knowledge base might need to be inspected.

For example:

A pod can contain links to DBPedia, for example :me foaf:knows <https://dbpedia.org/page/Tim_Berners-Lee>
In our ontology we might define knowledge across our own data and dbpedia data, for example :SolidEnthausiast(?x) <- dbo:wikiPageWikiLink(?x,?y) and SolidProject(?y) and SolidProject(?x)<- dbo:Software(?x) and rdf:label(?x,'solid')
When querying for all SolidEnthausiast DBPedia needs to be checked and the above rules need to be executed.
The execution of these rules using forward chaining would require the whole DBPedia knowledge base to be materialized, i.e. every concept in DBPedia needs to be reasoned upon, which is not feasible.
Backward chaining would be an option here, if the logic and rules allow it, however, the engine would need to know when to perform forward/backward chaining

Desired solution

To enable query under entailment when data links to a large knowledge bases, such as DBPedia, the query engine and reasoning module would need to be capable of:

determine if backward chaining is possible based on the defined domain knowledge and used logic
based on the dataset and query behaviour decide if backward or forward chaining is the best suited method for evaluation of the inference rules
include some stopping criteria when backward chaining does not halt.
Decide between different backward chaining options, e.g. include query rewriting (when there is a remote sparql endpoint) as fetching data from a remote data source might be very expensive. With query rewriting, multiple possible path can be investigated, however, rewriting the whole query might result in overloading the remote data source, as rewritten queries tend to become extremely complex.

Acceptance criteria

A demo that displays a solution to this problem would require the following to be accepted:

a UI that shows a pod that describes data that is linked to a large remote knowledge bases, such as dbpedia
defined domain knowledge that crosses both the local data in the pod and the data in the remote knowledge base
a query on the pod that requires the domain knowledge and the remote knowledge base to be investigated
an automated explanation which steps the reasoner took and why, e.g. an extension of the proof detailing why forward/backward chaining was used
Optionally: another query the is faster with forward chaining, so we can see that the reasoner can decide which procedure to follow.

Pointers

Insights on the differences between forward and backward chaining: https://github.com/pbonte/ReasoningAssignment/blob/master/reasoning_assignment.pdf

rubensworks commented 2 years ago

Also cc'ing @jeswr so he's aware of this.

RubenVerborgh commented 2 years ago

@pbonte Thanks for submitting this one!

Quick question—the intro says:

Users can have links to large knowledge bases within their pod. However, when performing querying under a certain entailment regime, i.e. with reasoning enabled, the whole large knowledge base might need to be inspected.

However, from the way it is phrased to me, it is not clear whether the inspection of the whole knowledge base is a given or an issue to be fixed, and if so, if the amount of data being transferred (or the total time?) is a quality attribute. Could you adjust the description to clarify this? Thanks!

RubenVerborgh commented 2 years ago

@pbonte Also, can we apply this (currently generic) challenge to a specific use case, such that the demo becomes a very specific and concrete target to reach? We could either adjust the current challenge text, or create a new issue that is basically an applied version of this bigger challenge, and link back to here?

jeswr commented 2 years ago

I'm planning on creating a small (perhaps dummy) demo app to show the principle of my honours work as part of a mid-term presentation I have to give next week (Tuesday 1/3/22) - so happy to try and align this with a relevant use case here if I can.

A tentative plan (off the top of my head) was to do some kind of inference to materalize facts about diseases individuals may be predisposed to - with reasoning used to:

'Normalize' predicates describing the relationships between people (e.g. parent + male -> father)
Infer conditions your relatives/ancestors have based on any records they have shared with you.
Federate with a larger medical database that defines the relationships in genetic conditions (e.g. the database specifies that if your grandfather is bald then you are bald) to establish whether you are predisposed to certain conditions based on whether your relatives have them.

So in this case you are interested in using just one fact of many from a (potentially large) genetic database, being:
```
ex:Baldness ex:predisposedIfAppearingIn ex:grandFather
```

And federating this with shared information about your grandFather to establish whether or not you are likely to be bald.

Of course this could be then similarly applied to many other conditions.

PS. I'm far from a medical domain expert and I'm absolutely butchering the data modelling here - this is just designed to be a proof of concept :).

jeswr commented 2 years ago

Something else that could be exploited when reasoning against such databases is that they usually already have some form of reasoning applied to them. @pbonte do you know if there is any research so far into making use of this fact so as to reduce the amount of reasoning that needs to be done on the application side.

In the context of the work that I am doing with Comunica for my honours - I was thinking of having a context annotation for each data source which indicated which types of reasoning had already been applied to them - with the thought in the long term being that data sources should provide metadata about any form of 'pre-reasoning' that has been done on them.

edit Some vaguely related works

Eﬀicient reasoning on large-scale heterogeneous data; I've sent an email to the author to see if they have any thoughts on this particular case
A Survey of Large-Scale Reasoning on the Web of Data

pbonte commented 2 years ago

This would indeed be very interesting and allow one to save a lot of computational efforts, however, I think it's risky to make the assumption that (decentralized) knowledge bases would maintain up to date metadata about their materialisation status. I can imagine that not all knowledge bases would incrementally update their materialisation when some new facts are added, thus requiring the metadata to also maintain a list of unprocessed facts. Furthermore, this would only make sense when the same logic is used. Lets say the remote knowledge base uses RDFS, while the client uses some more expressive rule bases language (N3/SWRL/OWL2 RL), then its not completely clear how we can reuse the RDFS materialization as intermediate results. Perhaps future research can show us.

If however, knowledge bases do maintain this metadata, and the logics can somehow be aligned, then yes, this would be very interesting! On top of my head i think it would mean the following:

using forward chaining, one would only need to add the facts that directly link to the remote knowledge bases and apply the materialization on these new facts. They will use the materialized view on the remote knowledge base to see if any rules match. This is very similar to incremental reasoning, where adding is cheap and removal is expensive, however, here we are only adding so that is fine.
using backward chaining, I think after a certain depth, we dont need to apply any rules anymore, as they already have been deducted. This thus allows to stop the backward chaining process much faster.

RubenVerborgh commented 2 years ago

I think it's risky to make the assumption that (decentralized) knowledge bases would maintain up to date metadata about their materialisation status

Yes, but also see it the other way: we are in the position to make recommendations and specs.

So your IF can be enforced if that's a recommendation we can argue 😃

jeswr commented 2 years ago

Furthermore, this would only make sense when the same logic is used. Lets say the remote knowledge base uses RDFS, while the client uses some more expressive rule bases language (N3/SWRL/OWL2 RL), then its not completely clear how we can reuse the RDFS materialization as intermediate results. Perhaps future research can show us.

One step may be to create a document that defines the relationships between various rule languages, and respectively the relationships between rule sets that can be defined within those languages. The former being more useful from a technical perspective to achieve things like https://github.com/comunica/comunica-feature-reasoning/issues/22 whilst the latter can be used to inform whether the remote source has materialised all of the implicit facts that would have been materialized by the rule set being used for the federated reasoning.

In the case that you mention where only a subset of the implicit data has been produced by the remote source; we could perhaps extend this idea to re-use the intermediate results by identifying the diff of which rules haven't been applied to the remote source and only apply those to the remote source in step 1 of the naive algorithm mentioned in https://github.com/comunica/comunica-feature-reasoning/issues/23.

Of course this still begs the question of how to handle implicit data within the remote source that would not have been produced by the rule set you are using; i.e. do we need to remove anything from our results?