Integrate `pgmpy` for Bayesian networks capabilities

ceteri commented 3 years ago

Integrate pgmpy for statistical inference in Bayesian networks.

Depends on: #26

Ankush-Chander commented 3 years ago

Hey @ceteri,

I need some pointers to understand this requirement better.

Thanks in advance.

ceteri commented 3 years ago

Thank you @Ankush-Chander! Here's an idea, if this seems reasonable as an approach?

There are several kinds of modeling, sampling, and inference implemented by pgmpy, although our shortest path is probably to focus on discrete Bayesian networks? This is also one of the top-requested features to add to kglab from our ongoing survey.

Next steps are:

1. Build an example discrete Bayesian model in pgmpy which produces known results, which we can use to verify the integration later
   - for example, using one of the examples given in their documentation
   - or, ideally, based on data in the recipe progressive example that we use
2. Represent the data from this model in an RDF graph
3. Develop a new class method for kglab.KnowledgeGraph, or probably even better for kglab.Subgraph, that loads the pgmpy model data from the KG
4. Verify results from above, to use as a unit test
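
As a rough illustration of the "known results" idea in the first step, here is a minimal sketch that computes a posterior by brute-force enumeration in plain Python, without pgmpy. The network (Rain -> WetGrass <- Sprinkler), the probabilities, and all names are hypothetical, chosen only for illustration. A value computed this way could serve as an independent baseline that the pgmpy model should reproduce.

```python
from itertools import product

# Hypothetical toy network (not from this thread): Rain -> WetGrass <- Sprinkler.
# All probabilities are made-up illustration values.
P_RAIN = {True: 0.2, False: 0.8}
P_SPRINKLER = {True: 0.1, False: 0.9}

# P(WetGrass=True | Rain, Sprinkler), keyed by (rain, sprinkler)
P_WET_TRUE = {
    (True, True): 0.99,
    (True, False): 0.9,
    (False, True): 0.8,
    (False, False): 0.0,
}


def joint(rain, sprinkler, wet):
    """Joint probability P(Rain, Sprinkler, WetGrass) from the chain rule."""
    p_wet = P_WET_TRUE[(rain, sprinkler)]
    p_wet_value = p_wet if wet else 1.0 - p_wet
    return P_RAIN[rain] * P_SPRINKLER[sprinkler] * p_wet_value


def posterior_rain_given_wet(wet=True):
    """P(Rain=True | WetGrass=wet), summing the joint over Sprinkler."""
    numerator = sum(joint(True, s, wet) for s in (True, False))
    evidence = sum(
        joint(r, s, wet) for r, s in product((True, False), repeat=2)
    )
    return numerator / evidence


if __name__ == "__main__":
    print(round(posterior_rain_given_wet(True), 4))
```

Enumeration only scales to tiny networks, but that is enough for a unit-test baseline: the same query run through pgmpy should agree with this number to within floating-point tolerance.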

We can also decide whether to have some additional wrappers for pgmpy and its results. On the one hand, it's great to wrap results into pandas dataframes and other conveniences for data science workflows. On the other hand, it's probably better to allow people to simply use pgmpy operations on the model directly. The latter approach is how we've handled integration of PyTorch, PyVis, etc., i.e., not adding an intermediate layer unless there are pain points that need to be corrected (as in SPARQL queries).

How does that sound as an approach?

Ankush-Chander commented 3 years ago

Hey @ceteri

I tried to follow the above trail, but I was not able to find any widely accepted standard RDF representation of Bayesian networks. I will need your help with that.

Once we pinpoint that, we can provide users a pathway to move from a standard Bayesian network RDF file to a KG to a pgmpy model. The rest of the operations can be done directly using pgmpy endpoints.

Thanks

ceteri commented 3 years ago

Hi @Ankush-Chander, good point! The way I described it above, moving from RDF => pgmpy wouldn't work directly, and there's no standard representation.

What I should have described better:

1. Choose a simple example Bayesian network problem
2. Build a solution for it in pgmpy, so we have a known baseline to test against
3. At that point, I'll represent in RDF (as idiomatic as possible; this becomes simpler after RDF-star is available)
4. Then we can scope how best to use the Subgraph classes to transform into pgmpy
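
To make the last two steps a bit more concrete, here is a minimal sketch of the kind of transformation involved, in plain Python with triples as string tuples rather than rdflib or kglab objects. The `bn:` vocabulary, the `ex:` terms, and the helper names are all hypothetical, since no standard RDF representation exists. The output is an edge list of (parent, child) pairs, assuming that is a convenient shape to hand to a network constructor.

```python
# Hypothetical vocabulary: the "bn:" predicates and "ex:" terms are
# illustrative only; no standard RDF representation is assumed here.
TRIPLES = [
    ("ex:Rain", "rdf:type", "bn:Variable"),
    ("ex:Sprinkler", "rdf:type", "bn:Variable"),
    ("ex:WetGrass", "rdf:type", "bn:Variable"),
    ("ex:WetGrass", "bn:hasParent", "ex:Rain"),
    ("ex:WetGrass", "bn:hasParent", "ex:Sprinkler"),
]


def local_name(term):
    """Strip the prefix from a prefixed name, e.g. 'ex:Rain' -> 'Rain'."""
    return term.split(":", 1)[1]


def to_nodes(triples):
    """Collect every subject typed as a (hypothetical) bn:Variable."""
    return sorted(
        local_name(s)
        for s, p, o in triples
        if p == "rdf:type" and o == "bn:Variable"
    )


def to_edge_list(triples):
    """Turn each 'child bn:hasParent parent' triple into (parent, child)."""
    return sorted(
        (local_name(o), local_name(s))
        for s, p, o in triples
        if p == "bn:hasParent"
    )


if __name__ == "__main__":
    print(to_nodes(TRIPLES))
    print(to_edge_list(TRIPLES))
```

The conditional probability tables are the harder part to express as triples, which is where reification or RDF-star would come in; this sketch covers only the graph structure.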

If the selected example problem can involve the "progressive example" of recipes used in the tutorial, that would be ideal, although it isn't necessary for us to start building out an integration. The priority is that the initial test case be simple. We can always construct recipe examples later :)

Does that describe the problem better?

The intention for this is to illustrate how to use a completely different graph technology (Bayesian networks) on graph data, which can complement the other approaches we have with NetworkX, RDFlib, pslpython, PyTorch, etc.

Many thanks, Paco

Ankush-Chander commented 3 years ago

Hey @ceteri,

Took a while to get my head around Bayesian inferencing.

Here's the test example.

P.S.: The original cancer model, although simple, made some very gloomy assumptions, so I had to choose something positive :) I hope it's simple enough for our purpose.

> 3. At that point, I'll represent in RDF (as idiomatic as possible; this becomes simpler after RDF-star is available)

Any pointers on step 3 will be helpful for me to continue.

Thanks in advance, Ankush

ceteri commented 3 years ago

Wonderful, thank you @Ankush-Chander !

Now I get to wrangle with some RDF representation, hopefully with not too much reification required :)

DerwenAI / kglab

Integrate `pgmpy` for Bayesian networks capabilities #47