davenza / PyBNesian

PyBNesian is a Python package that implements Bayesian networks.
MIT License
38 stars 10 forks source link
bayesian-networks kernel-density-estimation python

build Documentation Status PyPI

PyBNesian

Implementation

Currently PyBNesian implements the following features:

Models

which can have different types of CPDs:

with this combinations of CPDs, we implement the following types of networks (which can also be Conditional or Dynamic):

Graphs

Graph classes implement useful functionalities for probabilistic graphical models, such as moving between DAG-PDAG representation or fast access to root and leaves.

Learning

It implements different structure learning algorithms:

The score and search algorithms can be used with the following scores:

and the following the following learning operators:

The following independence tests are implemented for the constraint-based algorithms:

It also implements the parameter learning:

Inference

Not implemented right now, as the priority is the learning algorithms. However, all the CPDs and models have a sample() method, which can be used to create easily an approximate inference engine based on sampling.

Serialization

All relevant objects (graphs, CPDs, Bayesian networks, etc) can be saved/loaded using the pickle format.

Other implementations

PyBNesian exposes the implementation of other models or techniques used within the library.

Weighted sums of chi-squared random variables:

Usage example

>>> from pybnesian import GaussianNetwork, LinearGaussianCPD
>>> # Create a GaussianNetwork with 4 nodes and no arcs.
>>> gbn = GaussianNetwork(['a', 'b', 'c', 'd'])
>>> # Create a GaussianNetwork with 4 nodes and 3 arcs.
>>> gbn = GaussianNetwork(['a', 'b', 'c', 'd'], [('a', 'c'), ('b', 'c'), ('c', 'd')])

>>> # Return the nodes of the network.
>>> print("Nodes: " + str(gbn.nodes()))
Nodes: ['a', 'b', 'c', 'd']
>>> # Return the arcs of the network.
>>> print("Arcs: " + str(gbn.nodes()))
Arcs: ['a', 'b', 'c', 'd']
>>> # Return the parents of c.
>>> print("Parents of c: " + str(gbn.parents('c')))
Parents of c: ['b', 'a']
>>> # Return the children of c.
>>> print("Children of c: " + str(gbn.children('c')))
Children of c: ['d']

>>> # You can access to the graph of the network.
>>> graph = gbn.graph()
>>> # Return the roots of the graph.
>>> print("Roots: " + str(sorted(graph.roots())))
Roots: ['a', 'b']
>>> # Return the leaves of the graph.
>>> print("Leaves: " + str(sorted(graph.leaves())))
Leaves: ['d']
>>> # Return the topological sort.
>>> print("Topological sort: " + str(graph.topological_sort()))
Topological sort: ['a', 'b', 'c', 'd']

>>> # Add an arc.
>>> gbn.add_arc('a', 'b')
>>> # Flip (reverse) an arc.
>>> gbn.flip_arc('a', 'b')
>>> # Remove an arc.
>>> gbn.remove_arc('b', 'a')

>>> # We can also add nodes.
>>> gbn.add_node('e')
4
>>> # We can get the number of nodes
>>> assert gbn.num_nodes() == 5
>>> # ... and the number of arcs
>>> assert gbn.num_arcs() == 3
>>> # Remove a node.
>>> gbn.remove_node('b')

>>> # Each node has an unique index to identify it
>>> print("Indices: " + str(gbn.indices()))
Indices: {'e': 4, 'c': 2, 'd': 3, 'a': 0}
>>> idx_a = gbn.index('a')

>>> # And we can get the node name from the index
>>> print("Node 2: " + str(gbn.name(2)))
Node 2: c

>>> # The model is not fitted right now.
>>> assert gbn.fitted() == False

>>> # Create a LinearGaussianCPD (variable, parents, betas, variance)
>>> d_cpd = LinearGaussianCPD("d", ["c"], [3, 1.2], 0.5)

>>> # Add the CPD to the GaussianNetwork
>>> gbn.add_cpds([d_cpd])

>>> # The CPD is still not fitted because there are 3 nodes without CPD.
>>> assert gbn.fitted() == False

>>> # Let's generate some random data to fit the model.
>>> import numpy as np
>>> np.random.seed(1)
>>> import pandas as pd
>>> DATA_SIZE = 100
>>> a_array = np.random.normal(3, np.sqrt(0.5), size=DATA_SIZE)
>>> c_array = -4.2 - 1.2*a_array + np.random.normal(0, np.sqrt(0.75), size=DATA_SIZE)
>>> d_array = 3 + 1.2 * c_array + np.random.normal(0, np.sqrt(0.5), size=DATA_SIZE)
>>> e_array = np.random.normal(0, 1, size=DATA_SIZE)
>>> df = pd.DataFrame({'a': a_array,
...                    'c': c_array,
...                    'd': d_array,
...                    'e': e_array
...                })

>>> # Fit the model. You can pass a pandas.DataFrame or a pyarrow.RecordBatch as argument.
>>> # This fits the remaining CPDs
>>> gbn.fit(df)
>>> assert gbn.fitted() == True

>>> # Check the learned CPDs.
>>> print(gbn.cpd('a'))
[LinearGaussianCPD] P(a) = N(3.043, 0.396)
>>> print(gbn.cpd('c'))
[LinearGaussianCPD] P(c | a) = N(-4.423 + -1.083*a, 0.659)
>>> print(gbn.cpd('d'))
[LinearGaussianCPD] P(d | c) = N(3.000 + 1.200*c, 0.500)
>>> print(gbn.cpd('e'))
[LinearGaussianCPD] P(e) = N(-0.020, 1.144)

>>> # You can sample some data
>>> sample = gbn.sample(50)

>>> # Compute the log-likelihood of each instance
>>> ll = gbn.logl(sample)
>>> # or the sum of log-likelihoods.
>>> sll = gbn.slogl(sample)
>>> assert np.isclose(ll.sum(), sll)

>>> # Save the model, include the CPDs in the file.
>>> gbn.save('test', include_cpd=True)

>>> # Load the model
>>> from pybnesian import load
>>> loaded_gbn = load('test.pickle')

>>> # Learn the structure using greedy hill-climbing.
>>> from pybnesian import hc, GaussianNetworkType
>>> # Learn a Gaussian network.
>>> learned = hc(df, bn_type=GaussianNetworkType())
>>> learned.num_arcs()
2

Dependencies

The library has been tested on Ubuntu 16.04/20.04/22.04 and Windows 10/11, but should be compatible with other operating systems.

Libraries

The library depends on NumPy, Apache Arrow, pybind11, NLopt, libfort and Boost.

Installation

PyBNesian can be installed with pip:

pip install pybnesian

Build from Source

Prerequisites

Building

Clone the repository:

git clone https://github.com/davenza/PyBNesian.git
cd PyBNesian
git checkout v0.5.1 # You can checkout a specific version if you want
pip install .

Testing

The library contains tests that can be executed using pytest. They also require scipy and pandas installed.

pip install pytest scipy pandas

Run the tests with:

pytest

How to cite?

@article{Atienza2022Pybnesian,
    author = {David Atienza and Concha Bielza and Pedro Larrañaga},
    title = {PyBNesian: An extensible Python package for Bayesian networks},
    journal = {Neurocomputing},
    volume = {504},
    pages = {204-209},
    year = {2022}
}

References

[1] D. Atienza and C. Bielza and P. Larrañaga. PyBNesian: An extensible python package for Bayesian networks. Neurocomputing, 504, 2022, pp 204-209.

[2] D. Atienza and C. Bielza and P. Larrañaga. Semiparametric Bayesian networks. Information Sciences, 584, 2022, pp 564-582.

[3] D. Atienza and P. Larrañaga and C. Bielza. Hybrid Semiparametric Bayesian networks. TEST, 31(2), 2022, pp 299-327.

[4] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques, The MIT Press, 2009.

[5] J. Runge, Conditional independence testing based on a nearest-neighbor estimator of conditional mutual information. International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 84, 2018, pp. 938–947.

[6] E. V. Strobl and K. Zhang and S., Visweswaran. Approximate kernel-based conditional independence tests for fast non-parametric causal discovery. Journal of Causal Inference, 7(1), 2019, pp 1-24.