felixleopoldo / benchpress

A Snakemake workflow to run and benchmark structure learning (a.k.a. causal discovery) algorithms for probabilistic graphical models.
https://benchpressdocs.readthedocs.io
GNU General Public License v2.0

Add pyAgrum package and MIIC algorithm? #115

Open bdatko opened 5 months ago

bdatko commented 5 months ago

I think pyAgrum would be a great addition to the list of algorithms. From what I can tell, benchpress does not yet include a comparison using the Multivariate Information-based Inductive Causation (MIIC) algorithm, which pyAgrum has implemented. The library also offers a scikit-learn interface for learning classifiers, which should help with the integration into benchpress.

felixleopoldo commented 5 months ago

Hi, that sounds like a good idea. In pyAgrum you call the useMIIC function on a learner object (link and link), but it's not totally clear how to pass arguments to the algorithm, like choosing a score or test function. Do you have some sample usage? MIIC also seems to be implemented here. Do you know which one to prefer?

bdatko commented 5 months ago

@felixleopoldo useMIIC is their lower-level API, but there is a convenience class pyAgrum.skbn.BNClassifier where the default choice of learningMethod is MIIC. The other choices for learningMethod are: Chow-Liu, NaiveBayes, Tree-augmented NaiveBayes, MIIC + (MDL or NML), Greedy Hill Climb, Tabu. You can use scoringType within the initializer of pyAgrum.skbn.BNClassifier to pick your flavor: AIC, BIC, BD, BDeu, K2, Log2.

There are examples of using pyAgrum.skbn.BNClassifier within this notebook titled Learning classifiers, shown below is a call using MIIC (cell 7 from the linked notebook):

# we now use another method to learn the BN (MIIC)
BNTest = skbn.BNClassifier(learningMethod='MIIC', prior='Smoothing', priorWeight=0.5,
                           discretizationStrategy='quantile', usePR=True, significant_digit=13)

xTrain, yTrain = BNTest.XYfromCSV(filename = 'res/creditCardTest.csv', target = 'Class')

More examples using BNClassifier can be found in the notebook titled Comparing classifiers (including Bayesian networks) with scikit-learn.

I have only used pyAgrum because I don't know R, so I have never directly compared the two. pyAgrum is a Python wrapper around the aGrum C++ library, so their MIIC implementation is written in C++. That looks similar to how the original authors of MIIC provide a C++ implementation wrapped in R, but I don't know for sure.

Let me know if you need any more help. =)

felixleopoldo commented 5 months ago

Thanks. It seems like they refer to the Bayesian network as a classifier, where one variable is specified as the target? It would be nice if you could show how to do the following two steps:

  1. Learn the graph of a Bayesian network from a CSV data file (in the Benchpress data format) using relevant parameters for structure learning
  2. Write the adjacency matrix representation of the graph to a CSV file following the Benchpress graph format
bdatko commented 5 months ago
  1. Learn the graph of a Bayesian network from a CSV data file (in the Benchpress data format) using relevant parameters for structure learning

I hope the example below demos what you need.

  2. Write the adjacency matrix representation of the graph to a CSV file following the Benchpress graph format

From what I know, there isn't any convenient writer to save the adjacency matrix to CSV, so shown below is a small helper that saves the matrix in the format Benchpress expects.

The example assumes you have pyAgrum, pandas, and scikit-learn installed in your environment; you will need all three to run it.

import csv
from pathlib import Path

import pandas as pd
import pyAgrum.skbn as skbn
from pyAgrum import BayesNet

def adjacency_to_csv(bn: BayesNet, *, to_file: str):
    """Write the adjacency matrix of `bn` to a CSV file in the Benchpress graph format."""
    id_to_name = {bn.idFromName(name): name for name in bn.names()}

    with Path(to_file).open(mode="w", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        # header: node names in node-id order
        writer.writerow(id_to_name[col_id] for col_id in range(bn.size()))
        # rows: one 0/1 row per node, in the same order
        writer.writerows(bn.adjacencyMatrix())

data = pd.read_csv(
    "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
).dropna()

data.to_csv("fully_obs_titanic.csv", index=False)

classifier = skbn.BNClassifier(learningMethod="MIIC", scoringType="BIC")
xdata, ydata = classifier.XYfromCSV(filename="fully_obs_titanic.csv", target="survived")
classifier.fit(xdata, ydata)

adjacency_to_csv(classifier.bn, to_file="resulting_adjacency.csv")

Here is the resulting adjacency matrix:

❯ cat resulting_adjacency.csv
survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,1,0,0,0,0,1,0,0,0,0,0,0,0,0
0,0,1,1,0,0,0,0,0,0,1,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,1,0
0,1,0,0,0,0,0,0,1,0,0,0,0,0,0
0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,1,1,0,0,0,1,0,0,0,0,0
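
For reference, a file in this header-plus-0/1-rows layout can be read back with plain pandas to list the edges of the graph. This is a minimal sketch using a small hand-written 3-node matrix rather than the Titanic output above, and it assumes the convention (matching the helper) that a 1 in row i, column j encodes an edge from node i to node j:

```python
import io

import pandas as pd

# a tiny adjacency matrix in the same header-plus-0/1-rows layout
# (hypothetical 3-node graph a -> b -> c, not the Titanic result above)
csv_text = "a,b,c\n0,1,0\n0,0,1\n0,0,0\n"

adj = pd.read_csv(io.StringIO(csv_text))
names = list(adj.columns)

# entry (i, j) == 1 is read as an edge names[i] -> names[j]
edges = [
    (names[i], names[j])
    for i in range(len(names))
    for j in range(len(names))
    if adj.iloc[i, j] == 1
]
print(edges)  # [('a', 'b'), ('b', 'c')]
```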

I ran this example with the following environment:

Python 3.11.7
numpy               1.26.4
pandas              2.2.2
pyAgrum             1.14.0
scikit-learn        1.5.0
scipy               1.13.1
felixleopoldo commented 5 months ago

Thanks a lot. So for the target variable (survived), can we just choose the first one in the order?

bdatko commented 5 months ago

For the fit method of BNClassifier you can specify any column within the CSV file, see here. Quoted below is the documentation snippet for the target parameter:

Fits the model to the training data provided. The two possible uses of this function are fit(X,y) and fit(data=…, targetName=…). Any other combination will raise a ValueError

  • targetName (str) – specifies the name of the targetVariable in the csv file. Warning: Raises ValueError if either X or y is not None. Raises ValueError if data is None.
felixleopoldo commented 5 months ago

Ok!

phwuil commented 5 months ago

Hi @felixleopoldo , many thanks to @bdatko for this "issue".

Actually, BNClassifier is based on the BNLearner class. If you want to test the learning algorithms of pyAgrum, you should use BNLearner. MIIC is a "constraint-based" method based on mutual information. There is no score, but one can apply corrections (MDL/NML). Of course, you can add some priors for the parameter estimation.

import pyAgrum as gum

learner = gum.BNLearner("test.csv")  # MIIC is used by default (some score-based methods are also implemented)
learner.useMDLCorrection()           # for small datasets
learner.useSmoothingPrior()          # smoothing (default weight=1) for the parameters
bn = learner.learnBN()               # learning

Thanks again to @bdatko. Please tell me if you need some other snippets :-)

felixleopoldo commented 5 months ago

Hi @phwuil, thanks for the snippet. Could you show how MIIC could be run on continuous data too?

phwuil commented 5 months ago

Hi @felixleopoldo, thank you for that. pyAgrum is mainly about discrete variables. However, there are two solutions for continuous data: 1- automatic discretization, 2- CLG (an experimental Python model).

1- automatic discretization with pyAgrum.skbn.BNDiscretizer

import pyAgrum as gum
import pyAgrum.skbn as skbn

filename = "test.csv"
# BNDiscretizer has many options
disc = skbn.BNDiscretizer()
template = disc.discretizedBN(filename)

# template contains all the (discrete) variables
# that will be used for the learning
learner = gum.BNLearner(filename, template)
learner.useMDLCorrection()
learner.useSmoothingPrior()
bn = learner.learnBN()
phwuil commented 5 months ago

2- CLG: the new CLG implementation in pyAgrum 1.14.0 (pyAgrum.CLG tutorial)

import pyAgrum.clg as gclg

# no hybrid learning: pure CLG data
learner = gclg.CLGLearner(filename)
clg = learner.learnCLG()
felixleopoldo commented 5 months ago

OK. There is a new pyagrum branch where you can try pyAgrum with

snakemake --cores all --use-singularity --configfile workflow/rules/structure_learning_algorithms/pyagrum/pyagrum.json --rerun-incomplete

If you know any data scenario where it performs well, let me know!

phwuil commented 5 months ago

Hi @felixleopoldo, thank you for this. I have to admit that I did not know about Benchpress before it was pointed out to me by @bdatko. Thanks to both of you. So I will have to learn how to use it. :-) (If you have THE good ref to help, please tell me :-) !)

felixleopoldo commented 5 months ago

I see, no worries :) If you mean the main reference to Benchpress, it is here. It is not mentioned there, but you can also run Benchpress under WSL on Windows.