how to interpolate smooth distributions?

priamai commented 9 months ago

Hi there, I guess I may have already hit a limitation of the library. Any help would be great; maybe I have to move to a more complex solution. Anyway, here's my issue:


def example_learning():

    import pandas as pd

    samples = pd.DataFrame({"Host":["carl","ermano","jon"],
                       "Detection":["PsExec","PsExec","PsExec"],
                        "Outcome":["TP","FP"],
                       "HourOfDay":[5,10,13]})
    print(samples)

    structure = hh.structure.chow_liu(samples)

    bn = hh.BayesNet(*structure)
    bn = bn.fit(samples)
    bn.prepare()
    '''
    dot = bn.graphviz()

    path = dot.render('asia', directory='figures', format='svg', cleanup=True)
    '''
    print("Probability of detection")
    print(bn.P["Detection"])

    print("Probability of outcome")
    print(bn.P["Outcome"])
    print("Probability of FP at 5 am")
    event = {"Host":"carl","Detection":"PsExec","HourOfDay":5}
    bn.predict_proba(event)
    print("Probability of FP at 6 am")
    # this will fail because is unseen: how do we generalize?
    event = {"Host":"carl","Detection":"PsExec","HourOfDay":6}
    bn.predict_proba(event)

I want to predict the probability of a false positive at 6 am, which was not observed in the training set. I am not sure what the correct approach is here. Is there a way to assign a smooth distribution across the 24 hours, so that an unobserved hour gets a tiny probability?

How do other libraries like pomegranate handle this kind of situation? Cheers!

MaxHalford commented 9 months ago

Hey there! Is that example running for you? It spits an error at me because the inputs to the pandas DataFrame are not equal in length.

I know exactly the issue you're having. One way is to make each possibility appear at least once in the dataframe you provide to fit. That way, each possibility has been seen at least once, so the probability of any event will be greater than 0.

priamai commented 9 months ago

So in my case I would have to make sure that all the 24 hours are available before I can make a prediction.
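
A minimal sketch of what that could look like, assuming that adding one pseudo-row per combination of values is acceptable. It uses only pandas and itertools; the names `domains`, `pseudo_rows` and `augmented` are illustrative, and the hour range 0 to 23 is an assumption about how HourOfDay is encoded.

```python
import itertools
import pandas as pd

samples = pd.DataFrame({
    "Host": ["carl", "ermano", "jon", "albert"],
    "Detection": ["PsExec", "PsExec", "PsExec", "Quarantine"],
    "Outcome": ["TP", "FP", "FP", "TP"],
    "HourOfDay": [5, 10, 13, 6],
})

# Domain of each variable: observed values by default...
domains = {col: sorted(samples[col].unique().tolist()) for col in samples.columns}
# ...but the full 24 hours for HourOfDay, whether observed or not (assumed encoding 0..23).
domains["HourOfDay"] = list(range(24))

# One pseudo-row per combination of values across all columns.
pseudo_rows = pd.DataFrame(
    list(itertools.product(*domains.values())),
    columns=list(domains.keys()),
)

# Real samples followed by the pseudo-rows.
augmented = pd.concat([samples, pseudo_rows], ignore_index=True)
print(len(pseudo_rows), len(augmented))
```

The resulting frame could then be passed to fit in place of samples. Each pseudo-row counts as a real observation, so with only four real samples the pseudo-rows dominate the fitted probabilities; this is only reasonable when the real data far outnumber them.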

priamai commented 9 months ago

However this still happens:

import sorobn as hh

def example_dag():
    # simple equivalent notion
    bn = hh.BayesNet(
    ('Host', 'Alarm'),
    ('Alarm', 'True Positive'),
    ('Alarm', 'False Positive'),
    seed=42,
    )

    bn = hh.BayesNet((["Host"],"Alarm"),
                     ("Alarm",["True Positive","False Positive"]),seed=42)

def example_learning():

    import pandas as pd

    samples = pd.DataFrame({"Host":["carl","ermano","jon","albert"],
                       "Detection":["PsExec","PsExec","PsExec","Quarantine"],
                        "Outcome":["TP","FP","FP","TP"],
                       "HourOfDay":[5,10,13,6]})
    print(samples)

    structure = hh.structure.chow_liu(samples)

    bn = hh.BayesNet(*structure)
    bn = bn.fit(samples)
    bn.prepare()
    '''
    dot = bn.graphviz()

    path = dot.render('asia', directory='figures', format='svg', cleanup=True)
    '''
    print("Probability of detection")
    print(bn.P["Detection"])

    print("Probability of outcome")
    print(bn.P["Outcome"])
    print("Probability of FP at 5 am")
    event = {"Host":"carl","Detection":"PsExec","HourOfDay":5}
    bn.predict_proba(event)
    print("Probability of FP at 6 am")
    # this will fail because is unseen: how do we generalize?
    event = {"Host":"carl","Detection":"PsExec","HourOfDay":6}
    bn.predict_proba(event)

example_learning()

The 6 am value is now in the data, but if I don't provide an event that exactly matches a row in the dataset, it complains. I am a bit sceptical of the applicability; one would expect a basic level of generalization.

priamai commented 9 months ago

So just to be clear: the first event works because it is identical to a row in the dataset, but the second one fails, as the model doesn't seem to interpolate the probability...

    # this is fine but is exactly the same event ....
    event = {"Host":"albert","Detection":"Quarantine","HourOfDay":6}
    bn.predict_proba(event)

    # this still fails...
    event = {"Host":"carl","Detection":"PsExec","HourOfDay":6}
    bn.predict_proba(event)

MaxHalford commented 9 months ago

What I was saying is that you can ensure every case is seen by the BN by calculating a Cartesian product between all values. This way, each occurrence appears at least once. This works:

import itertools
import sorobn as hh
import pandas as pd

samples = pd.DataFrame({"Host":["carl","ermano","jon","albert"],
                    "Detection":["PsExec","PsExec","PsExec","Quarantine"],
                    "Outcome":["TP","FP","FP","TP"],
                    "HourOfDay":[5,10,13,6]})

unique_values = [samples[col].unique() for col in samples.columns]
cartesian_product = list(itertools.product(*unique_values))
cartesian_df = pd.DataFrame(cartesian_product, columns=samples.columns)

structure = hh.structure.chow_liu(samples)

bn = hh.BayesNet(*structure)
bn = bn.fit(pd.concat([samples, cartesian_df]))
bn.prepare()
'''
dot = bn.graphviz()

path = dot.render('asia', directory='figures', format='svg', cleanup=True)
'''
print("Probability of detection")
print(bn.P["Detection"])

print("Probability of outcome")
print(bn.P["Outcome"])
print("Probability of FP at 5 am")
event = {"Host":"carl","Detection":"PsExec","HourOfDay":5}
bn.predict_proba(event)
print("Probability of FP at 6 am")
# this combination is absent from the original samples, but the Cartesian product rows cover it
event = {"Host":"carl","Detection":"PsExec","HourOfDay":6}
bn.predict_proba(event)

I agree that this should be a smoother experience. The BN could use a prior and output a (very) low probability for cases not seen in the training data. I don't have time to work on this right now, but I will. In the meantime, this Cartesian product trick should work.
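
To illustrate what such a prior could look like, here is a rough sketch of additive (Laplace) smoothing applied to a single conditional probability table, computed with plain pandas. This is not sorobn's API: `smoothed_cpt` is a hypothetical helper, `alpha` is an assumed pseudo-count, and the domains are passed in explicitly so that unobserved hours get a row too.

```python
import pandas as pd

samples = pd.DataFrame({
    "Host": ["carl", "ermano", "jon", "albert"],
    "Detection": ["PsExec", "PsExec", "PsExec", "Quarantine"],
    "Outcome": ["TP", "FP", "FP", "TP"],
    "HourOfDay": [5, 10, 13, 6],
})


def smoothed_cpt(df, child, parent, child_domain, parent_domain, alpha=1.0):
    """Estimate P(child | parent) with additive smoothing (hypothetical helper)."""
    # Raw co-occurrence counts, extended with zeros for values never observed.
    counts = pd.crosstab(df[parent], df[child]).reindex(
        index=parent_domain, columns=child_domain, fill_value=0
    )
    # Add the pseudo-count alpha to every cell, then normalise each row.
    smoothed = counts + alpha
    return smoothed.div(smoothed.sum(axis=1), axis=0)


cpt = smoothed_cpt(
    samples,
    child="Outcome",
    parent="HourOfDay",
    child_domain=["FP", "TP"],
    parent_domain=list(range(24)),
)
print(cpt.loc[6])  # observed hour: pulled towards the single TP seen at 6 am
print(cpt.loc[7])  # unobserved hour: falls back to a uniform distribution
```

With alpha=1, every cell is strictly positive, so no event gets a probability of zero. A smaller alpha keeps the estimates closer to the raw frequencies while still avoiding zeros.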

MaxHalford / sorobn

how to interpolate smooth distributions? #28