ML-KULeuven / problog

ProbLog is a Probabilistic Logic Programming Language for logic programs with probabilities.
https://dtai.cs.kuleuven.be/problog/
317 stars 35 forks source link

Exponential runtime increase #90

Closed smejkka3 closed 2 years ago

smejkka3 commented 2 years ago

Hey All,

I am having some issues with increasingly exponential runtime using predefined rule inference of facts after upgrading to python 3.10 from 3.8.

I have several rules implemented in python 3.10 and a .txt file with triples which are used as facts represented as Var().

I'm using the ProbLog in python program (from PyPi, version 2.2.3).

The program is defined as followed:

        TRIPLE = Term('triple')
        NEWTRIPLE = Term('newtriple')
        query = Term('query')
        SUBCLASS = Term('<http://www.w3.org/2000/01/rdf-schema#subClassOf>')
        TYPE_ = Term('<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>')
        DOMAIN_ = Term('<http://www.w3.org/2000/01/rdf-schema#domain>')
        RANGE_ = Term('<http://www.w3.org/2000/01/rdf-schema#range>')
        SUBPROPERTY = Term('<http://www.w3.org/2000/01/rdf-schema#subPropertyOf>')

        X, Y, Z, R, R1 = Var('X'), Var('Y'), Var('Z'), Var('R'), Var('R1')
        base = NEWTRIPLE(X, R, Y) << TRIPLE(X, R, Y)
        rdfs2 = NEWTRIPLE(X, TYPE_, Y) << ( TRIPLE(R, DOMAIN_, Y) & NEWTRIPLE(X, R, Z) )
        rdfs3 = NEWTRIPLE(X, TYPE_, Y) << ( TRIPLE(R, RANGE_, Y) & NEWTRIPLE(Z, R, X) )
        rdfs9 = NEWTRIPLE(X, TYPE_, Y) << ( TRIPLE(Z, SUBCLASS, Y) & NEWTRIPLE(X, TYPE_, Z) )
        rdfs5 = NEWTRIPLE(X, SUBPROPERTY, Y) << ( TRIPLE(X, SUBPROPERTY, Z) & NEWTRIPLE(Z, SUBPROPERTY, Y) )
        rdfs7 = NEWTRIPLE(X, R, Y) << ( NEWTRIPLE(R1, SUBPROPERTY, R) & TRIPLE(X, R1, Y) )
        rdfs11 = NEWTRIPLE(X, SUBCLASS, Y) << ( TRIPLE(X, SUBCLASS, Z) & NEWTRIPLE(Z, SUBCLASS, Y) )

        p = SimpleProgram()

        for t in self.triples:
            p += TRIPLE(Term(t[0]),Term(t[1]),Term(t[2]), p=t[3]) 

        p += base   
        p += rdfs2
        p += rdfs3
        p += rdfs5
        p += rdfs7
        p += rdfs9
        p += rdfs11

        p += query(NEWTRIPLE(X, R, Y))
        p += query(NEWTRIPLE(X, TYPE_, Y))
        p += query(NEWTRIPLE(X, SUBPROPERTY, Y))
        p += query(NEWTRIPLE(X, SUBCLASS, Y))

        start = time.process_time()
        result = get_evaluatable().create_from(p).evaluate()

The Variables are replaced by data from yago-4 dataset: https://yago-knowledge.org/data/yago4/en/2020-02-24/yago-wd-schema.nt.gz https://yago-knowledge.org/data/yago4/en/2020-02-24/yago-wd-annotated-facts.ntx.gz

As the dataset is very big I started using only small subset 1000 (1min), 2000 (5mins), 5000 (20mins), 10000(40mins), 50000(2.5h) triples(runtime). When I run it for 100k + triples the program didn't finish even after ~35 hours. Is this expected behavior for the rules I'm using? I understand there could be a cycles within the data, but as I remember I was running the same code for the same dataset with more then 1mil triples and it finished overnight when i had python 3.8 installed on the server.

The rules above is recursive version of https://www.researchgate.net/figure/RDF-RDFS-entailment-rules_tbl1_268419911 to create linearity in the rules so I'm not running into the infinite loops in the dataset.

VincentDerk commented 2 years ago

Hi,

Without having tried on Python 3.10 myself yet, did you make sure both cases are comparable? With this I mean: if you installed PySDD on Python 3.8, did you also install it for Python 3.10 and vice versa?

smejkka3 commented 2 years ago

Hi,

thank you very much for response. I installed PySSD and it didn't help. However I found out that suppling probability to the Term was causing this, even though the probability was set to 1.0 in the first iteration to every term. Is there particular reason why adding probability would be causing such a dramatic slowdown in Python?

VincentDerk commented 2 years ago

Not sure whether I understood you correctly. In general, if you have a weight of one that never changes (e.g. 1.0::a.), it is best to remove it (e.g. a.). In this way the term is deterministic rather than probabilistic and it may be optimised out during the grounding. The run time complexity of weighted model counting / knowledge compilation, which is used for the exact probabilistic inference, is worst case exponential in the number of variables so allowing variables to be optimized out may help.

For large programs where you query few variables the "-k sdd" compilation (automatic when you install PySDD) may work best due to its used encoding.

Is there still a significant run time difference between Python 3.8 and Python 3.10 when they are compared using the same encoding? i.e. both PySDD (-k sdd) or both dSharp (-k ddnnf).

smejkka3 commented 2 years ago

Hi,

Your answer makes sense; I made my program deterministic, which made it run faster.

I did some evaluations on the python versioning on three different systems, and python 3.10, and python 3.8 and python 3.10 is slower in every case. Still, even for datasets with many variables, the difference wasn't big and didn't increase dramatically. Approx 5% slowdown in python 3.10 compared to python 3.8