fujaba / org.fujaba.graphengine

This project's aim is to build a graph engine that can build and compare graphs, match patterns, and apply actions on the graph, so that it can effectively be used as a graph-transformation framework.
BSD 3-Clause "New" or "Revised" License

try to improve performance of isomorphism checks and pattern matching #28

Open hoechp opened 7 years ago

hoechp commented 7 years ago

try to improve performance of isomorphism checks and pattern matching!

hoechp commented 7 years ago

Now the pattern matching works like the CSP-based isomorphism check algorithm and seems to be a lot faster.

hoechp commented 7 years ago

Since both the (probably) best isomorphism check and my only pattern matching algorithm are now based on something that resembles solving a Constraint Satisfaction Problem, I could use CSP heuristics to improve performance. There are:

  1. the heuristic of the maximally restricted variable
  2. the heuristic of the minimum node order
  3. the heuristic of the minimal conflict

how well are those applicable?

  1. the maximally restricted variable in my code is the 'subGraph'- or 'pattern'-node with the lowest number of elements in its couldMatch-list. Since the possible matches were already calculated to restrict the 'variables' as much as initially possible, this sounds good. I could just find the node with the minimal candidate count by looking at the couldMatch-lists.
  2. the node with minimum order in my code is the 'subGraph'- or 'pattern'-node with the highest number of outgoing (as well as incoming) edges. I think that in my code mostly the outgoing edges matter - only for negative nodes are incoming edges checked separately. So it would be easy (and fast) to check for the node with the most outgoing edges. This could probably help.
  3. the minimal-conflict choice is the mapping of a 'subGraph'- or 'pattern'-node to a node of the graph to match that would make the fewest other selections fail. Here I'm not sure if it's worth the calculation - I think it could even easily backfire, but I'm not sure...

A feasible combination, as taught in the A.I. lecture, would be to first use (1), then choose between nodes with the same number of candidates using (2), and possibly even select candidates with (3). But the way my algorithm works, I can't just change the order of my checks or go ahead and skip several of them.

Without changing much else, I could select the first node with (1) and (2) and order the matches for all nodes by (3) - but only with regard to the initial situation, before any chosen candidates are considered.
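
A minimal sketch of how (1) and (2) could drive the variable selection, with (3) used only to pre-order the initial candidate lists. The `PatternNode`/`couldMatch` shapes below are placeholders for illustration, not the actual YAGE data structures:

```java
import java.util.*;

// Illustrative sketch only: PatternNode and its couldMatch list stand in for
// whatever structures the matcher actually uses.
class HeuristicsSketch {

    static class PatternNode {
        List<Object> couldMatch = new ArrayList<>(); // candidate host-graph nodes
        int outgoingEdgeCount;                       // used as a degree-like tie-breaker
    }

    /**
     * (1) + (2): pick the pattern node with the fewest remaining candidates;
     * among ties, prefer the one with the most outgoing edges.
     */
    static PatternNode selectNextVariable(List<PatternNode> unassigned) {
        PatternNode best = null;
        for (PatternNode node : unassigned) {
            if (best == null
                    || node.couldMatch.size() < best.couldMatch.size()
                    || (node.couldMatch.size() == best.couldMatch.size()
                        && node.outgoingEdgeCount > best.outgoingEdgeCount)) {
                best = node;
            }
        }
        return best;
    }

    /**
     * (3), applied only to the initial situation: order a node's candidates so
     * that those ruling out the fewest other candidates come first. The conflict
     * counts are assumed to be precomputed once, before any assignment is made.
     */
    static void orderInitialCandidates(PatternNode node, Map<Object, Integer> conflictCount) {
        node.couldMatch.sort(Comparator.comparingInt(c -> conflictCount.getOrDefault(c, 0)));
    }
}
```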

At least my code would look more academic if I tried to do all this - but I'll just have to try it out to see if it really helps at all.

(Intuitively, I think this will backfire for checks that mostly fail - the additional calculations will only delay the failure after complete checking. But for successful matches it will probably improve performance notably - so the ratio of successes to failures per check is important for the potential speedup.)

hoechp commented 7 years ago

I implemented all sorts of heuristics, but it's still pretty improvised code with bad performance and all kinds of unnecessary calculations. One of these versions runs a bit faster than my other handlers: for example, the previously fastest handler for the roadwork example took 7 minutes on my PC, while one of the new handlers takes 5 minutes - despite the rather bad code it uses.

I'm sure it wouldn't take much effort to do a proper implementation of this concept of a 'true' CSP-oriented isomorphism check that uses all the performant heuristics it can, whenever possible.

hoechp commented 7 years ago

(original) starting to build reachability graph for RoadworkExample...
done building reachability graph for RoadworkExample after 232838.115 ms.
== 232.838324443 s
== 3.8806395861 m
== 0.06467733189666666 h
4352 nodes in the 'reachabilityGraph'

It already completes in under 4 minutes, but there's still room for improvement...

hoechp commented 7 years ago

Quick update: I started caching node-evaluator objects, now knowing they were built way too often - thanks to Christoph's profiling efforts.
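
The caching itself is essentially memoization; a rough sketch of the pattern (the `NodeEvaluator` name and the String key are made up here, not the engine's actual classes):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of caching expensive per-node evaluator objects instead of rebuilding
// them on every use. 'NodeEvaluator' and the cache key are illustrative only.
class EvaluatorCache {

    private final Map<String, NodeEvaluator> cache = new HashMap<>();

    NodeEvaluator get(String expression) {
        // build the evaluator at most once per expression, then reuse it
        return cache.computeIfAbsent(expression, NodeEvaluator::new);
    }

    static class NodeEvaluator {
        NodeEvaluator(String expression) {
            // expensive construction (e.g. parsing the expression) happens only here
        }
    }
}
```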

Additionally, I adjusted the building of reachability graphs with my 'normal form'/'sort'-based algorithm to make better use of the normal form, and I also increased the performance of the sorting itself. For the roadwork example this drastically improved the performance, from the slowest (or one of the slowest) algorithms to the best one.

Now it takes 106471.3 ms for the 'normal form'-based reachability graph, while SDMLib takes 75162.6 ms - so YAGE now takes just 40% longer, instead of the 210% longer claimed in my bachelor thesis (which was measured with another, CSP-based algorithm that was performing best back then).

SDMLib is now just 30% quicker than YAGE for this example - with bigger examples, YAGE would probably perform even better, potentially beating SDMLib because of the O(n) complexity of each comparison within the normalized reachability graph. Of course, building the normalized graphs takes some effort too - but it is way better than doing full-blown isomorphism checks for each pair of graphs.
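
The underlying idea, sketched with placeholder names: once every state graph is reduced to its normal-form String, duplicate detection inside the reachability graph becomes a set lookup on those Strings rather than a pairwise isomorphism check.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch only: 'canonicalForm' stands for whatever String the normal-form/
// sort-based serialization produces. Assuming two graphs are isomorphic exactly
// when their normal forms are equal, duplicate states are found by set lookup.
class NormalFormIndex {

    private final Set<String> knownForms = new HashSet<>();

    /** @return true if an isomorphic graph (same normal form) was seen before */
    boolean containsIsomorphic(String canonicalForm) {
        return knownForms.contains(canonicalForm);
    }

    void add(String canonicalForm) {
        knownForms.add(canonicalForm);
    }
}
```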

hoechp commented 7 years ago

Another test now showed that SDMLib actually does scale pretty well. For a bigger map of the RoadworkExample - driving SDMLib almost to an OutOfMemory error - YAGE took not 40% but 70% longer than SDMLib. The reason probably is that SDMLib uses hashes to eliminate most unnecessary isomorphism checks - which is similar to the concept of the sort-based algorithm, but more performant. While YAGE is able to avoid classic isomorphism checks altogether, in this example that didn't bring the better scaling that could be expected. For examples where SDMLib's hashing wouldn't help that much, though, you could probably see the effect.

One way to keep increasing performance would probably be to introduce something like SDMLib's hashing in YAGE, too. For the Strings used in sort-based reachability graphs, one could simply hash the Strings for better performance. A general solution, though, would be to hash as much easily accessible information of the graph as is guaranteed to be equal for isomorphic graphs, so that differing hashes rule out isomorphism - for example the number of nodes, the number of edges, or the sum of the hash values of the data on each edge, and the same for attributes. I suspect SDMLib uses a similar approach.

All this information shouldn't be extracted too often - for example once for each result graph while building the reachability graph. The hash could then be used to eliminate most unnecessary isomorphism checks and be stored for further comparisons if the graph turns out to be new. This would potentially help all implementations of isomorphism checks, not just the sort-based one. The sort-based reachability graph building works differently, though, and should probably use its own kind of hashing - just with the Strings.
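
A sketch of the kind of fingerprint described above, with placeholder `Graph`/`Node`/`Edge` accessors rather than the real engine API. The fingerprint only has to be identical for isomorphic graphs (all ingredients are order-independent), so differing fingerprints can safely skip the expensive isomorphism check:

```java
import java.util.List;

// Sketch of a cheap structural fingerprint: isomorphic graphs always get the
// same value, so two graphs with different fingerprints cannot be isomorphic.
// Graph/Node/Edge are placeholders for the engine's actual model classes.
class GraphFingerprint {

    interface Edge { String getLabel(); }
    interface Node { List<Edge> getOutgoingEdges(); List<String> getAttributeValues(); }
    interface Graph { List<Node> getNodes(); }

    static long fingerprint(Graph g) {
        long nodeCount = g.getNodes().size();
        long edgeCount = 0;
        long labelSum = 0;     // sums are order-independent, hence isomorphism-invariant
        long attributeSum = 0;
        for (Node node : g.getNodes()) {
            for (Edge edge : node.getOutgoingEdges()) {
                edgeCount++;
                labelSum += edge.getLabel().hashCode();
            }
            for (String value : node.getAttributeValues()) {
                attributeSum += value.hashCode();
            }
        }
        return ((nodeCount * 31 + edgeCount) * 31 + labelSum) * 31 + attributeSum;
    }

    /** Run the full isomorphism check only if the cheap fingerprints agree. */
    static boolean possiblyIsomorphic(Graph a, Graph b) {
        return fingerprint(a) == fingerprint(b);
    }
}
```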

hoechp commented 7 years ago

Okay, after introducing hashing of graphs within the reachability graph, YAGE now has the same runtime as SDMLib for the original RoadworkExample (using a CSP-based algorithm again). SDMLib: 74907 ms; YAGE: 76004 ms. Okay, YAGE took 1.1 seconds longer, but that's just 1.5% longer runtime.

edit: another CSP-based algorithm - using more heuristics - is even faster than SDMLib for the original RoadworkExample (YAGE: 64512 ms; SDMLib: 76004 ms). So SDMLib takes 18% longer for this test (and compared to using no heuristics, I managed to eliminate a third of the computational effort purely through basic CSP heuristics - without them, I wouldn't be able to beat SDMLib here). More tests coming up.

hoechp commented 7 years ago

The average of five tests each shows 64141 ms for the fastest YAGE algorithm (CSP with high heuristics usage) on the RoadworkExample and 74784 ms for SDMLib, so SDMLib is a slight 16-17% slower in this test. YAGE's CSP-based algorithm with low heuristics usage scores 73229 ms. Without heuristics, YAGE's depth-first backtracking takes 96041 ms. YAGE's sort-based/normal-form algorithm takes 106591 ms. The combinatorial algorithm and a badly implemented new parallel algorithm, though, take many times longer than all previously mentioned algorithms. It seems that the algorithms that perform well - or are even faster than SDMLib - scale worse than SDMLib, though. Tests of the current implementation with a bigger map for the RoadworkExample will be running for the next few hours.

edit: (test with a bigger map for the RoadworkExample, almost causing an OutOfMemory error in SDMLib)

the original example:

the minimal example:

It already looks like YAGE is not bad, but it has an issue with scaling. I have to look into this...

hoechp commented 7 years ago

I now check the isomorphism conditions directly within all depth-first-backtracking-type algorithms, instead of first calculating homomorphisms and then checking whether they also work backwards (a stripped-down sketch of the difference follows below). This drastically increases the efficiency of the backtracking algorithms and most notably of their heuristics, which were kind of sabotaged by the old technique. So here are the new performance test runs - again with the RoadworkExample:

Bigger map with 13 tracks:

Original map with 11 tracks:

Minimal map with 5 tracks:

Finally, one algorithm is always the best - and better than SDMLib even with the big examples. But still, SDMLib seems to scale better, so for much bigger examples on machines with enough RAM and so on, SDMLib would probably beat YAGE. On the other hand, maybe the scalability of YAGE could be improved. I have to look deeper into that...
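
As promised above, here is the stripped-down sketch of the change to the backtracking algorithms, with plain adjacency sets and none of the heuristics: the isomorphism conditions (injectivity plus edge preservation in both directions) are enforced while assigning, instead of validating a finished homomorphism backwards afterwards. This only illustrates the idea, it is not the actual YAGE code:

```java
import java.util.*;

// Illustrative backtracking sketch: isomorphism conditions are checked during
// assignment. Graphs are given as adjacency sets of node indices (placeholder
// representation, not the engine's model).
class DirectIsoSketch {

    static boolean isomorphic(List<Set<Integer>> adjA, List<Set<Integer>> adjB) {
        if (adjA.size() != adjB.size()) return false;
        int[] mapping = new int[adjA.size()];
        Arrays.fill(mapping, -1);
        return assign(0, mapping, new boolean[adjB.size()], adjA, adjB);
    }

    static boolean assign(int a, int[] mapping, boolean[] used,
                          List<Set<Integer>> adjA, List<Set<Integer>> adjB) {
        if (a == adjA.size()) return true;          // every node mapped consistently
        for (int b = 0; b < adjB.size(); b++) {
            if (used[b]) continue;                  // injectivity (bijection on equal sizes)
            if (consistent(a, b, mapping, adjA, adjB)) {
                mapping[a] = b;
                used[b] = true;
                if (assign(a + 1, mapping, used, adjA, adjB)) return true;
                mapping[a] = -1;                    // backtrack
                used[b] = false;
            }
        }
        return false;
    }

    // an edge between already-assigned nodes must exist in A exactly when the
    // corresponding edge exists in B - checked in both directions
    static boolean consistent(int a, int b, int[] mapping,
                              List<Set<Integer>> adjA, List<Set<Integer>> adjB) {
        for (int other = 0; other < a; other++) {
            if (adjA.get(a).contains(other) != adjB.get(b).contains(mapping[other])) return false;
            if (adjA.get(other).contains(a) != adjB.get(mapping[other]).contains(b)) return false;
        }
        return true;
    }
}
```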

So the newest improvements mostly helped the CSP algorithm with high heuristics usage (compared to the last version):

hoechp commented 7 years ago

Treating graph isomorphism as a CSP is a very promising approach. So here's the math (attached photo: 20170302_001013): this is a way of defining a graph isomorphism (for the types of graphs used in my bachelor thesis) as a CSP (as defined in the artificial intelligence lecture). The last two conditions on the variables' domains are optional, but very important for proper performance if you actually implement an algorithm around this.
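
Since the photo itself isn't reproduced here, a rough sketch of what such a CSP formulation could look like (a reconstruction under general assumptions about directed, labeled graphs with node attributes, not necessarily identical to the definition in the image):

```latex
% Sketch: graph isomorphism between G1 = (V1, E1) and G2 = (V2, E2) as a CSP
\begin{align*}
\text{Variables: }   & X = \{\, x_v \mid v \in V_1 \,\}\\
\text{Domains: }     & D(x_v) = \{\, w \in V_2 \mid \mathrm{attr}(v) = \mathrm{attr}(w),\ \deg^{\pm}(v) = \deg^{\pm}(w) \,\}\\
\text{Constraints: } & x_v \ne x_{v'} \text{ for all } v \ne v'
                       \quad \text{(injective, hence bijective if } |V_1| = |V_2|\text{)}\\
                     & (v, v') \in E_1 \Leftrightarrow (x_v, x_{v'}) \in E_2
                       \quad \text{(edge structure, including labels, preserved in both directions)}
\end{align*}
```

The attribute and degree restrictions on the domains would presumably be the kind of optional-but-performance-critical conditions meant above: they are not needed for correctness, but they shrink the domains before any search happens.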

The best YAGE algorithm (CSP-based depth-first backtracking with dynamic variable-ordering heuristics) is pretty much exactly the CSP shown here, solved using heuristics as taught in the artificial intelligence lecture.