ciaranm / glasgow-subgraph-solver

A solver for subgraph isomorphism problems, based upon a series of papers by subsets of McCreesh, Prosser, and Trimble.
MIT License

Using Glasgow to solve the subgraph isomorphism problem locally #6

Open lenksix opened 3 years ago

lenksix commented 3 years ago

Hello, I am trying to address the following problem with the Glasgow solver: given a possibly large undirected and unlabelled graph G, I need to enumerate all the (non-induced) subgraphs t that are isomorphic to a given small target graph T and that contain a fixed edge e of G. This can be viewed as a version of the subgraph isomorphism problem that is local to the edge e.

For simplicity of discussion, let me assume the target is symmetric and all of its nodes have the same degree, as in a triangle or a square. My idea was to assign a colour, say red, to the edge e of G that must be contained in every subgraph t in the output, and to assign the same colour to one edge of the target graph T (since the pattern is symmetric, any edge will do).
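For concreteness, here is roughly how I build the pattern (a simplified sketch of my code; the InputGraph constructor arguments and the use of add_directed_edge for labelled edges are what I gathered from the headers, so adjust to your checkout):

```cpp
#include "formats/input_graph.hh"  // adjust the include path to your checkout
#include <string>

// Build the square pattern T with one edge coloured "red". G is built the
// same way: the fixed edge e gets "red", every other edge gets "plain".
// Assumption: an undirected labelled edge is added as two directed edges
// carrying the same label.
auto make_square_pattern() -> InputGraph
{
    InputGraph t(4, false, true);  // 4 vertices, no vertex labels, edge labels
    auto add_undirected = [&] (int a, int b, const std::string & label) {
        t.add_directed_edge(a, b, label);
        t.add_directed_edge(b, a, label);
    };
    add_undirected(0, 1, "red");   // the coloured edge; by symmetry any one will do
    add_undirected(1, 2, "plain");
    add_undirected(2, 3, "plain");
    add_undirected(3, 0, "plain");
    return t;
}
```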

I implemented the above procedure by calling Glasgow's API from my own program: I instantiated a default HomomorphismParams object and loaded G and T into two InputGraph objects created with the correct number of nodes, undirected, and with edge labels enabled. I set the solver to the same parameters that glasgow_subgraph_solver uses when executed with the flag --print-all-solutions, and I set the callback to save all the mappings identified by the solver into a vector. To check the correctness of this procedure, I loaded the input graph with an external library (using an adjacency list) and implemented a hand-made algorithm for the desired problem on square targets, running in O(d(u)d(v) log(d_max)) time, where e = (u, v), d() denotes node degree, and d_max denotes the maximum degree of a node.
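The driver is roughly the following (simplified, omitting the timeout and restart-schedule setup that the command-line tool does; the member names are from the homomorphism.hh I built against and may differ in other versions):

```cpp
#include "homomorphism.hh"  // adjust the include path to your checkout
#include <vector>

// Enumerate every mapping of the pattern T into the big graph G, saving each
// one; the "red" edge label restricts solutions to those using the fixed edge.
auto enumerate_all(const InputGraph & pattern, const InputGraph & target)
    -> std::vector<VertexToVertexMapping>
{
    std::vector<VertexToVertexMapping> mappings;

    HomomorphismParams params;
    params.injectivity = Injectivity::Injective;  // subgraph isomorphism
    params.induced = false;                       // non-induced subgraphs
    params.count_solutions = true;                // what --print-all-solutions enables
    params.enumerate_callback = [&] (const VertexToVertexMapping & m) -> bool {
        mappings.push_back(m);
        return true;                              // true = keep enumerating
    };

    auto result = solve_homomorphism_problem(pattern, target, params);
    return mappings;
}
```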

Fixing the same edge for both algorithms, with the target being a square graph, both algorithms report the same subgraphs in output (except that Glasgow reports each subgraph twice), so the procedure I developed is correct. Surprisingly, though, even on small graphs the Glasgow-based procedure is much slower than the simple hand-made one. I was expecting Glasgow to have a somewhat higher running time, given that it addresses the problem in a much more general way, but the differences are very large: on a graph with 44090 nodes and 52222 edges, Glasgow takes approximately 90 seconds to output an answer versus at most 20 milliseconds for the hand-made algorithm. The same happens on a graph with 45813 nodes and 183412 edges, where Glasgow takes 100 seconds versus 40 milliseconds, and in many other cases (targets and graphs).

I was wondering if there is a way to configure the solver to take advantage of the much easier problem I am addressing, since Glasgow is computing more mappings than I need, and whether it is possible to exploit the "locality" of the problem. Thanks for any advice.

ciaranm commented 3 years ago

The first thing I'd look at would be whether the solver is taking a long time because the instance is hard, or whether it's because it's spending ages doing preprocessing that doesn't really help. Can you try two things for me?

First, take a look at the "nodes" member of the HomomorphismResult struct returned by solve_homomorphism_problem. This tells you the number of recursive calls needed to solve the problem. If this number is roughly the same as the number of solutions, then the solver isn't exploring many dead ends, but if it's much bigger then the solver is exploring lots of infeasible partial solutions.
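With the callback you describe, the comparison is just this (a fragment, continuing the enumerate_all sketch in your first message):

```cpp
// At the end of enumerate_all, after solve_homomorphism_problem returns:
std::cerr << "nodes: " << result.nodes
    << ", solutions: " << mappings.size() << "\n";
// nodes close to the number of solutions: almost no wasted search;
// nodes much larger: the solver is exploring many infeasible partial solutions.
```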

Second, try setting params.no_supplementals = true and/or params.no_nds = true. This will disable two kinds of preprocessing that can potentially be very expensive and not very helpful on large sparse target graphs.
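That is, before calling the solver:

```cpp
params.no_supplementals = true;  // skip building the supplemental (path) graphs
params.no_nds = true;            // skip the neighbourhood degree sequence filtering
```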

lenksix commented 3 years ago

Thanks for the quick reply. Looking at the number of "nodes", corresponding to the recursive calls made by the solver, I find it very close to the number of solutions in output, so I guess not much time beyond the useful work is being spent in the exploration step.

I tried the options you suggested, and in general (over different graphs and targets) setting them both to true gives the best speedup, although on some datasets setting only params.no_supplementals = true performs better. The overall time (for the experiments described in my first message) dropped from 90 to 5 seconds on the first dataset and from 100 to 6 seconds on the second, which is a significant improvement!

Is there any other setting I can change to improve the running time further? It is still nearly 3 orders of magnitude away from the version I am testing against, and I need to scale this Glasgow-based approach to massive datasets with 10 or 100 million edges, so any improvement could save a lot of computation time. Thanks again!

ciaranm commented 3 years ago

The main issue is that the solver uses a bitset adjacency matrix representation for storing the target graph, so the memory requirement is proportional to the number of target vertices squared, rather than the number of target edges. This is the right choice for solving instances that are hard for computational rather than scale reasons, but it does mean in practice that you can't really use it for target graphs with more than maybe ten thousand vertices.
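To put numbers on it, ignoring the supplemental graphs (each of which roughly costs this much again):

```cpp
#include <cstdio>

int main()
{
    unsigned long long n = 44090;          // vertices in your first test graph
    unsigned long long bytes = n * n / 8;  // one bit per ordered vertex pair
    std::printf("%llu bytes (~%.0f MB)\n", bytes, bytes / 1e6);  // ~243 MB
    return 0;
}
```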

If the pattern graphs you're using are complex enough that being clever is potentially worthwhile, maybe you could call the solver many times through a decomposition approach? Do you know, roughly, what the diameters of your graphs are? If your pattern graph has low diameter, then most of the target graph can be thrown away because it is too far away from the fixed edge.
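As a sketch (plain adjacency-list BFS, nothing solver-specific): if the pattern has diameter d and must use the edge (u, v), then every vertex of a solution is within distance d of u or v, since embeddings can only shorten distances. So something like this would carve out the only part of the target graph that matters:

```cpp
#include <queue>
#include <vector>

// Collect every vertex within distance max_dist of either endpoint of the
// fixed edge (u, v); only the subgraph induced on these vertices can host
// a solution, so it alone needs to be handed to the solver.
auto ball_around_edge(const std::vector<std::vector<int>> & adj,
        int u, int v, int max_dist) -> std::vector<int>
{
    std::vector<int> dist(adj.size(), -1);
    std::queue<int> q;
    dist[u] = 0; q.push(u);
    dist[v] = 0; q.push(v);
    std::vector<int> reached{ u, v };
    while (! q.empty()) {
        int w = q.front(); q.pop();
        if (dist[w] == max_dist)
            continue;
        for (int x : adj[w])
            if (-1 == dist[x]) {
                dist[x] = dist[w] + 1;
                q.push(x);
                reached.push_back(x);
            }
    }
    return reached;
}
```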

Long-term I'd like to implement a large sparse version of the solver that does a subset of the reasoning that can be done efficiently in terms of edges rather than vertices, but this will probably be a year's worth of work and I'd have to find funding for it...

lenksix commented 3 years ago

I understand the scalability problem on larger datasets. I will probably try to run a local traversal around the edge and collect only the part of the graph where the subgraphs I am interested in can occur, as you suggested (if I understood correctly). Since the target often has a very small diameter, no more than 3, the portion of the network to be explored can shrink substantially.

As a personal comment, I find this repo very well done; many other subgraph enumerators published in recent years do not release their code or do not provide adequate documentation, so they are difficult to test. Thanks!