Open nruemmele opened 5 years ago
Hi @nruemmele! Thank you for your requests. I will try to explain why we design some features and why its not possible to include some of the proposed requests.
A Flink job is designed to be executed from source to sink as whole. To be able to persist intermediate results is a (commonly wished) feature but actual hard to realize. For example if you design a flink job with two sinks, the job will be executed twice. To persist the results after each iteration would mean, that we create a standalone flink job for each iteration in the mining process. The overall runtime would increase due to planning, scheduling and communication before and after each job + the time needed to calculate the step itself. We would like to provide a feature like that, but we are bounded due to limitations of the flink batch api.
You're right. At the beginning of the job the calculated 1-edge pattern are balanced evenly across the worker of the cluster. After that the growing, counting and filter steps are starting. Since some branches of the search tree contain no frequent children, some of the workers are entering a idle mode. This behavior occurs because the child's of once frequent parent patterns will be calculated on the same machine without communication to other instances in the cluster. Flink's rebalance
won't have any kind of impact because there are not enough working branches left to be executed in parallel. We face this problem in several algorithms in a distributed environment e.g Graph Grouping.
Well, i'm glad i can help you with that propose. I will create an issue about that. Two options here: Either i take your changes and create a pull request myself or you open a pull request and we give you some information about how you can fix your style. Like this you would become official contributor to Gradoop which is way cooler i think :)
If there are some more feature requests about DIMSpan or other parts of Gradoop feel free to tell us. We try to enhance the usage and functionality of Gradoop so your opinion means a lot.
I'm using often your implementation for frequent graph pattern mining, and I would appreciate the following extensions/improvements to your current implementation:
Persist intermediate results. At the moment I have to wait till the whole process is finished (either maxIterations is hit or the search space is empty). It would be great to persist intermediate results after each iteration is finished. Additionally, the persisted results could be used to recover in case of a failure. One could persist intermediate results in the form of a search tree: each node represents a pattern with its DFS code, the depth of the search tree specifies the iteration of the mining process, nodes are connected if one pattern is grown from the other. In the presence of the partial search tree, one could initialize the mining process not from the singletons, but rather from the persisted finished iteration of the search tree.
Re-balance workload after each iteration. I observe currently that the CPU usage decreases substantially with time. For example, the mining process starts with 98-99% CPU usage, but after 10 hours I observe that less than 20% is used. I haven't debugged it and don't know the internals of Flink well enough, but my guess is that the workload is distributed at the first iteration when 1-edge patterns are constructed, and with time when more and more of grown patterns become obsolete the dedicated workers to the corresponding branch of the search tree become idle. I tried to use Flink's rebalance method on some of the intermediate datasets in the mine method, but apparently it is not so straightforward...
User-provided maxSize/maxIterations to speed up the mining process. At the moment it is fixed as a constant in the code
private static final int MAX_ITERATIONS = 100;
I've introduced this feature on my fork, but haven't created a merge request since I broke your style and did not manage to fix it :)