Jokeren / gBolt

gBolt--very fast implementation for gSpan algorithm in data mining
BSD 2-Clause "Simplified" License
51 stars 14 forks source link

Program terminates abnormally for certain frequency threshold #13

Closed parnikadamle closed 7 years ago

parnikadamle commented 8 years ago

Hi Keren,

For the mushroom data available on uci machine learning repository, we have obtained the graph representation. Each record is now a graph with vertices and edges.

When I run your gspan code for this input with support=0.3 it runs fine. But when I run it with lesser support say 0.2 it terminates abnormally. It gives following error:-

terminate called after throwing an instance of 'std::bad_alloc' what(): std::bad_alloc Aborted (core dumped)

I also tried running with the same input, support=0.2 and single thread but it gives same error. I am attaching the input file I have used.

Can you please take a look in this issue and let me know accordingly?

mushroom_data_train_data.zip

Appreciate your help in advance.

Jokeren commented 8 years ago

No problem, I would take a look at your data.

parnikadamle commented 8 years ago

Hi Keren,

Thanks a lot :) On Feb 13, 2016 17:00, "Keren Zhou" notifications@github.com wrote:

No problem, I would take a look at your data.

— Reply to this email directly or view it on GitHub https://github.com/Jokeren/DataMining-gSpan/issues/13#issuecomment-183647107 .

Jokeren commented 8 years ago

Well, after running the program, I notice that the memory consumption is abnormally high.

Therefore I suspect that under the frequency of 0.2, a considerable number of frequent graphs are generated. I intend to revise the code and verify the place where the memory is released. In the way the program finally could be terminated in spite of long running time.

By the way, 1 year after I wrote the code, I realize plenty of shortcomings in the original version. Maybe I would refractory the whole project later.

parnikadamle commented 8 years ago

Hi Keren,

Thanks a lot for the excellent analysis. Yeah, memory is the issue. Please let me know once you fix the issue. On Feb 14, 2016 07:21, "Keren Zhou" notifications@github.com wrote:

Well, after running the program, I notice that the memory consumption is abnormally high.

Therefore I suspect that under the frequency of 0.2, a considerable number of frequent graphs are generated. I intend to revise the code and verify the place where the memory is released. In the way the program finally could be terminated in spite of long running time.

By the way, 1 year after I wrote the code, I realize plenty of shortcomings in the original version. Maybe I would refractory the whole project later.

— Reply to this email directly or view it on GitHub https://github.com/Jokeren/DataMining-gSpan/issues/13#issuecomment-183791584 .

Jokeren commented 8 years ago

Hi @parnika26 , have you tried the original version proposed by Yan?

https://www.cs.ucsb.edu/~xyan/software/gSpan.htm

Setting the minimum support as 0.2, the software(gSpan-64) cannot terminate in several minutes, and the memory consumption is extremely high at 8 threads.

Can you run the program and tell me whether it could terminate on your machine?

parnikadamle commented 8 years ago

Hi Keren,

I tried running the gSpan-64 by Yan for the same input with support=0.2. But the program got killed by the system in few minutes because of high memory consumption.

Then I tried running gSpan-32 by Yan. The program is running since 19 hours and has not been terminated yet. Already 23 GB of data has been generated by the program till now. But the program is not killed yet.

Please let me know if you get some clues for this issue.

On Mon, Feb 15, 2016 at 2:29 PM, Keren Zhou notifications@github.com wrote:

Hi @parnika26 https://github.com/parnika26 , have you tried the original version proposed by Yan?

https://www.cs.ucsb.edu/~xyan/software/gSpan.htm

Setting the minimum support as 0.2, the software(gSpan-64) cannot terminate in several minutes, and the memory consumption is extremely high at 8 threads.

Can you run the program and tell me whether it could terminate on your machine?

— Reply to this email directly or view it on GitHub https://github.com/Jokeren/DataMining-gSpan/issues/13#issuecomment-184120849 .

Regards Parnika Paranjape

Jokeren commented 8 years ago

When I was writing the program, I had compared running results of my program with that of gSpan-32 for different data sets. I still remember there's little difference.

Both gSpans (gSpan-32&gSpan-64) by Yan are writing results consecutively onto the disk, so I could check part of the output files to determine where is the problem.

parnikadamle commented 8 years ago

Hi Keren,

I even tried running your code for another data set 'Compound_422' with support=0.05 and got the same error

terminate called after throwing an instance of 'std::bad_alloc' what(): std::bad_alloc Aborted (core dumped)

But when I ran gSpan-32 by Yan for the same input and support=0.05 the program did terminate successfully and the execution time was 1475.59 sec.

Looking at above analysis, I think the memory consumption is high and the memory is not getting released at regular intervals in your code. Hence, the program gets aborted.

Can you please take a look into this issue and let me know accordingly ?

Appreciate your help in advance.

On Tue, Feb 16, 2016 at 1:20 PM, Keren Zhou notifications@github.com wrote:

When I was writing the program, I had compared running results of my program with that of gSpan-32 for different data sets. I still remember there's little difference.

Both gSpans (gSpan-32&gSpan-64) by Yan are writing results consecutively onto the disk, so I could check part of the output files to determine where is the problem.

— Reply to this email directly or view it on GitHub https://github.com/Jokeren/DataMining-gSpan/issues/13#issuecomment-184563818 .

Regards Parnika Paranjape

Jokeren commented 8 years ago

Please try the current refine branch. It has following features:

  1. new thread split strategy (not verified). you can changed it to the original version by configuring the Makefile.
  2. add mushroom data
  3. thread terminate logs
  4. reduce memory consumption.

I have tested the Compound_422 data with support=0.05 under 8 threads, and it terminates in 670 sec.

Actually, I think the initial configuration of thread split is not enough to reach a load balance. Latter on we have to push frequent patterns into a queue so that each thread can fetch their tasks from it.

parnikadamle commented 8 years ago

Hi Keren,

Thanks a lot for the help so far.

I tried running the refined code for Compound_422 data with support=0.05 under 4 threads and it successfully terminates. But when I ran the refined code for mushroom data with support=0.2 under 4 threads I got the same issue after running for some time.

terminate called after throwing an instance of 'std::bad_alloc' what(): std::bad_alloc Aborted (core dumped)

Can you please try to run the code for mushroom data set with support 0.2 or less and let me know if it terminates on your machine?

My machine specifications are as follows:-

RAM:- 4 GB OS:- 32 bit ubuntu 12.04 LTS Processor:- Core 2 Duo 3.16 GHz

On Wed, Feb 17, 2016 at 1:32 PM, Keren Zhou notifications@github.com wrote:

Please try the current refine branch. It has following features:

  1. new thread split strategy (not verified). you can changed it to the original version by configuring the Makefile.
  2. add mushroom data
  3. thread terminate logs
  4. reduce memory consumption.

I have tested the Compound_422 data with support=0.05 under 8 threads, and it terminates in 670 sec.

Actually, I think the initial configuration of thread split is not enough to reach a load balance. Latter on we have to push frequent patterns into a queue so that each thread can fetch their tasks from it.

— Reply to this email directly or view it on GitHub https://github.com/Jokeren/DataMining-gSpan/issues/13#issuecomment-185086615 .

Regards Parnika Paranjape

Jokeren commented 8 years ago

No problem, but at first I wonder whether your program with gspan-32 running mushroom data has terminated? I remember you said it has run for days?

parnikadamle commented 8 years ago

Hi Keren,

No. The program gspan-32 by Yan did not terminate on its own. I kept it running for approximately 2 days but it did not terminate. It generated almost 81 GB of output and program consumed lot of memory due to which system got very very slow.

And your newer version of code got terminated due to same memory issue within few minutes for mushroom data at support=0.2 and 4 threads on my machine.

So, can you please check the behavior of your modified code on your machine for mushroom data at 0.2 support and let me know the result?

Also, is there any way to deal with this memory issue for big data like mushroom and lesser threshold?

Appreciate your help in advance.

On Thu, Feb 18, 2016 at 3:22 PM, Keren Zhou notifications@github.com wrote:

No problem, but at first I wonder whether your program with gspan-32 running mushroom data has terminated? I remember you said it has run for days?

— Reply to this email directly or view it on GitHub https://github.com/Jokeren/DataMining-gSpan/issues/13#issuecomment-185632594 .

Regards Parnika Paranjape

Jokeren commented 8 years ago

Thank you for your reply!

Based on your observation, I think my program cannot terminate no matter what optimization have been done. In contrast to Yan's programs, my gspan program writes out results after the termination of whole mining process. In this way the memory consumption will be high until the results are released.

To solve the problem, maybe I can change the method of writing results. But there's still another deal left, my server's disk doesn't have enough space to store more 81 GB data. So I couldn't test it by myself.

With respect to your last question, the memory issue is easy to handle as I mentioned before. However, since the threshold is very low, it must generate a considerable number of patterns and lead to long mining process. Given your case, the problem is caused by the data, not the programs--gspan-32, gspan-64, and my gspan.

Please feel free to let me know if there should be any question. :)

Jokeren commented 8 years ago

Suddenly I come up with another solution. Maybe there are more efficient graph mining algorithms? Well, I am not specialized in the research area.

parnikadamle commented 8 years ago

Hi Keren,

There is a survey paper based on different graph mining algorithms. So, as per that paper gspan seems reasonable to me.

Huge data is one of the issues. But we need to work with such data. So, can you please give a try by modifying the method to write the results to disk along with graph mining process like Yan's approach instead of writing them after termination.

I can then test the new changes for mushroom data on my machine.

Appreciate your help in advance.

On Fri, Feb 19, 2016 at 9:38 PM, Keren Zhou notifications@github.com wrote:

Suddenly I come up with another solution. Maybe there are more efficient graph mining algorithms? Well, I am not specialized in the research area.

— Reply to this email directly or view it on GitHub https://github.com/Jokeren/DataMining-gSpan/issues/13#issuecomment-186279363 .

Regards Parnika Paranjape

Jokeren commented 8 years ago

OK, I will update the corresponding feature, but please wait for days. My holiday is over and I have to return to work.

parnikadamle commented 8 years ago

Hi Keren,

Sure, I will wait. Please let me know once you done with the changes.

Thank you so much for being so much helpful :)

On Mon, Feb 22, 2016 at 7:08 AM, Keren Zhou notifications@github.com wrote:

OK, I will update the corresponding feature, but please wait for days. My holiday is over and I have to return to work.

— Reply to this email directly or view it on GitHub https://github.com/Jokeren/DataMining-gSpan/issues/13#issuecomment-186965186 .

Regards Parnika Paranjape

parnikadamle commented 8 years ago

Hi Keren,

Just a gentle reminder. Any update on the corresponding feature that we discussed in previous mails?

Appreciate your help in advance.

On Mon, Feb 22, 2016 at 11:34 AM, parnika paranjape < parnikaparanjape@gmail.com> wrote:

Hi Keren,

Sure, I will wait. Please let me know once you done with the changes.

Thank you so much for being so much helpful :)

On Mon, Feb 22, 2016 at 7:08 AM, Keren Zhou notifications@github.com wrote:

OK, I will update the corresponding feature, but please wait for days. My holiday is over and I have to return to work.

— Reply to this email directly or view it on GitHub https://github.com/Jokeren/DataMining-gSpan/issues/13#issuecomment-186965186 .

Regards Parnika Paranjape

Regards Parnika Paranjape

Jokeren commented 8 years ago

Sorry, I am still busy dealing with assignments offered by my mentor. I must finish them first, or he will not be satisfied.

Therefore, I am so sorry that I can't guarantee when could I update the corresponding feature.

parnikadamle commented 8 years ago

Hi Keren,

Thanks for the update.

Sorry for the trouble but I intend to use your code for graph mining in my work. So, if you could give the desired modifications then it will be really helpful in my work.

Can you please try to modify the corresponding feature as soon as you done with your assignments?

Appreciate your help in advance.

On Thu, Mar 3, 2016 at 6:35 AM, Keren Zhou notifications@github.com wrote:

Sorry, I am still busy dealing with assignments offered by my mentor. I must finish them first, or he will not be satisfied.

Therefore, I am so sorry that I can't guarantee when could I update the corresponding feature.

— Reply to this email directly or view it on GitHub https://github.com/Jokeren/DataMining-gSpan/issues/13#issuecomment-191521995 .

Regards Parnika Paranjape

Jokeren commented 8 years ago

Thank's OK, I am trying to find a balance between tasks given by the mentor and my own activities.

parnikadamle commented 8 years ago

Hi Keren,

Sure, no problem. Please let me know when you are done with the corresponding feature.

Thanks a lot again. On Mar 4, 2016 5:34 AM, "Keren Zhou" notifications@github.com wrote:

Thank's OK, I am trying to find a balance between tasks given by the mentor and my own activities.

— Reply to this email directly or view it on GitHub https://github.com/Jokeren/DataMining-gSpan/issues/13#issuecomment-192028879 .

parnikadamle commented 8 years ago

Hi Keren,

How are you and your work? Just a gentle reminder. Did you get any chance to look into the issue and modify the code?

Please let me know since, my work has got stuck on this issue.

I understand that you have lot of work but can you please try to look into the issue as soon as possible?

Appreciate your help in advance.

On Fri, Mar 4, 2016 at 7:59 AM, parnika paranjape < parnikaparanjape@gmail.com> wrote:

Hi Keren,

Sure, no problem. Please let me know when you are done with the corresponding feature.

Thanks a lot again. On Mar 4, 2016 5:34 AM, "Keren Zhou" notifications@github.com wrote:

Thank's OK, I am trying to find a balance between tasks given by the mentor and my own activities.

— Reply to this email directly or view it on GitHub https://github.com/Jokeren/DataMining-gSpan/issues/13#issuecomment-192028879 .

Regards Parnika Paranjape

Jokeren commented 8 years ago

I would try my best, but can you tell me your deadline?

Jokeren commented 8 years ago

@parnika26 see develop branch. In this version, I set up a number to denote the maximum number of patterns remain in the memory. After the frequent patterns exceeding the number, it holds a global lock, and writes patterns to the disk.

By the way, I am really unsure whether the result is correct or not without any test.

parnikadamle commented 8 years ago

Hi Keren,

Sorry for the late response. I was not well for a long period so, replying late. I tested your developed branch code. As compared to previous version, current version consumes less memory.

But sometimes the code gives split thread error 'split thread error! size 1336, thread_size 4' for certain support values. I got this error when I ran the code for given data set at support >= 0.2. When support < 0.2, the code does not give this error.

Do you know the cause for this error and how to fix it? Please find attached the data set.

On Mon, Mar 21, 2016 at 9:00 AM, Keren Zhou notifications@github.com wrote:

@parnika26 https://github.com/parnika26 see develop branch. In this version, I set up a number to denote the maximum number of patterns remain in the memory. After the frequent patterns exceeding the number, it holds a global lock, and writes patterns to the disk.

By the way, I am really unsure whether the result is correct or not without any test.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/Jokeren/DataMining-gSpan/issues/13#issuecomment-199097561

Regards Parnika Paranjape

Jokeren commented 8 years ago

@parnika26 , according to the code, this is because there are less than 4 initial sets that have support >= 0.2.

For example, suppose we have 'A', 'B', 'C', 'D' as initial codes, but the number of all codes start with 'D' is less than 0.2 * total. In this way we only start with 'A', 'B' and 'C'.

Hence, to run the program with given dataset, we could set support < 0.2 such that all codes satisfy the condition. Or we could set thread_num less than 4.

parnikadamle commented 8 years ago

Hi Keren,

Thanks a lot for the quick response. I will then try to tun with less support or less number of threads. Thanks once again for your support so far :)

On Tue, Apr 19, 2016 at 8:26 PM, Keren Zhou notifications@github.com wrote:

@parnika26 https://github.com/parnika26 , according to the code, this is because there are less than 4 initial set that has support >= 0.2.

For example, suppose we have 'A', 'B', 'C', 'D' as the initial code, but the number of all codes start with 'D' is less than 0.2 * total. In this way we only start with 'A', 'B' and 'C'.

Hence, if we set support < 0.2, all codes satisfy the condition. Or we could set thread_num less than 4.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/Jokeren/DataMining-gSpan/issues/13#issuecomment-211964222

Regards Parnika Paranjape

Jokeren commented 8 years ago

@parnika26 , You are welcome! I will keep on maintaining this project. :)