Jokeren / gBolt

gBolt: a very fast implementation of the gSpan algorithm for data mining
BSD 2-Clause "Simplified" License

Does gbolt support input files larger than 5GB? #21

Closed VOID001 closed 5 years ago

VOID001 commented 5 years ago

Hello Jokeren, I am helping my friends use gbolt with a 5GB dataset, but it keeps giving a segmentation fault, and each time dmesg shows the crash in a different place; I am sorry but I cannot provide the core file this time. I am not familiar with data mining, so I went through the issues and found that the program may crash if my dataset is in the wrong format. However, I verified the dataset with a Python script (attached below) and did not find any holes in the graph: all vertex ids are continuous and the dataset numbering is also consistent. The dataset is huge (5GB), so I don't think I can upload it here. Could you give me a hint on why this problem may happen? Some useful log entries:

https://cfp.vim-cn.com/cbfv6/bash#n-496 https://cfp.vim-cn.com/cbfv6/bash#n-455

The argument I use is -support 0.6; the server has 80 CPU threads and I run with the default options. Thanks in advance

#!/usr/bin/env python

if __name__ == "__main__":
    with open("/path/to/dataset.txt", "r") as f:
        prev = 0
        prevline = ""
        dataSetStart = False
        for line in f:
            if line.startswith("t #"):  # a new graph begins
                dataSetStart = True
                print("Dataset {}".format(line))
                continue
            if line.startswith("v "):  # vertex line: v <id> <label>
                vertex_id = int(line.split(" ")[1])
                label = int(line.split(" ")[2])
                if label < 0 or vertex_id < 0:
                    print("Found invalid data:\n{}\n{}".format(prevline, line))
                if dataSetStart:
                    # first vertex of a graph: just remember it
                    dataSetStart = False
                    prev = vertex_id
                    prevline = line
                    continue
                if prev + 1 != vertex_id:  # vertex ids must be consecutive
                    print("Found invalid data:\n{}\n{}".format(prevline, line))
                prev = vertex_id
                prevline = line
            if line.startswith("e "):  # edge line: e <from> <to> <label>
                arr = line.split(" ")
                v1 = int(arr[1])
                v2 = int(arr[2])
                lbl = int(arr[3])
                if v1 < 0 or v2 < 0 or lbl < 0:
                    print("Found invalid data:\n{}\n{}".format(prevline, line))
VOID001 commented 5 years ago

By the way, I built the program on the master branch.

Jokeren commented 5 years ago

By the way, I built the program on the master branch.

The amount of memory used depends on the data. In some cases, peak memory consumption can be more than 200x the original input size.
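As a rough pre-flight check based on that ballpark figure, one could compare input size times the blow-up factor against physical RAM before starting a run. The helper names below are hypothetical, and the 200x factor is only a pessimistic ballpark, not a guarantee:

```python
import os

def estimate_peak_bytes(dataset_path, blowup=200):
    # Pessimistic peak-memory guess: input size times the ~200x
    # blow-up factor mentioned above. A ballpark, not a guarantee.
    return os.path.getsize(dataset_path) * blowup

def fits_in_ram(dataset_path):
    # Total physical memory via sysconf (Linux/macOS only).
    total = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return estimate_peak_bytes(dataset_path) <= total
```

With blowup=200, a 5 GB input already estimates to roughly 1 TB of peak memory, so the OOM reports later in this thread are not surprising under this heuristic.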

I am currently rolling out a new version that is faster and more memory efficient, but it won't be available until summer. As graph mining is not my research focus, I really don't have free time right now.

I am trying to schedule a phone call with your partner.

Jokeren commented 5 years ago

Hello Jokeren, I am helping my friends use gbolt with a 5GB dataset […]

Sometimes a memory error occurs not because of insufficient memory, but because the stack overflows due to recursive function calls, since some data structures are allocated on the stack.
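The stack-overflow scenario described above can be illustrated in Python (gBolt itself is C++, so this is only a sketch of the mechanism, not gBolt's code): a deep recursion that could exhaust a small default stack succeeds when a dedicated worker thread is given a larger one.

```python
import sys
import threading

def depth_counter(n):
    # Plain recursion; with a small stack this kind of call chain
    # is exactly what overflows and looks like a random segfault.
    if n == 0:
        return 0
    return 1 + depth_counter(n - 1)

def run_with_big_stack(target, *args, stack_bytes=512 * 1024 * 1024):
    # Give a dedicated worker thread a larger stack; stack_size()
    # must be set before the Thread object is created.
    result = []
    threading.stack_size(stack_bytes)
    sys.setrecursionlimit(1_000_000)  # lift Python's own recursion guard too
    worker = threading.Thread(target=lambda: result.append(target(*args)))
    worker.start()
    worker.join()
    return result[0]

print(run_with_big_stack(depth_counter, 50_000))  # prints 50000
```

For a C++ program like gBolt, the analogous knobs are the process stack limit (`ulimit -s`) and the per-thread stack size passed to the threading runtime.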

VOID001 commented 5 years ago

The amount of memory used depends on the data. […]

Thanks for the quick response! Yeah, I found he got OOM on a 1TB machine when running the program (on a 40GB dataset :O). So maybe even a 5GB dataset cannot meet the memory consumption requirement :P

VOID001 commented 5 years ago

Hello Jokeren, I am helping my friends use gbolt with a 5GB dataset […]

Sometimes a memory error occurs not because of insufficient memory, but because the stack overflows due to recursive function calls, since some data structures are allocated on the stack.

Yeah, I will try increasing the default stack size, and first try to get a core file to look into. Thanks for your response :)

VOID001 commented 5 years ago

Should I use -O0 to debug the stack footprint, or can I just use -O3 and look at the stack backtrace? Could -O3 introduce possible flaws (optimizations, etc.)?

Jokeren commented 5 years ago

Should I use -O0 to debug the stack footprint, or can I just use -O3 and look at the stack backtrace? Could -O3 introduce possible flaws (optimizations, etc.)?

-O3 -g or -O2 -g or CMAKE_BUILD_TYPE Debug, whatever you want.

It's possible that compiler optimizations could affect correctness, but I don't think that applies to my code.

It's probably better to set the number of threads to be one while you are debugging.

Jokeren commented 5 years ago

If the data are not confidential, you guys can upload them somewhere on the internet so that I can test my new branch. I haven't tried any real-world data yet, though the Facebook guys told me gBolt ran correctly and efficiently in their setting. With the help of your data, we can work together to make gBolt a better tool.

I probably know where the problems are, and I know exactly how to solve them. But again, I do not have time to fix them right now.

Sorry for the trouble. If you find it really hard to debug, please try other tools. I will let you know after releasing gBolt 1.0 and getting correct results on your data.

VOID001 commented 5 years ago

Thank you. I am trying to debug it, e.g., by measuring the stack depth and trying to make an auto-growing stack, and I will contact him to see if he can provide the dataset. If I cannot make it work, I might come back and ask for your help. I really appreciate your quick replies even though you are not currently working on this project. I have also recommended that my friend have a Skype voice chat with you. Hope the project keeps getting better :D