Jokeren / gBolt

gBolt--very fast implementation for gSpan algorithm in data mining
BSD 2-Clause "Simplified" License
52 stars 14 forks source link

Do vertices need to be sequentially ordered in input file? (possible bug) #20

Closed andrewguy closed 5 years ago

andrewguy commented 5 years ago

Hi Jokeren,

Love your work with this tool, it's been very useful so far.

I've come across what could be a bug. If the nodes don't appear sequentially in the input file, then some subgraphs aren't identified when they should be. Is there a requirement for all nodes to appear sequentially (according to the node id) within the input file? If so, this should probably be documented somewhere.

I've attached two small example input files, one with nodes ordered properly, one without. They yield quite different output files (also attached). The attached output files were generated with /path/to/gbolt -input_file ./input.lg -support 0.8 -output_file ./output.lg -parent true -nodes true -pattern true.

Cheers, Andrew

example_files.zip

Jokeren commented 5 years ago

“Is there a requirement for all nodes to appear sequentially (according to the node id) within the input file?”

Yes, they should be sequentially ordered without holes. The graph structure uses an array (vector) to store all the nodes and take indices of the array to fetch nodes.

You are right, it's better to document this specification somewhere. I do not want to add implementations to support reading different input formats which may reduce performance somehow. At the beginning of this project, I just followed Yan's input format and did not support even graph ids with holes...

Jokeren commented 5 years ago

Now, I am considering to rollback my graph id with holes feature and instead add an input format specification. I have seen many people getting problems with the input, and I do not have time to adjust implementations for everyone.

Thanks for your feedback!

andrewguy commented 5 years ago

Great to know, thanks. I agree, proper input format specification is all that is needed. If people want to deal with other input formats (such as myself), then a small wrapper script will do the job.

I have to admit, the lack of input format specification did put me off using gBolt initially - it was only after using Yan's tool that I came back to this.

One other question - is there any similar requirement for the ordering of the edges? Or can these be ordered randomly?

Jokeren commented 5 years ago

One other question - is there any similar requirement for the ordering of the edges? Or can these be ordered randomly?

You just need to denote all the edges after all the vertices of the same graph.

andrewguy commented 5 years ago

You just need to denote all the edges after all the vertices of the same graph.

Great, that's what I've been doing.

I'm closing this issue as I think it's been adequately resolved - I just need to make sure all input nodes are sequential and without holes when writing input files.