GraphChi / graphchi-cpp

GraphChi's C++ version. Big Data - small machine.
https://www.usenix.org/system/files/conference/osdi12/osdi12-final-126.pdf

Implementing new graph file parsers for graphchi #1

Closed clstaudt closed 11 years ago

clstaudt commented 11 years ago

I would like to try graphchi with a collection of graphs in the so-called METIS format, a simple adjacency list format, which is however not the same as the adjacency lists already supported.

http://www.cc.gatech.edu/dimacs10/downloads.shtml

The Introduction to Example Applications states that "it is fairly easy to write your own parsers", but it is not apparent how this works. Looking at the source code did not get me far. The documentation should include some hints on how to create a new parser.

akyrola commented 11 years ago

The documentation is quite sparse, admittedly.

Here are some tips on how to get started:

The parsers are implemented in src/preprocessing/conversions.hpp https://github.com/GraphChi/graphchi-cpp/blob/master/src/preprocessing/conversions.hpp

First, look around line 541 to see how the parsers for the different filetypes are called.

Then you need to write your own parser method similar to convert_adjlist(basefilename, sharderobj), which starts on line 285.

After that, you should be done. You just need to specify "--filetype=myfiletype" on the command line, where myfiletype is the identifier for the format you want to implement.
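To make the parser step concrete, here is a minimal sketch of reading the METIS adjacency format. It is not the code from conversions.hpp: `parse_metis`, `EdgeSink` and `vid_t` are hypothetical names chosen for illustration, and the callback stands in for whatever call convert_adjlist uses to hand edges to the sharder. The sketch assumes an unweighted graph, a header line of the form "n m", comment lines starting with '%', and 1-based neighbor ids.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical types for this sketch; not part of graphchi-cpp.
using vid_t = std::uint32_t;
using EdgeSink = std::function<void(vid_t, vid_t)>;

// Reads an unweighted METIS graph from `in` and calls add_edge(u, v) for every
// adjacency entry, using 0-based vertex ids. Returns the number of entries
// emitted (each undirected edge appears twice in the file).
std::size_t parse_metis(std::istream& in, const EdgeSink& add_edge) {
    std::string line;
    std::size_t n = 0, m = 0;
    bool have_header = false;

    // The first non-comment line is the header: "n m [fmt]".
    while (std::getline(in, line)) {
        if (!line.empty() && line[0] == '%') continue;
        std::istringstream hs(line);
        if (!(hs >> n >> m)) throw std::runtime_error("bad METIS header");
        int fmt = 0;
        hs >> fmt;  // optional; stays 0 when absent
        if (fmt != 0) throw std::runtime_error("weighted METIS files not handled");
        have_header = true;
        break;
    }
    if (!have_header) throw std::runtime_error("missing METIS header");

    // Line i after the header lists the neighbors of vertex i.
    // An empty line is a vertex without neighbors, so it still counts.
    std::size_t emitted = 0;
    vid_t u = 0;
    while (u < n && std::getline(in, line)) {
        if (!line.empty() && line[0] == '%') continue;
        std::istringstream ls(line);
        std::size_t v = 0;
        while (ls >> v) {
            if (v < 1 || v > n) throw std::out_of_range("neighbor id out of range");
            add_edge(u, static_cast<vid_t>(v - 1));
            ++emitted;
        }
        ++u;
    }
    return emitted;
}
```

Keeping the edge consumer behind a callback means the parsing logic can be tested on a small in-memory string before it is wired into the real converter.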

clstaudt commented 11 years ago

Thanks, this was very helpful. I think I might be able to implement a parser for the METIS format.

Is there a specific reason why the C++ standard library is so sparsely used for the parsers? Is it okay to work with std::ifstream, std::stringstream, std::getline etc?

Kind regards Christian Staudt

akyrola commented 11 years ago

It is ok to use C++ standard library. I just found the C-methods were a bit faster, and with billions of edges that can make a difference.
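For comparison, a C-style way to split one adjacency line into integers could look like the sketch below. `parse_ids_c` is a hypothetical helper written for this illustration, not a function from graphchi-cpp, and no claim is made here about how much faster it is than a std::stringstream loop; that would have to be measured on the actual data.

```cpp
#include <cstdlib>
#include <vector>

// Hypothetical helper: parses whitespace-separated unsigned integers from a
// NUL-terminated line using std::strtoul. Parsing stops at the first token
// that is not a number or at the end of the string.
std::vector<unsigned long> parse_ids_c(const char* s) {
    std::vector<unsigned long> ids;
    char* end = nullptr;
    for (;;) {
        unsigned long v = std::strtoul(s, &end, 10);
        if (end == s) break;  // no digits consumed: nothing left to parse
        ids.push_back(v);
        s = end;              // continue after the number just read
    }
    return ids;
}
```

The function works directly on the character buffer and does not construct a stream object per line, which is the kind of overhead the C-style approach avoids.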

clstaudt commented 11 years ago

Implemented the METIS format parser, see:

https://algohub.iti.kit.edu/parco/Prototypes/PLPgraphchi/changeset/1aa8e6ef1373f1eceef31694b050bff4e91be3aa

However, when trying to test the new format with the community detection example, I get a crash:

cls ~/workspace/Prototypes/graphchi-cpp $ ./bin/example_apps/communitydetection --filetype=metis file /Users/cls/workspace/Data/DIMACS/Clustering/pgp.graph
[filetype] => [metis]
INFO: conversions.hpp(convert_if_notexists:742): Did not find preprocessed shards for /Users/cls/workspace/Data/DIMACS/Clustering/pgp.graph
INFO: conversions.hpp(convert_if_notexists:744): (Edge-value size: 8)
INFO: conversions.hpp(convert_if_notexists:745): Will try create them now...
INFO: sharder.hpp(determine_number_of_shards:393): Determining number of shards automatically.
INFO: sharder.hpp(determine_number_of_shards:396): Assuming available memory is 800 megabytes.
INFO: sharder.hpp(determine_number_of_shards:397): (This can be defined with configuration parameter 'membudget_mb')
INFO: sharder.hpp(determine_number_of_shards:403): Determining maximum shard size: 100 MB.
INFO: sharder.hpp(determine_number_of_shards:416): Number of shards to be created: 2
INFO: sharder.hpp(execute_sharding:358): Max vertex id: 0
INFO: sharder.hpp(start_phase:488): Starting phase: 1
DEBUG: binary_adjacency_list.hpp(read_edges:133): 100%
Assertion failed: (a>0), function preada, file ./src/util/ioutil.hpp, line 50.
Abort trap: 6

This is hard for me to diagnose. Am I doing anything obviously wrong?

Kind regards Chris


akyrola commented 11 years ago

Sorry I had not noticed your message.

It seems your interim file is empty: see the message "Max vertex id: 0". You can send me the code and I am happy to have a look.

clstaudt commented 11 years ago

You should be able to view and pull the code from here: https://algohub.iti.kit.edu/parco/Prototypes/PLPgraphchi

Alternatively, I append the source file. Thank you for having a look at this.

Chris


akyrola commented 11 years ago

Hmm, I notice that none of the output your convert_metis writes to logstream is shown.

I don't see anything obviously wrong in your code. I suggest you add std::cout << "debug ... " << std::endl; to many places and hunt down why no edges are read from the file.

clstaudt commented 11 years ago

On 29.07.2013 at 19:06, Aapo Kyrola notifications@github.com wrote:

I don't see anything obviously wrong in your code. I suggest you add std::cout << "debug ... " << std::endl; to many places and hunt down why no edges are read from the file.

No edges are read from the file because, starting from the community detection example app, the control flow never reaches my convert_metis function. The main function of the example calls graphchi_init(argc, argv), which I suppose reads the --filetype=metis option. It then calls set_argc, puts the key-value pair into the configuration, and prints it, right? I guess get_option_string_interactive is supposed to retrieve the value. I cannot figure out where convert is actually called; the example only calls convert_if_notexists explicitly. Any idea how to fix this?

akyrola commented 11 years ago

convert_if_notexists calls convert.... what's happening there?

clstaudt commented 11 years ago

For some reason, it did not enter the if (!sharderobj.preprocessed_file_exists()) block. I tried it with a new file, and reading a graph in METIS format now seems to work. Community detection on a large web graph runs in 269 seconds.
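The "convert only if the output is missing" pattern behind that check can be sketched as follows. This is an illustration with hypothetical names (`file_exists`, `convert_if_missing`, and the marker path), not the logic of convert_if_notexists itself. One possible explanation for the behavior above, stated here as an assumption rather than a confirmed diagnosis, is that a leftover file from an earlier failed run made the existence check succeed, so the parser was skipped.

```cpp
#include <fstream>
#include <functional>
#include <string>

// Hypothetical helper: true if `path` can be opened for reading.
bool file_exists(const std::string& path) {
    std::ifstream f(path.c_str());
    return f.good();
}

// Runs `convert` only when `marker_path` does not exist yet.
// Returns true if the conversion was executed, false if it was skipped.
// Note that an empty or stale marker file also causes a skip.
bool convert_if_missing(const std::string& marker_path,
                        const std::function<void()>& convert) {
    if (file_exists(marker_path)) return false;
    convert();
    return true;
}
```

With a check like this, deleting the leftover output or switching to a fresh input file forces the conversion to run again.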

Are you interested in adding the parser code to graphchi?

Kind regards Chris


akyrola commented 11 years ago

Great! Just make a pull request and I will add it. Thanks!
