GraphChi / graphchi-cpp

GraphChi's C++ version. Big Data - small machine.
https://www.usenix.org/system/files/conference/osdi12/osdi12-final-126.pdf

Problems creating large shard file (> 2GiB) ? #6

Closed cryptomeme closed 11 years ago

cryptomeme commented 11 years ago

GraphChi 0.2.3 appears to have trouble creating shard files larger than 2 GiB. I have plenty of disk space for them, but it errors out. Are there any additional details I can provide?

Reducing the memory budget enough that only shards smaller than 2 GiB are created works around the issue.

Running GraphChi Connected Components program
INFO:     conversions.hpp(convert_if_notexists:767): Did not find preprocessed shards for nodes_20131001
INFO:     conversions.hpp(convert_if_notexists:768): (Edge-value size: 4)
INFO:     conversions.hpp(convert_if_notexists:769): Will try create them now...
INFO:     sharder.hpp(start_preprocessing:326): Starting preprocessing, shovel size: 1310720000
INFO:     conversions.hpp(convert_edgelist:221): Reading in edge list format!
DEBUG:    conversions.hpp(convert_edgelist:226): Read 10000000 lines, 180.039 MB
.
.
.
DEBUG:    conversions.hpp(convert_edgelist:226): Read 1590000000 lines, 29900.5 MB
INFO:     sharder.hpp(flush:152): Sorting shovel: nodes_201310014.1.shovel, max:1738496712
INFO:     sharder.hpp(flush:154): Sort done.nodes_201310014.1.shovel
ERROR:    ioutil.hpp(writea:129): Could not write 3435929520 bytes! error:Bad file descriptor
connectedcomponents_list: ./src/util/ioutil.hpp:130: void writea(int, T*, size_t) [with T = graphchi::edge_with_value<unsigned int>]: Assertion `false' failed.
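
The thread does not establish what caused this failure, so the following is only a general illustration. POSIX `write()` may transfer fewer bytes than requested, and on Linux a single call transfers at most 0x7ffff000 bytes (just under 2 GiB), so a request of 3435929520 bytes cannot complete in one call. A minimal sketch of a helper that loops until the whole buffer is written might look like this; `write_all` and `kMaxChunk` are hypothetical names, this is not GraphChi's `writea`, and it is not a confirmed fix for this issue.

```cpp
#include <algorithm>
#include <cerrno>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>
#include <unistd.h>

// Hypothetical helper (not GraphChi's writea): writes exactly `nbytes` from
// `buf` to `fd`, looping because write() may transfer fewer bytes than asked.
inline void write_all(int fd, const void* buf, std::size_t nbytes) {
    // Assumption: cap each request at 1 GiB, below the Linux per-call limit.
    const std::size_t kMaxChunk = static_cast<std::size_t>(1) << 30;
    const char* p = static_cast<const char*>(buf);
    std::size_t remaining = nbytes;
    while (remaining > 0) {
        const std::size_t want = std::min(remaining, kMaxChunk);
        const ssize_t n = ::write(fd, p, want);
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted before any data was written; retry
            throw std::runtime_error(std::string("write failed: ") + std::strerror(errno));
        }
        if (n == 0) {
            throw std::runtime_error("write made no progress");
        }
        p += n;
        remaining -= static_cast<std::size_t>(n);
    }
}
```

Called with an invalid descriptor, this helper throws immediately with the system's error text (for example "Bad file descriptor"), which separates a failed `open()` from a short write.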

akyrola commented 11 years ago

Hi Damon,

Thanks for the report. Which OS are you using? I wonder whether your filesystem supports files that large.

Aapo

On Sep 19, 2013, at 10:30 AM, Damon Buckwalter wrote: [quoted copy of the original report]

Aapo Kyrola Ph.D. student, http://www.cs.cmu.edu/~akyrola GraphChi: Big Data - small machine: http://graphchi.org twitter: @kyrpov

cryptomeme commented 11 years ago

I'm using CentOS 6.4 and kernel 2.6.32-358.el6.x86_64

My input file is 31 GiB, so I would expect large files to be OK. In fact, now that I look, GraphChi is emitting other files > 2 GiB, so I may have jumped to conclusions...

Let me make sure I can reproduce the problem and I will get back to you.

BTW, thanks for putting GraphChi out there! It's been a crucial tool for me to do connected components analysis. Even though the computational aspects are a bit 'magical' to me still, it does the job that other approaches can't with the scale of data I'm working with.

If you're ever in PDX, I owe you a few beers at least!

akyrola commented 11 years ago

By the way, for connected components analysis, if O(V) state fits in memory, you should use the new unionfind_connectedcomponents. It is MUCH faster because it requires only one pass over the edges. (In fact, you don't need GraphChi for union-find at all; a simple pass over the edges suffices.)

Aapo
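
The single-pass idea mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of GraphChi's `unionfind_connectedcomponents`: the names `UnionFind` and `connected_components` are hypothetical, vertex ids are assumed to be dense 32-bit integers in `[0, num_vertices)`, and the edges are taken from an in-memory vector, whereas a real run would stream them from the edge-list file.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical disjoint-set structure; keeps O(V) state in memory.
class UnionFind {
public:
    explicit UnionFind(std::size_t n) : parent_(n), rank_(n, 0) {
        std::iota(parent_.begin(), parent_.end(), 0u);  // each vertex is its own root
    }

    std::uint32_t find(std::uint32_t v) {
        while (parent_[v] != v) {
            parent_[v] = parent_[parent_[v]];  // path halving
            v = parent_[v];
        }
        return v;
    }

    void unite(std::uint32_t a, std::uint32_t b) {
        a = find(a);
        b = find(b);
        if (a == b) return;
        if (rank_[a] < rank_[b]) std::swap(a, b);  // attach the shallower tree
        parent_[b] = a;
        if (rank_[a] == rank_[b]) ++rank_[a];
    }

private:
    std::vector<std::uint32_t> parent_;
    std::vector<std::uint8_t> rank_;
};

// Returns one component label (the root vertex id) per vertex after a single
// pass over the edges.
inline std::vector<std::uint32_t> connected_components(
    std::size_t num_vertices,
    const std::vector<std::pair<std::uint32_t, std::uint32_t>>& edges) {
    UnionFind uf(num_vertices);
    for (const auto& e : edges) uf.unite(e.first, e.second);
    std::vector<std::uint32_t> label(num_vertices);
    for (std::size_t v = 0; v < num_vertices; ++v) {
        label[v] = uf.find(static_cast<std::uint32_t>(v));
    }
    return label;
}
```

Two vertices are in the same component exactly when their labels are equal. The memory cost is one 32-bit parent and one byte of rank per vertex, independent of the number of edges.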

On Sep 19, 2013, at 12:21 PM, Damon Buckwalter wrote: [quoted copy of the previous comment]
