discoproject / disco

a Map/Reduce framework for distributed computing
http://discoproject.org
BSD 3-Clause "New" or "Revised" License

ddfs chunk fails on import #557

Open · rcarmo opened this issue 10 years ago

rcarmo commented 10 years ago

I tried to import a variable-width data file into ddfs chunk via stdin, and it failed with the following message:

Traceback (most recent call last):
  File "/usr/bin/ddfs", line 437, in <module>
    DDFS(option_parser=OptionParser()).main()
  File "/usr/local/lib/python2.7/dist-packages/clx/__init__.py", line 168, in main
    return self.dispatch()
  File "/usr/local/lib/python2.7/dist-packages/clx/__init__.py", line 164, in dispatch
    self.cmd(self, *self.argv)
  File "/usr/local/lib/python2.7/dist-packages/clx/__init__.py", line 88, in __call__
    return self.function(program, *args)
  File "/usr/bin/ddfs", line 158, in chunk
    update=program.options.update)
  File "/usr/lib/python2.7/dist-packages/disco/ddfs.py", line 153, in chunk
    for n, chunk in enumerate(chunk_iter(reps))]
  File "/usr/lib/python2.7/dist-packages/disco/fileutils.py", line 42, in chunks
    out.append(record)
  File "/usr/lib/python2.7/dist-packages/disco/fileutils.py", line 83, in append
    self.hunk_write(pickle_dumps(record, 1))
  File "/usr/lib/python2.7/dist-packages/disco/fileutils.py", line 114, in hunk_write
    " is larger than max_record_size: " + str(self.max_record_size))
ValueError: Record of size 1262908 is larger than max_record_size: 1048576
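
For reference, the check that raises this is in disco/fileutils.py: hunk_write pickles each record and rejects it if the serialized form exceeds max_record_size. Here is a rough sketch of that guard (my approximation from the traceback, not disco's actual code; the out list is just a stand-in for the real output buffer):

from pickle import dumps as pickle_dumps

MAX_RECORD_SIZE = 1024 * 1024  # the max_record_size from the error above (1 MiB)

def hunk_write(out, record):
    # Serialize one record; refuse it if the pickled form is larger
    # than the maximum allowed record size.
    data = pickle_dumps(record, 1)
    if len(data) > MAX_RECORD_SIZE:
        raise ValueError("Record of size " + str(len(data)) +
                         " is larger than max_record_size: " + str(MAX_RECORD_SIZE))
    out.append(data)

So the error means a single record, not the whole file, came out larger than 1 MiB after serialization.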

The file format is essentially a set of UUIDs separated by commas, with a variable number of columns per record.

The file is fairly large, so it's hard to pinpoint exactly why this is failing, but I don't think any single line exceeds 1 MB.

jobs@master:~$ zcat /srv/jobs/dataset.gz | wc
136686074 136686074 45113278519

Any ideas?

pooya commented 10 years ago

@rcarmo It seems like disco is not detecting the newlines, so it tries to put everything into a single record and fails because of its size. Which command did you use to push the data into ddfs?
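
To illustrate the failure mode I mean (an illustrative simulation, not disco's actual reader; line_records and the limit constant are mine): a line-based reader yields one record per newline, so a stream whose newlines are never seen becomes one giant record that trips the 1 MiB limit.

import io

MAX_RECORD_SIZE = 1024 * 1024  # 1 MiB, matching the error above

def line_records(stream):
    # Yield one record per newline-terminated line, rejecting any single
    # line whose size exceeds the limit.
    for line in stream:
        if len(line) > MAX_RECORD_SIZE:
            raise ValueError("Record of size %d is larger than "
                             "max_record_size: %d" % (len(line), MAX_RECORD_SIZE))
        yield line

ok = io.BytesIO(b"a,b,c\n" * 3)    # newline-terminated input: three small records
bad = io.BytesIO(b"a," * 700000)   # ~1.4 MB with no newline: one oversized record

print(len(list(line_records(ok))))  # 3
try:
    list(line_records(bad))
except ValueError as e:
    print(e)  # same kind of error as in your traceback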

pooya commented 10 years ago

Also, just to make sure the lines are not too big, would you please run the following commands?

$ for i in $(zcat /srv/jobs/dataset.gz); do echo ${#i} >> /tmp/disco_tmp_sizes; done
$ sort -n /tmp/disco_tmp_sizes | tail

Assuming you have enough memory to store the sizes.
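
If memory does become a problem (the $(zcat ...) expansion above holds the whole decompressed output in the shell), a streaming equivalent in Python would work too. This is just a sketch, not part of disco; the path is your file from above, and it keeps only the ten largest line lengths:

import gzip
import heapq

longest = []  # ten largest line lengths seen so far, kept as a min-heap
with gzip.open('/srv/jobs/dataset.gz', 'rb') as f:
    for line in f:
        n = len(line.rstrip(b'\r\n'))
        if len(longest) < 10:
            heapq.heappush(longest, n)
        elif n > longest[0]:
            heapq.heapreplace(longest, n)

print(sorted(longest, reverse=True))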

rcarmo commented 10 years ago

I did a straight zcat dataset.gz | ddfs chunk data:dataset -. I'm now waiting for a wc -L to finish, which should take a while. But an earlier test import with zcat .. | head -n 100000 | ddfs chunk ... worked, and that sample should have included enough "atypical" records.

Will report back.