DigitalPebble / behemoth

Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.

Ingest times with CorpusGenerator #24

Closed butlermh closed 12 years ago

butlermh commented 13 years ago

I am seeing quite slow ingest times with CorpusGenerator - around an hour for the Enron dataset. It would be worth considering possible improvements.

jnioche commented 13 years ago

It would be good to use a profiler first in order to get a better understanding of what actually takes time (there are quite a few commercial products that have a demo mode and Eclipse integration).

Having the CorpusGenerator as a MapReduce job would be good anyway, provided that by default the input path given as argument is considered to be on the local filesystem, unless of course a different filesystem is explicitly specified (e.g. s3://xxxx). This can be tracked in a separate issue.
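As a sketch of the filesystem-defaulting part, assuming Hadoop's FileSystem API; the class and method names below are illustrative, not existing Behemoth code:

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative helper: treat a bare path as local, honour an explicit scheme.
public class InputPathResolver {

    public static FileSystem resolve(String input, Configuration conf) throws IOException {
        Path path = new Path(input);
        URI uri = path.toUri();
        if (uri.getScheme() == null) {
            // No scheme given: assume the path lives on the local filesystem
            return FileSystem.getLocal(conf);
        }
        // Scheme given (e.g. s3://bucket/corpus): let Hadoop resolve it
        return path.getFileSystem(conf);
    }
}
```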

butlermh commented 13 years ago

Here is the result of using HProf and HPJMeter. The bulk of the time is being spent in FileInputStream.readBytes.

java.hprof.txt: Call Graph Tree (CPU) (virtual times, partial data, data estimated)
 2 022 795 (100%) java.lang.reflect.Method.invoke
    2 022 795 (100%) sun.reflect.DelegatingMethodAccessorImpl.invoke
       2 022 795 (100%) sun.reflect.NativeMethodAccessorImpl.invoke
          2 022 795 (100%) sun.reflect.NativeMethodAccessorImpl.invoke0
             2 022 795 (100%) com.digitalpebble.behemoth.util.CorpusGenerator.main
                2 022 795 (100%, 1 other caller) com.digitalpebble.behemoth.util.CorpusGenerator.processFiles
                   2 022 795 (100%) java.io.File.listFiles
                      2 009 652 (100%) com.digitalpebble.behemoth.util.CorpusGenerator$PerformanceFileFilter.accept
                         1 680 312 (100%) java.io.FileInputStream.read
                            1 680 312 (100%) java.io.FileInputStream.readBytes
                           242 275 (100%) org.apache.hadoop.io.SequenceFile$Writer.append
                              242 275 (100%) org.apache.hadoop.io.SequenceFile$RecordCompressWriter.append
                                   89754 (100%) org.apache.hadoop.io.compress.CompressorStream.finish
                                   70727 (100%) java.io.DataOutputStream.flush
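For illustration, a minimal sketch of the pattern the profile points at, assuming a filter that reads each file inside accept() and appends it to a SequenceFile; the names and the BytesWritable value type are illustrative, not the actual PerformanceFileFilter code:

```java
import java.io.File;
import java.io.FileFilter;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

class ReadingFileFilter implements FileFilter {
    private final SequenceFile.Writer writer;

    ReadingFileFilter(SequenceFile.Writer writer) {
        this.writer = writer;
    }

    // All the work happens inside accept(), which is why it dominates the
    // call graph under File.listFiles.
    public boolean accept(File file) {
        if (file.isDirectory()) return true;
        try (FileInputStream in = new FileInputStream(file)) {
            byte[] content = new byte[(int) file.length()];
            int offset = 0;
            // FileInputStream.read -> readBytes is where most of the time goes
            while (offset < content.length) {
                int n = in.read(content, offset, content.length - offset);
                if (n < 0) break;
                offset += n;
            }
            writer.append(new Text(file.getName()), new BytesWritable(content));
        } catch (IOException e) {
            // skip unreadable files
        }
        return false;
    }
}
```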
butlermh commented 13 years ago

I experimented with three different ways of reading the input: the existing FileFilter, a plain implementation using FileInputStream and recursion, and an NIO implementation - see https://gist.github.com/1003053 for the code.

The performance figures were quite close, and I only ran the tests once, so they are not really conclusive.

FileFilter (original in CorpusGenerator): 33 mins 52 seconds
File: 33 mins 14 seconds
File NIO: 32 mins 7 seconds
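For reference, a minimal sketch of the kind of NIO read that was compared; the gist above has the actual code, and the helper below is illustrative only:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

class NioReader {
    // Read a whole file into a byte array via a FileChannel.
    static byte[] readFully(File file) throws IOException {
        FileInputStream in = new FileInputStream(file);
        try {
            FileChannel channel = in.getChannel();
            ByteBuffer buffer = ByteBuffer.allocate((int) channel.size());
            while (buffer.hasRemaining() && channel.read(buffer) >= 0) {
                // keep reading until the buffer is full or EOF is reached
            }
            return buffer.array();
        } finally {
            in.close();
        }
    }
}
```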

butlermh commented 13 years ago

Even with a threaded / NIO implementation of CorpusGenerator (https://gist.github.com/1005347) using 80 threads, the time varies between 30 and 33 minutes. However, for the Enron corpus this creates segments of 8-9 Mbytes, which is significantly below the HDFS block size. So it is a small saving of three minutes, but performance will suffer elsewhere.

The only other possibility is to allow ingest from zip / bzip / gzip / tar files. Forqlift does this: it ingests the Enron corpus directly from the .tgz in around 3 minutes. It should be fairly straightforward to reuse some of the Forqlift code in CorpusGenerator.
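As a sketch of what archive ingest could look like, assuming Apache Commons Compress for tar handling and a SequenceFile keyed by entry name; this is illustrative, not the Forqlift or Behemoth code:

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

class TgzIngest {
    static void ingest(String tgzPath, SequenceFile.Writer writer) throws IOException {
        TarArchiveInputStream tar = new TarArchiveInputStream(
                new GZIPInputStream(new BufferedInputStream(new FileInputStream(tgzPath))));
        try {
            TarArchiveEntry entry;
            while ((entry = tar.getNextTarEntry()) != null) {
                if (entry.isDirectory()) continue;
                byte[] content = new byte[(int) entry.getSize()];
                int offset = 0;
                while (offset < content.length) {
                    int n = tar.read(content, offset, content.length - offset);
                    if (n < 0) break;
                    offset += n;
                }
                // one SequenceFile record per archive entry, keyed by its path
                writer.append(new Text(entry.getName()), new BytesWritable(content));
            }
        } finally {
            tar.close();
        }
    }
}
```

Reading sequentially from a single compressed stream avoids the per-file open/read/close overhead that dominated the profile above.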

butlermh commented 13 years ago

This commit https://github.com/butlermh/behemoth/commit/9fabeee5003419930b8e03852a9b7ab68012ff41 adds functionality so it is possible to build a corpus from a .zip, .tar, .tgz or .bz2 archive. In the case of the Enron corpus, this lowers ingest time from 33 minutes to 3 minutes.