DigitalPebble / behemoth

Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.

Ingest times with CorpusGenerator #24

Closed butlermh closed 12 years ago

butlermh commented 13 years ago

I am seeing quite slow ingest times with CorpusGenerator - around an hour for the Enron dataset. It would be worth considering possible improvements.

jnioche commented 13 years ago

It would be good to use a profiler first in order to get a better understanding of what actually takes time (there are quite a few commercial products that have a demo mode and Eclipse integration).

Having the CorpusGenerator as a MapReduce job would be good anyway, provided that by default the input path given as argument is considered to be on the local filesystem, unless of course a different filesystem is explicitly specified (e.g. s3://xxxx). This can be tracked in a separate issue.
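As a sketch of the filesystem-defaulting part, assuming Hadoop's FileSystem API; the class and method names below are illustrative, not existing Behemoth code:

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative helper: treat a bare path as local, honour an explicit scheme.
public class InputPathResolver {

    public static FileSystem resolve(String input, Configuration conf) throws IOException {
        Path path = new Path(input);
        URI uri = path.toUri();
        if (uri.getScheme() == null) {
            // No scheme given: assume the path lives on the local filesystem
            return FileSystem.getLocal(conf);
        }
        // Scheme given (e.g. s3://bucket/corpus): let Hadoop resolve it
        return path.getFileSystem(conf);
    }
}
```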

butlermh commented 13 years ago

Here is the result of using HProf and HPJMeter. The bulk of the time is being spent in FileInputStream.readBytes.

java.hprof.txt: Call Graph Tree (CPU) (virtual times, partial data, data estimated)
 2 022 795 (100%) java.lang.reflect.Method.invoke
    2 022 795 (100%) sun.reflect.DelegatingMethodAccessorImpl.invoke
       2 022 795 (100%) sun.reflect.NativeMethodAccessorImpl.invoke
          2 022 795 (100%) sun.reflect.NativeMethodAccessorImpl.invoke0
             2 022 795 (100%) com.digitalpebble.behemoth.util.CorpusGenerator.main
                2 022 795 (100%, 1 other caller) com.digitalpebble.behemoth.util.CorpusGenerator.processFiles
                   2 022 795 (100%) java.io.File.listFiles
                      2 009 652 (100%) com.digitalpebble.behemoth.util.CorpusGenerator$PerformanceFileFilter.accept
                         1 680 312 (100%) java.io.FileInputStream.read
                            1 680 312 (100%) java.io.FileInputStream.readBytes
                           242 275 (100%) org.apache.hadoop.io.SequenceFile$Writer.append
                              242 275 (100%) org.apache.hadoop.io.SequenceFile$RecordCompressWriter.append
                                   89754 (100%) org.apache.hadoop.io.compress.CompressorStream.finish
                                   70727 (100%) java.io.DataOutputStream.flush
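For illustration, a minimal sketch of the pattern the profile points at, assuming a filter that reads each file inside accept() and appends it to a SequenceFile; the names and the BytesWritable value type are illustrative, not the actual PerformanceFileFilter code:

```java
import java.io.File;
import java.io.FileFilter;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

class ReadingFileFilter implements FileFilter {
    private final SequenceFile.Writer writer;

    ReadingFileFilter(SequenceFile.Writer writer) {
        this.writer = writer;
    }

    // All the work happens inside accept(), which is why it dominates the
    // call graph under File.listFiles.
    public boolean accept(File file) {
        if (file.isDirectory()) return true;
        try (FileInputStream in = new FileInputStream(file)) {
            byte[] content = new byte[(int) file.length()];
            int offset = 0;
            // FileInputStream.read -> readBytes is where most of the time goes
            while (offset < content.length) {
                int n = in.read(content, offset, content.length - offset);
                if (n < 0) break;
                offset += n;
            }
            writer.append(new Text(file.getName()), new BytesWritable(content));
        } catch (IOException e) {
            // skip unreadable files
        }
        return false;
    }
}
```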
butlermh commented 13 years ago

I experimented with three different ways of reading the input: the existing FileFilter, a plain implementation using FileInputStream and recursion, and an NIO implementation - see https://gist.github.com/1003053 for the code.

The performance figures were quite close, and I only ran the tests once, so they are not really conclusive.

FileFilter (original in CorpusGenerator): 33 mins 52 seconds
File: 33 mins 14 seconds
File NIO: 32 mins 7 seconds
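For reference, a minimal sketch of the kind of NIO read that was compared; the gist above has the actual code, and the helper below is illustrative only:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

class NioReader {
    // Read a whole file into a byte array via a FileChannel.
    static byte[] readFully(File file) throws IOException {
        FileInputStream in = new FileInputStream(file);
        try {
            FileChannel channel = in.getChannel();
            ByteBuffer buffer = ByteBuffer.allocate((int) channel.size());
            while (buffer.hasRemaining() && channel.read(buffer) >= 0) {
                // keep reading until the buffer is full or EOF is reached
            }
            return buffer.array();
        } finally {
            in.close();
        }
    }
}
```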

butlermh commented 13 years ago

Even with a threaded / NIO implementation of CorpusGenerator (https://gist.github.com/1005347) using 80 threads, the time varies between 30 and 33 minutes. However, for the Enron corpus this creates segments of 8-9 Mbytes, which is significantly below the HDFS block size. So it is a small saving of three minutes, but performance will suffer elsewhere.

The only other possibility is to allow ingest from zip / bzip / gzip / tar files. Forqlift does this: it ingests the Enron corpus directly from the .tgz in around 3 minutes. It should be fairly straightforward to reuse some of the Forqlift code in CorpusGenerator.
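As a sketch of what archive ingest could look like, assuming Apache Commons Compress for tar handling and a SequenceFile keyed by entry name; this is illustrative, not the Forqlift or Behemoth code:

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

class TgzIngest {
    static void ingest(String tgzPath, SequenceFile.Writer writer) throws IOException {
        TarArchiveInputStream tar = new TarArchiveInputStream(
                new GZIPInputStream(new BufferedInputStream(new FileInputStream(tgzPath))));
        try {
            TarArchiveEntry entry;
            while ((entry = tar.getNextTarEntry()) != null) {
                if (entry.isDirectory()) continue;
                byte[] content = new byte[(int) entry.getSize()];
                int offset = 0;
                while (offset < content.length) {
                    int n = tar.read(content, offset, content.length - offset);
                    if (n < 0) break;
                    offset += n;
                }
                // one SequenceFile record per archive entry, keyed by its path
                writer.append(new Text(entry.getName()), new BytesWritable(content));
            }
        } finally {
            tar.close();
        }
    }
}
```

Reading sequentially from a single compressed stream avoids the per-file open/read/close overhead that dominated the profile above.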

butlermh commented 13 years ago

This commit https://github.com/butlermh/behemoth/commit/9fabeee5003419930b8e03852a9b7ab68012ff41 adds functionality so it is possible to build a corpus from a .zip, .tar, .tgz or .bz2 archive. In the case of the Enron corpus, this lowers ingest time from 33 minutes to 3 minutes.