It would be good to run a profiler first to get a better understanding of what actually takes the time (there are quite a few commercial products that have a demo mode and Eclipse integration).
Having the CorpusGenerator as a MapReduce job would be good anyway, provided that by default the input path given as an argument is resolved against the local filesystem, unless a different filesystem is explicitly specified (e.g. s3://xxxx). This can be tracked in a separate issue.
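A minimal sketch of how that default could work, assuming the standard Hadoop FileSystem / Path API (the names below are illustrative, not taken from the actual code):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalByDefault {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        // No scheme given: qualify the path against the local filesystem.
        // An explicit scheme (e.g. s3://bucket/corpus or hdfs://nn/corpus)
        // picks the matching FileSystem implementation instead.
        if (input.toUri().getScheme() == null) {
            input = FileSystem.getLocal(conf).makeQualified(input);
        }
        FileSystem fs = input.getFileSystem(conf);
        System.out.println("Reading corpus from " + fs.getUri() + " at " + input);
    }
}
```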
Here are the results of profiling with HPROF and HPJMeter. The bulk of the time is spent in FileInputStream.readBytes:
```
java.hprof.txt: Call Graph Tree (CPU) (virtual times, partial data, data estimated)

2 022 795 (100%) java.lang.reflect.Method.invoke
2 022 795 (100%) sun.reflect.DelegatingMethodAccessorImpl.invoke
2 022 795 (100%) sun.reflect.NativeMethodAccessorImpl.invoke
2 022 795 (100%) sun.reflect.NativeMethodAccessorImpl.invoke0
2 022 795 (100%) com.digitalpebble.behemoth.util.CorpusGenerator.main
2 022 795 (100%, 1 other caller) com.digitalpebble.behemoth.util.CorpusGenerator.processFiles
2 022 795 (100%) java.io.File.listFiles
2 009 652 (100%) com.digitalpebble.behemoth.util.CorpusGenerator$PerformanceFileFilter.accept
1 680 312 (100%) java.io.FileInputStream.read
1 680 312 (100%) java.io.FileInputStream.readBytes
  242 275 (100%) org.apache.hadoop.io.SequenceFile$Writer.append
  242 275 (100%) org.apache.hadoop.io.SequenceFile$RecordCompressWriter.append
   89 754 (100%) org.apache.hadoop.io.compress.CompressorStream.finish
   70 727 (100%) java.io.DataOutputStream.flush
```
I experimented with three different ways of reading the input: the existing FileFilter, a plain implementation using FileInputStream and recursion, and a NIO implementation - see https://gist.github.com/1003053 for the code.
The performance figures were quite close, and I only ran the tests once, so they are not really conclusive.
- FileFilter (original in CorpusGenerator): 33 mins 52 seconds
- File: 33 mins 14 seconds
- File NIO: 32 mins 7 seconds
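For reference, the NIO variant boils down to something like the sketch below (a rough approximation, not the exact code in the gist): map the file with a FileChannel and copy the bytes out in one go, rather than looping over FileInputStream.read().

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class NioRead {
    // Read a whole file via a memory-mapped channel instead of repeated
    // FileInputStream.read() calls, which is where the profile above shows
    // most of the time going.
    static byte[] readFully(File file) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(file, "r");
        try {
            FileChannel channel = raf.getChannel();
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            byte[] bytes = new byte[buffer.remaining()];
            buffer.get(bytes);
            return bytes;
        } finally {
            raf.close();
        }
    }
}
```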
Even with a threaded / NIO implementation of CorpusGenerator (https://gist.github.com/1005347) using 80 threads, the time varies between 30 and 33 minutes. For the Enron corpus, however, this creates segments of 8-9 MB, which is significantly below the HDFS block size - so a small saving of three minutes, and performance will suffer elsewhere.
The only other possibility is to allow ingest from zip / bzip / gzip / tar files. Forqlift does this: it ingests the Enron corpus directly from the .tgz in around 3 minutes. It should be fairly straightforward to reuse some of the Forqlift code in CorpusGenerator.
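Something along these lines should work for the .tgz case, assuming Apache Commons Compress for the tar handling (the key/value types here are simplified - the real CorpusGenerator writes its own document records rather than raw bytes):

```java
import java.io.FileInputStream;
import java.util.zip.GZIPInputStream;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.utils.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class TgzIngest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
                new Path(args[1]), Text.class, BytesWritable.class);
        TarArchiveInputStream tar = new TarArchiveInputStream(
                new GZIPInputStream(new FileInputStream(args[0])));
        try {
            TarArchiveEntry entry;
            while ((entry = tar.getNextTarEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // Stream each archive entry straight into the SequenceFile,
                // avoiding one open/read/close cycle per small file on disk.
                byte[] content = new byte[(int) entry.getSize()];
                IOUtils.readFully(tar, content);
                writer.append(new Text(entry.getName()), new BytesWritable(content));
            }
        } finally {
            tar.close();
            writer.close();
        }
    }
}
```

Reading a single large archive sequentially should also play much better with the OS page cache than opening hundreds of thousands of small files one by one.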
This commit https://github.com/butlermh/behemoth/commit/9fabeee5003419930b8e03852a9b7ab68012ff41 adds functionality so that it is possible to build a corpus from a .zip, .tar, .tgz or .bz2 archive. In the case of the Enron corpus, this lowers the ingest time from 33 minutes to 3 minutes.
I am seeing quite slow ingest times with CorpusGenerator - around an hour for the Enron dataset. We should consider possible improvements.