NICTA / scoobi

A Scala productivity framework for Hadoop.
http://nicta.github.com/scoobi/

Incorrect reading of BZip2 input splits #312

Open ebastien opened 10 years ago

ebastien commented 10 years ago

When reading from a large bzip2 text file (i.e. larger than the HDFS block size), every mapper reads the whole text file instead of its assigned split. For instance, reading a 140MB bzip2 text file from HDFS with a block size of 128MB spawns two mappers, each of which reads the entire 140MB input file.
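To make the numbers concrete, here is a minimal model of block-aligned split computation in the style of Hadoop's `FileInputFormat.getSplits` (a simplified sketch for illustration, not Scoobi's or Hadoop's actual code; `SplitModel` and `blockAlignedSplits` are hypothetical names):

```scala
// Simplified model of block-aligned input splits: cut the file at
// multiples of the split size, with a shorter tail split for the remainder.
object SplitModel {
  case class Split(start: Long, length: Long)

  def blockAlignedSplits(fileSize: Long, splitSize: Long): Seq[Split] =
    (0L until fileSize by splitSize).map { s =>
      Split(s, math.min(splitSize, fileSize - s))
    }

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024L
    // A 140MB file with a 128MB split size yields two splits:
    // [0, 128MB) and [128MB, 140MB).
    blockAlignedSplits(140 * mb, 128 * mb).foreach(println)
  }
}
```

With these two splits, two mappers are spawned; the bug is that each of them then reads all 140MB instead of staying within its split.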

The input splits reported in the log file differ from those reported when reading the same input file with an equivalent Java job: Scoobi cuts the input file at the HDFS block boundary (0 to 128MB for the first split, then 128MB to 140MB for the second split), whereas the Java job cuts roughly in the middle of the input file (0 to 70MB and 70MB to 140MB).

I notice that Scoobi manipulates the InputFormat, the InputSplit and the RecordReader in the Source.read method (DataSource.scala#81). Could it be that Hadoop's bzip2 split logic is being bypassed at this point?
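For reference, the contract a split-aware record reader is expected to honour can be sketched with a toy model (an illustration of the expected behaviour only, not Hadoop's or Scoobi's actual implementation; `SplitReaderModel` and `readSplit` are hypothetical names): each record belongs to exactly one split, so the mappers together read each byte once rather than each reading the whole file.

```scala
// Toy model of the split contract for a line-oriented reader:
// a line is read by the split whose byte range [start, end)
// contains the line's first byte, so splits partition the records.
object SplitReaderModel {
  def readSplit(data: String, start: Int, end: Int): Seq[String] = {
    val lines  = data.split("\n", -1).toSeq
    // Byte offset at which each line begins (scanLeft yields one extra
    // element; zip below truncates it).
    val starts = lines.scanLeft(0)((off, line) => off + line.length + 1)
    lines.zip(starts).collect {
      case (line, off) if line.nonEmpty && off >= start && off < end => line
    }
  }
}
```

In this model, `readSplit(data, 0, 7)` and `readSplit(data, 7, 15)` over a 15-byte file return disjoint sets of lines covering the whole file, which is the behaviour the bzip2 splits in this issue fail to exhibit.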