LinkedInAttic / white-elephant

Hadoop log aggregator and dashboard

Hadoop jobs are compiled against hadoop-1.0.3 and will conflict with older versions #5

Closed: alexanderfahlke closed this issue 11 years ago

alexanderfahlke commented 11 years ago

The classic one:

ClassNotFoundException

There is at least one dependency on hadoop-1.0.3 (I guess you guys at LinkedIn are using this version).

In hadoop-0.20.2, hadoop-core.jar is located in the Hadoop base path, and running run.sh throws an exception:

log4j:ERROR Could not instantiate class [org.apache.hadoop.metrics.jvm.EventCounter].
java.lang.ClassNotFoundException: org.apache.hadoop.metrics.jvm.EventCounter
...
log4j:ERROR Could not instantiate appender named "EventCounter".
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/mapreduce/Job
        at com.linkedin.whiteelephant.ProcessLogs.<init>(ProcessLogs.java:60)
...
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.mapreduce.Job

If you copy hadoop-core.jar to the libs directory, you get the next exception:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat
        ...
        at com.linkedin.whiteelephant.parsing.ParseJobsFromLogs.execute(ParseJobsFromLogs.java:158)
        at com.linkedin.whiteelephant.ProcessLogs.run(ProcessLogs.java:72)
        at com.linkedin.whiteelephant.ProcessLogs.main(ProcessLogs.java:153)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat

I know that this was a dumb idea, but it was worth trying. ;)

The dependencies are declared in white-elephant/hadoop/config/ivy/ivy.xml:

<dependency org="org.apache.hadoop" name="hadoop-core" rev="1.0.3" conf="hadoop->default"/>
<dependency org="org.apache.hadoop" name="hadoop-tools" rev="1.0.3" conf="hadoop->default"/>

I would suggest naming the tested Hadoop version for the Hadoop jobs in the README.

Funny side note from README.md (Server section):

White Elephant does not assume a specific version of Hadoop, so the JARs are not packaged in the WAR. Therefore the path to the Hadoop JARs must be specified in the configuration.

So the White Elephant front end does not depend on a specific Hadoop version, but the jobs that generate the data do.

It seems the following two JIRA tickets describe the first problem: HADOOP-7055 and HADOOP-7577.

matthayes commented 11 years ago

The 1.0.3 dependency in ivy.xml is just there so the jobs have some version of Hadoop to compile against. I've actually tested it against a different 1.0.x version, but not 0.20.2 :) Although ivy.xml lists the 1.0.3 dependency, the Hadoop JARs are downloaded to a separate directory and are not included in the fat JAR. This is so you can use the JARs matching the version running on your cluster.

So it doesn't have a dependency on a specific version of Hadoop, but it does need to find the classes it was compiled against at runtime. Hadoop 0.20.2 apparently doesn't include these classes, so it isn't going to work. You should have success with any 1.0.x version, and perhaps later ones too, but I haven't tested those so I can't say.
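
A quick way to check whether the Hadoop JARs on your classpath contain the classes the jobs were compiled against is a small probe like the one below. This is a hypothetical helper, not part of White Elephant; the class names are taken from the stack traces above, and you would run it with your cluster's Hadoop JARs on the classpath.

// HadoopClasspathCheck.java: probe for the classes White Elephant's jobs
// were compiled against (hypothetical helper, not part of the project).
public class HadoopClasspathCheck {
    public static void main(String[] args) {
        String[] required = {
            "org.apache.hadoop.mapreduce.Job",
            "org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat",
            "org.apache.hadoop.metrics.jvm.EventCounter"
        };
        for (String name : required) {
            try {
                Class.forName(name);
                System.out.println("OK      " + name);
            } catch (ClassNotFoundException e) {
                System.out.println("MISSING " + name);
            }
        }
    }
}

Against a 1.0.x hadoop-core JAR all three should print OK; against 0.20.2 the CombineFileInputFormat probe should fail, matching the second stack trace above.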

I'll add a comment in the README regarding what version it was compiled against and expected compatibility.

matthayes commented 11 years ago

By the way, the reason for the dependency on CombineFileInputFormat is so that the job can combine the log files. CombinedTextInputFormat derives from it and is used for this purpose. Otherwise you get one mapper per log file, which can mean many mappers :)
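
For illustration, here is a minimal sketch of that pattern against the Hadoop 1.x mapreduce API: a CombineFileInputFormat subclass that packs many small log files into each split and reads them line by line through per-file record readers. This is only a sketch, not the project's actual CombinedTextInputFormat.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Sketch of a combining text input format: many small files per split,
// so one mapper processes many log files instead of one file each.
public class CombinedTextInputFormatSketch
        extends CombineFileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        // CombineFileRecordReader hands out one delegate reader per file
        // chunk contained in the combined split.
        return new CombineFileRecordReader<LongWritable, Text>(
                (CombineFileSplit) split, context, SingleFileLineReader.class);
    }

    // Reads one file chunk of the combined split. CombineFileRecordReader
    // instantiates it reflectively, so it must have exactly this
    // (CombineFileSplit, TaskAttemptContext, Integer) constructor.
    public static class SingleFileLineReader
            extends RecordReader<LongWritable, Text> {
        private final LineRecordReader delegate = new LineRecordReader();
        private final CombineFileSplit split;
        private final int index;

        public SingleFileLineReader(CombineFileSplit split,
                TaskAttemptContext context, Integer index) {
            this.split = split;
            this.index = index;
        }

        @Override
        public void initialize(InputSplit ignored, TaskAttemptContext context)
                throws IOException, InterruptedException {
            // Point the line reader at the single file this instance covers.
            delegate.initialize(new FileSplit(split.getPath(index),
                    split.getOffset(index), split.getLength(index),
                    split.getLocations()), context);
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            return delegate.nextKeyValue();
        }

        @Override
        public LongWritable getCurrentKey() {
            return delegate.getCurrentKey();
        }

        @Override
        public Text getCurrentValue() {
            return delegate.getCurrentValue();
        }

        @Override
        public float getProgress() throws IOException {
            return delegate.getProgress();
        }

        @Override
        public void close() throws IOException {
            delegate.close();
        }
    }
}

You would enable such a format with job.setInputFormatClass(...) and cap how much data lands in one mapper via the mapred.max.split.size property (the Hadoop 1.x name); without a cap, CombineFileInputFormat may pack a whole directory into very few splits.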