The description of the example stresses the use of MySQL. A default Hadoop 2.0.0-cdh4.2.1 installation uses PostgreSQL for the Hive metastore and Derby for Oozie, and that worked without problems for this example.
We didn’t need to install Flume manually either. In Cloudera Manager you can add Flume as a service. On the service’s page you can paste the contents of flume.conf under Configuration – Agent (Base), and on the same page you can set the agent name to TwitterAgent. When you put flume-sources-1.0-SNAPSHOT.jar in /usr/share/cmf/lib/plugins/, the jar is added to FLUME_CLASSPATH in /var/run/cloudera-scm-agent/process/<nnn>-flume-AGENT/flume-env.sh when the service is started.
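For reference, the part of flume.conf that ties the TwitterAgent name to the custom source looks roughly like this (the Twitter OAuth keys and the HDFS sink settings from the example are elided, and the source/channel/sink names are those the example uses):

  TwitterAgent.sources = Twitter
  TwitterAgent.channels = MemChannel
  TwitterAgent.sinks = HDFS
  TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
  TwitterAgent.sources.Twitter.channels = MemChannel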
However, one issue prevented us from using this service for the example: you have to add com.cloudera.flume.source.TwitterSource to flume.plugin.classes in flume-site.xml, otherwise you get a ClassNotFoundException. We haven’t found a way to do this via Cloudera Manager. When the service is started, a directory /var/run/cloudera-scm-agent/process/<nnn>-flume-AGENT is created, which includes flume-site.xml. When you restart the service via Cloudera Manager, a new directory is created with a different number for <nnn>. But after changing flume-site.xml in this directory you can use it to start Flume from the command line.
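For the record, the property we added by hand to flume-site.xml looks like this:

  <property>
    <name>flume.plugin.classes</name>
    <value>com.cloudera.flume.source.TwitterSource</value>
  </property>

After that change Flume can be started against the process directory along these lines (assuming the agent configuration in that directory is named flume.conf; <nnn> is whatever number the last start produced):

  flume-ng agent --conf /var/run/cloudera-scm-agent/process/<nnn>-flume-AGENT \
      --conf-file /var/run/cloudera-scm-agent/process/<nnn>-flume-AGENT/flume.conf \
      --name TwitterAgent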
Concerning the custom Flume source, it’s probably best to build it with the right values for hadoop.version (in our case 2.0.0-cdh4.2.1) and flume.version (1.3.0-cdh4.2.1) in pom.xml.
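In the pom.xml of flume-sources that comes down to something like the following, after which the usual mvn package produces flume-sources-1.0-SNAPSHOT.jar:

  <properties>
    <hadoop.version>2.0.0-cdh4.2.1</hadoop.version>
    <flume.version>1.3.0-cdh4.2.1</flume.version>
  </properties>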
We had some trouble with the time zone. In our case the time zone in coord-app.xml in oozie-workflows had to be changed to "Europe/Amsterdam", and tzOffset in job.properties had to be changed to 1; otherwise we got a mismatch between the directory in the WFINPUT parameter and the DATEHOUR parameter in the Oozie action configuration (viewed via the Oozie Web Console).
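Concretely, the two changes were along these lines (the other attributes of the coordinator-app element are elided):

  <coordinator-app ... timezone="Europe/Amsterdam" ...>   (in coord-app.xml)

  tzOffset=1   (in job.properties)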
We didn’t need to install the Oozie ShareLib in HDFS.
We used the Hue File Browser to create the necessary directories in HDFS.
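The same can of course be done from the command line, e.g. for the tweet directory the example writes to (the exact list of directories follows from the example’s setup instructions):

  hadoop fs -mkdir -p /user/flume/tweets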
It turned out that each time a Hive session is started, the ADD JAR statement has to be executed again.
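That is, each session has to begin with something like this (the path is illustrative and points to wherever the hive-serdes jar from the example ended up):

  ADD JAR /usr/lib/hive/lib/hive-serdes-1.0-SNAPSHOT.jar;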
We tried the example with the following software: Hadoop 2.0.0-cdh4.2.1 and Flume 1.3.0-cdh4.2.1, managed via Cloudera Manager, with PostgreSQL for Hive and Derby for Oozie.