Install and fire up the Sandbox using the instructions here: http://maprdocs.mapr.com/home/SandboxHadoop/c_sandbox_overview.html.
Use an SSH client such as Putty (Windows) or Terminal (Mac) to login. See below for an example: use userid: user01 and password: mapr.
For VMWare use: $ ssh user01@ipaddress
For Virtualbox use: $ ssh user01@127.0.0.1 -p 2222
You can build this project with Maven using IDEs like Eclipse or NetBeans, and then copy the JAR file to your MapR Sandbox, or you can install Maven on your sandbox and build from the Linux command line, for more information on maven, eclipse or netbeans use google search.
After building the project on your laptop, you can use scp to copy your JAR file from the project target folder to the MapR Sandbox:
use userid: user01 and password: mapr. For VMWare use: $ scp nameoffile.jar user01@ipaddress:/user/user01/.
For Virtualbox use: $ scp -P 2222 nameoffile.jar user01@127.0.0.1:/user/user01/.
Copy the data file from the project data folder to the sandbox using scp to this directory /user/user01/data/uber.csv on the sandbox:
For Virtualbox use: $ scp -P 2222 data/uber.csv user01@127.0.0.1:/user/user01/data/.
This example runs on MapR 5.2.1 with Spark 2.1
By default, Spark classpath doesn't contain spark-streaming-kafka-producer_2.11-2.1.0-mapr-1703.jar and spark-streaming-kafka-0-9_2.11-2.1.0-mapr-1703.jar
which you can get here:
http://repository.mapr.com/nexus/content/groups/mapr-public/org/apache/spark/spark-streaming-kafka-0-9_2.11/2.1.0-mapr-1703/spark-streaming-kafka-0-9_2.11-2.1.0-mapr-1703.jar
There are 3 ways:
note use: mapr-sparkml-streaming-uber-1.0-jar-with-dependencies.jar if you did not update the spark classpath
Step 0: Create the topics to read from (ubers) and write to (uberp) in MapR streams:
maprcli stream create -path /user/user01/stream -produceperm p -consumeperm p -topicperm p
maprcli stream topic create -path /user/user01/stream -topic ubers
maprcli stream topic create -path /user/user01/stream -topic uberp
to get info on the ubers topic : maprcli stream topic info -path /user/user01/stream -topic ubers
to delete topics:
maprcli stream topic delete -path /user/user01/stream -topic ubers
maprcli stream topic delete -path /user/user01/stream -topic uberp
Step 1: Run the Spark k-means program which will create and save the machine learning model:
spark-submit --class com.sparkml.uber.ClusterUber --master local[2] mapr-sparkml-streaming-uber-1.0.jar
you can also copy paste the code from ClusterUber.scala into the spark shell
$spark-shell --master local[2]
Alternative Step 1:
If you want to skip the machine learning and publishing part, the JSON results are in a file in the directory data/cluster.txt scp this file to /user/user01/data/cluster.txt
Then run the Java producer to produce messages with the topic and data file arguments:
java -cp mapr-sparkml-streaming-uber-1.0.jar:mapr classpath
com.streamskafka.uber.MsgProducer /user/user01/stream:uberp /user/user01/data/cluster.txt
To run the MapR Streams Java consumer with the topic to read from (uberp):
java -cp mapr-sparkml-streaming-uber-1.0.jar:mapr classpath
com.streamskafka.uber.MsgConsumer /user/user01/stream:uberp
Then you can skip to Step 4
After creating and saving the k-means model you can run the Streaming code:
Step 2: Run the MapR Streams Java producer to produce messages, run the Java producer with the topic (ubers) and data file arguments (uber.csv):
java -cp mapr-sparkml-streaming-uber-1.0.jar:mapr classpath
com.streamskafka.uber.MsgProducer /user/user01/stream:ubers /user/user01/data/uber.csv
Optional: run the MapR Streams Java consumer to see what was published :
java -cp mapr-sparkml-streaming-uber-1.0.jar:mapr classpath
com.streamskafka.uber.MsgConsumer /user/user01/stream:ubers
Step 3: Run the Spark Consumer Producer with the arguments: path to model, subscribe topic, publish topic : (in separate consoles if you want to run at the same time)
spark-submit --class com.sparkkafka.uber.SparkKafkaConsumerProducer --master local[2] \ mapr-sparkml-streaming-uber-1.0-jar-with-dependencies.jar /user/user01/data/savemodel /user/user01/stream:ubers /user/user01/stream:uberp
Step 4: Run the Spark Consumer which consumes the enriched messages with the topic to read from (uberp):
spark-submit --class com.sparkkafka.uber.SparkKafkaConsumer --master local[2] mapr-sparkml-streaming-uber-1.0.jar /user/user01/stream:uberp
Step 5: Spark Streaming writing to MapR-DB HBase
Configure the HBase connector as documented here http://maprdocs.mapr.com/home/Spark/ConfigureSparkHBaseConnector.html
start the hbase shell and create a table: hbase shell create '/user/user01/db/uber', {NAME=>'data'}
Run the code to Read from MapR Streams and write to MapR-DB HBAse
spark-submit --class com.sparkkafka.uber.SparkKafkaConsumerWriteHBase --master local[2] \ mapr-sparkml-streaming-uber-1.0-jar-with-dependencies.jar /user/user01/stream:uberp "/user/user01/db/uber"
start the hbase shell and scan to see results: hbase shell scan '/user/user01/db/uber' , {'LIMIT' => 5}
Step 6: Spark Reading from MapR-DB HBase
Run the code to Read from MapR-DB HBAse into a spark dataframe
spark-submit --class com.sparkkafka.uber.SparkHBaseReadDF --master local[2] \ mapr-sparkml-streaming-uber-1.0-jar-with-dependencies.jar "/user/user01/db/uber"
to delete the ubers topic after using : maprcli stream topic delete -path /user/user01/stream -topic ubers
Other examples included:
from https://github.com/mapr/spark/tree/2.1.0-mapr-1703/examples/src/main/scala/org/apache/spark/examples/streaming
Spark Streaming Consuming from MapR-Streams:
spark-submit --class com.sparkkafka.uber.V09DirectKafkaWordCount --master local[2] mapr-sparkml-streaming-uber-1.0.jar /user/user01/stream:ubers
Spark Streaming publishing to MapR-Streams: spark-submit --class com.sparkkafka.uber.KafkaProducerExample --master local[2] mapr-sparkml-streaming-uber-1.0.jar /user/user01/stream:ubert