Closed: nastra closed this pull request 8 years ago.
@EnigmaCurry / @aboudreault can you guys review, please? Also, is benchmark.py
the right place for the Spark download / build / execute code to live?
1) It would be helpful to add a hook for selecting which branch of spark-cassandra-stress to build/run. This would make it easier to update and add patches to the tool during testing. Default could be master.
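A minimal sketch of what such a hook could look like in benchmark.py. The option name `--stress-branch`, the helper `build_clone_command`, and the repository URL are illustrative assumptions, not part of this PR:

```python
# Hypothetical sketch: a --stress-branch option defaulting to master, used
# to pick which branch of spark-cassandra-stress gets cloned and built.
import argparse

def build_clone_command(branch="master",
                        repo="https://github.com/datastax/spark-cassandra-stress.git"):
    """Return the git command that checks out only the requested branch."""
    return ["git", "clone", "--branch", branch, "--single-branch", repo]

parser = argparse.ArgumentParser()
parser.add_argument("--stress-branch", default="master",
                    help="branch of spark-cassandra-stress to build/run")
# Simulated CLI invocation for illustration:
args = parser.parse_args(["--stress-branch", "my-patched-branch"])
print(build_clone_command(args.stress_branch))
```

Defaulting to master keeps current behavior, while a test run can point at a patched branch without code changes.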
2) I would also turn on Spark-specific metric collection, which will grab all the Codahale metrics exposed by Spark. To set this up, copy dse/resources/spark/conf/metrics.properties.template to dse/resources/spark/conf/metrics.properties and enable the appropriate sink. In the past I've used the CsvSink, but we may be able to leverage the GraphiteSink; some exploration may be needed to get that working. In the case of Spark Streaming, it would be nice to be able to grab snapshots of the metrics reported in the Spark Streaming UI tab, but we may be able to rebuild those views from the raw data collected through the sinks.
Link to spark monitoring: http://spark.apache.org/docs/latest/monitoring.html
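For reference, enabling the CsvSink in metrics.properties looks roughly like the following. The property names come from Spark's metrics.properties.template; the period and output directory are just example values:

```
# Enable the CSV sink for all metric instances (master, worker, driver, executor)
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
*.sink.csv.period=10
*.sink.csv.unit=seconds
*.sink.csv.directory=/tmp/spark-metrics
```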
3) I would also recommend enabling event logging so that, in case an error occurs, we can access the Spark UI info of previously failed jobs. To do this we add the following to spark-defaults.conf:
spark.eventLog.enabled true
spark.eventLog.dir /path/to/existingEventLogDir
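As a rough sketch, the test harness could apply these two settings itself. The config path and log directory below are placeholders, not values from this PR:

```python
# Hypothetical helper: append the two event-log settings above to
# spark-defaults.conf and make sure the log directory exists, since Spark
# expects the event-log directory to be present before jobs start.
import os

def enable_event_logging(conf_path, log_dir):
    os.makedirs(log_dir, exist_ok=True)
    with open(conf_path, "a") as conf:
        conf.write("spark.eventLog.enabled true\n")
        conf.write("spark.eventLog.dir {}\n".format(log_dir))

enable_event_logging("/tmp/spark-defaults.conf", "/tmp/spark-events")
```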
@rocco408 thanks for the feedback. It's probably worth implementing all those suggestions in separate PRs.
LGTM, only a minor comment, and it needs a rebase.
@aboudreault rebased and resolved the merge conflict.
@rocco408 I will keep your suggestions on my plate and will come back to you once I find some time to implement them
This PR will add support for running `spark-cassandra-stress` if `dse` is selected in the product dropdown. For `spark-cassandra-stress`, the user needs to specify one particular node. On this node we will then download / build / execute `spark-cassandra-stress`.
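The download / build / execute flow on the selected node could be sketched as the command sequence below. The repository URL, the build tool, and the run invocation are all assumptions for illustration; the actual steps live in benchmark.py:

```python
# Illustrative sketch only: the three commands the harness might run on the
# chosen node. Build tool (sbt) and spark-submit arguments are assumptions.
def stress_commands(branch="master", workdir="/tmp/spark-cassandra-stress"):
    repo = "https://github.com/datastax/spark-cassandra-stress.git"
    return [
        ["git", "clone", "--branch", branch, repo, workdir],  # download
        ["sbt", "assembly"],                                  # build (assumed)
        ["dse", "spark-submit", "..."],                       # execute (elided args)
    ]

for cmd in stress_commands():
    print(" ".join(cmd))
```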