banzaicloud / spark-metrics

Spark metrics related custom classes and sinks (e.g. Prometheus)
Apache License 2.0

Not seeing executor metrics, only driver #51

Closed: Drewster727 closed this issue 4 years ago

Drewster727 commented 4 years ago

Describe the bug: Not seeing executor metrics (only driver).

Steps to reproduce the issue: Spark 2.3.0 / Hadoop 2.7

metrics.properties:

# Enable JVM metrics source for all instances by class name
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
#*.source.jvm.class=org.apache.spark.metrics.source.JvmSource

# Enable Prometheus for all instances by class name
driver.sink.prometheus.class=com.banzaicloud.spark.metrics.sink.PrometheusSink
# Prometheus pushgateway address
*.sink.prometheus.pushgateway-address-protocol=http
*.sink.prometheus.pushgateway-address=domain.com:9091/
*.sink.prometheus.period=10
*.sink.prometheus.unit=seconds
*.sink.prometheus.pushgateway-enable-timestamp=false
# Enable HostName in Instance instead of Appid (Default value is false i.e. instance=${appid})
*.sink.prometheus.enable-hostname-in-instance=true
*.sink.prometheus.labels=worker=somename,environment=test,type=hadoop,cluster=360
*.sink.prometheus.enable-dropwizard-collector=false
*.sink.prometheus.enable-jmx-collector=true
*.sink.prometheus.jmx-collector-config=jmxCollector.yml

# Enable jvm source for instance master, worker, driver and executor
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
master.source.executors.class=org.apache.spark.metrics.source.JvmSource
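
(Note on the file as pasted: the Prometheus sink class is registered only for the driver instance, so with exactly this configuration the executors would never push anything. A minimal sketch of also registering it for executors, reusing the class name above and the standard Spark instance prefixes:)

# Register the sink for all instances, not only the driver
*.sink.prometheus.class=com.banzaicloud.spark.metrics.sink.PrometheusSink
# ...or, to be explicit per instance:
# executor.sink.prometheus.class=com.banzaicloud.spark.metrics.sink.PrometheusSink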

jmxCollector.yml:

rules:

  # These come from the master
  # Example: master.aliveWorkers
  - pattern: "metrics<name=master\\.(.*)><>Value"
    name: spark_master_$1

  # These come from the worker
  # Example: worker.coresFree
  - pattern: "metrics<name=worker\\.(.*)><>Value"
    name: spark_worker_$1

  # These come from the application driver
  # Example: app-20160809000059-0000.driver.DAGScheduler.stage.failedStages
  - pattern: "metrics<name=(.*)\\.driver\\.(DAGScheduler|BlockManager|jvm)\\.(.*)><>Value"
    name: spark_driver_$2_$3
    type: GAUGE
    labels:
      app_id: "$1"

  # These come from the application driver
  # Emulate timers for DAGScheduler like messageProcessingTime
  - pattern: "metrics<name=(.*)\\.driver\\.DAGScheduler\\.(.*)><>Count"
    name: spark_driver_DAGScheduler_$2_count
    type: COUNTER
    labels:
      app_id: "$1"

  # HiveExternalCatalog is of type counter
  - pattern: "metrics<name=(.*)\\.driver\\.HiveExternalCatalog\\.(.*)><>Count"
    name: spark_driver_HiveExternalCatalog_$2_total
    type: COUNTER
    labels:
      app_id: "$1"

  # These come from the application driver
  # Emulate histograms for CodeGenerator
  - pattern: "metrics<name=(.*)\\.driver\\.CodeGenerator\\.(.*)><>Count"
    name: spark_driver_CodeGenerator_$2_count
    type: COUNTER
    labels:
      app_id: "$1"

  # These come from the application driver
  # Emulate timer (keep only count attribute) plus counters for LiveListenerBus
  - pattern: "metrics<name=(.*)\\.driver\\.LiveListenerBus\\.(.*)><>Count"
    name: spark_driver_LiveListenerBus_$2_count
    type: COUNTER
    labels:
      app_id: "$1"

  # Get Gauge type metrics for LiveListenerBus
  - pattern: "metrics<name=(.*)\\.driver\\.LiveListenerBus\\.(.*)><>Value"
    name: spark_driver_LiveListenerBus_$2
    type: GAUGE
    labels:
      app_id: "$1"

  # These come from the application driver if it's a streaming application
  # Example: app-20160809000059-0000.driver.com.example.ClassName.StreamingMetrics.streaming.lastCompletedBatch_schedulingDelay
  - pattern: "metrics<name=(.*)\\.driver\\.(.*)\\.StreamingMetrics\\.streaming\\.(.*)><>Value"
    name: spark_driver_streaming_$3
    labels:
      app_id: "$1"
      app_name: "$2"

  # These come from the application driver if it's a structured streaming application
  # Example: app-20160809000059-0000.driver.spark.streaming.QueryName.inputRate-total
  - pattern: "metrics<name=(.*)\\.driver\\.spark\\.streaming\\.(.*)\\.(.*)><>Value"
    name: spark_driver_structured_streaming_$3
    labels:
      app_id: "$1"
      query_name: "$2"

  # These come from the application executors
  # Example: app-20160809000059-0000.0.executor.threadpool.activeTasks (value)
  #  app-20160809000059-0000.0.executor.JvmGCtime (counter)
  - pattern: "metrics<name=(.*)\\.(.*)\\.executor\\.(.*)><>Value"
    name: spark_executor_$3
    type: GAUGE
    labels:
      app_id: "$1"
      executor_id: "$2"

  # Executors counters
  - pattern: "metrics<name=(.*)\\.(.*)\\.executor\\.(.*)><>Count"
    name: spark_executor_$3_total
    type: COUNTER
    labels:
      app_id: "$1"
      executor_id: "$2"

  # These come from the application executors
  # Example: app-20160809000059-0000.0.jvm.threadpool.activeTasks
  - pattern: "metrics<name=(.*)\\.([0-9]+)\\.(jvm|NettyBlockTransfer)\\.(.*)><>Value"
    name: spark_executor_$3_$4
    type: GAUGE
    labels:
      app_id: "$1"
      executor_id: "$2"

  - pattern: "metrics<name=(.*)\\.([0-9]+)\\.HiveExternalCatalog\\.(.*)><>Count"
    name: spark_executor_HiveExternalCatalog_$3_count
    type: COUNTER
    labels:
      app_id: "$1"
      executor_id: "$2"

  # These come from the application driver
  # Emulate histograms for CodeGenerator
  - pattern: "metrics<name=(.*)\\.([0-9]+)\\.CodeGenerator\\.(.*)><>Count"
    name: spark_executor_CodeGenerator_$3_count
    type: COUNTER
    labels:
      app_id: "$1"
      executor_id: "$2"
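
(To make the executor rules concrete, here is a worked trace of the jvm|NettyBlockTransfer rule above, using a hypothetical bean name shaped like the earlier examples; the JMX collector sanitizes the remaining dots in the captured name to underscores:)

input:  metrics<name=app-20160809000059-0000.0.jvm.heap.used><>Value
output: spark_executor_jvm_heap_used{app_id="app-20160809000059-0000",executor_id="0"}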

I see metrics flowing into the push gateway appropriately, but only driver metrics, no executor metrics...

spark_driver_jvm_total_used
spark_driver_jvm_total_max
spark_driver_jvm_total_init
....
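
(A quick way to check which series actually reached the gateway, assuming the pushgateway address from the config above, is to scrape its /metrics endpoint directly:)

curl -s http://domain.com:9091/metrics | grep '^spark_executor'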

I found a post on a Cloudera forum from someone with a similar set of Spark/Hadoop versions, but no answers: https://community.cloudera.com/t5/Support-Questions/Spark-metrics-sink-doesn-t-expose-executor-s-metrics/td-p/281915

I see the same problem with the Graphite metrics sink built into Spark, and it occurs whether we're in yarn cluster mode or local spark-submit mode.

Can anyone explain what I'm doing wrong here?

Thanks, Drew

stoader commented 4 years ago

Are you seeing PrometheusSink cannot be instantiated errors in the executor logs? Others have hit this issue in the past because Yarn did not deploy the spark-metrics.jar to executor nodes by the time the metrics system was initialised within the executor. The solution was to copy spark-metrics.jar and all its dependencies to the executor nodes upfront, before the executors are started: https://github.com/banzaicloud/spark-metrics/issues/30#issuecomment-492301988
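
(A minimal sketch of that upfront copy, with hypothetical host names, source path, and target directory -- adjust all three to the actual cluster:)

for host in worker1 worker2 worker3; do
  ssh "$host" mkdir -p /opt/prometheus/jars
  scp /path/to/deps/*.jar "$host":/opt/prometheus/jars/
done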

Drewster727 commented 4 years ago

@stoader I am digging in to see if this error is appearing in the executor logs; I have not seen it yet. I'll report back and close this out if so. Thanks.

Drewster727 commented 4 years ago

@stoader Even when I take yarn out of the equation and just run locally via spark-submit, I still only get driver metrics. I ran through the suggestion from @mitchelldavis here: https://github.com/banzaicloud/spark-metrics/issues/30#issuecomment-492301988 i.e. I created a pom.xml and then manually specified the jars via the --jars option. I see no errors and no indication of anything failing to load.

spark-submit --proxy-user livy 
             --conf spark.metrics.namespace=drew 
             --conf "spark.executor.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=7777" 
             --files C:\Spark_Testing\log4j.xml 
             --jars C:\Spark_Testing\collector-0.12.0.jar,C:\Spark_Testing\metrics-core-3.1.2.jar,C:\Spark_Testing\simpleclient-0.3.0.jar,C:\Spark_Testing\simpleclient_dropwizard-0.3.0.jar,C:\Spark_Testing\simpleclient_pushgateway-0.3.0.jar,C:\Spark_Testing\spark-metrics_2.11-2.3-3.0.1.jar,C:\Spark_Testing\hive-jdbc-1.2.1000.2.6.5.84-2.jar,C:\Spark_Testing\hive-service-1.2.1000.2.6.5.84-2.jar,C:\Spark_Testing\spark-sql-kafka-0-10_2.11-2.3.0.2.6.5.84-2.jar,C:\Spark_Testing\spark-streaming-kafka-0-10_2.11-2.3.0.2.6.5.84-2.jar,C:\Spark_Testing\elasticsearch-spark-20_2.11-6.8.1.jar 
             --class com.test.Application C:\_code\scala-2.11\application.jar local

I also dug through any and all logs on my yarn cluster and could not see any complaints about not being able to instantiate the PrometheusSink.

Any thoughts or suggestions?

stoader commented 4 years ago

Can you see in your executor logs that the PrometheusSink is being instantiated?

Drewster727 commented 4 years ago

Hi @stoader, sorry for the late reply -- yes, after doing some further review I do see this in the logs, but only for the executor:

Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1887)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:293)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.banzaicloud.metrics.sink.PrometheusSink
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.spark.util.Utils$.classForName(Utils.scala:235)
    at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:198)
    at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:194)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
    at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
    at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
    at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
    at org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:194)
    at org.apache.spark.metrics.MetricsSystem.start(MetricsSystem.scala:102)
    at org.apache.spark.SparkEnv$.create(SparkEnv.scala:365)
    at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:201)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:228)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
    ... 4 more

Which is odd, considering the driver has access and I'm providing the jar repositories and packages:

application_spark_metrics_conf = "metrics.properties"
application_spark_metrics_namespace = "spark"
application_spark_jars_repositories = "http://repo.hortonworks.com/content/repositories/releases,https://raw.github.com/banzaicloud/spark-metrics/master/maven-repo/releases"
application_spark_jars_packages = "org.apache.spark:spark-streaming-kafka-0-10_2.11:2.3.2,org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.2,org.elasticsearch:elasticsearch-spark-20_2.11:6.8.1,com.banzaicloud:spark-metrics_2.11:2.3-3.0.1,io.prometheus:simpleclient:0.8.1,io.prometheus:simpleclient_dropwizard:0.8.1,io.prometheus:simpleclient_pushgateway:0.8.1,io.dropwizard.metrics:metrics-core:3.1.2"
application_files = ["/hdfs/path/log4j.xml","/hdfs/path/metrics.properties","/hdfs/path/jmxCollector.yml"]

All of those vars get sent to livy with the spark worker deployment. The files in the /hdfs/path/... paths all exist in HDFS and are accessible. Also -- if the answer is to just copy the jars manually... which ones, and where? Can I put them into the same HDFS path where my metrics.properties/jmxCollector.yml files are?
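
(For context, a rough sketch of how vars like these typically map onto a Livy batch request; the endpoint, application jar path, and the abridged conf values here are illustrative, not the actual deployment:)

curl -s -X POST http://livy-host:8998/batches \
  -H 'Content-Type: application/json' \
  -d '{
    "file": "hdfs:///path/application.jar",
    "className": "com.test.Application",
    "proxyUser": "livy",
    "files": ["/hdfs/path/log4j.xml", "/hdfs/path/metrics.properties", "/hdfs/path/jmxCollector.yml"],
    "conf": {
      "spark.metrics.conf": "metrics.properties",
      "spark.metrics.namespace": "spark",
      "spark.jars.repositories": "https://raw.github.com/banzaicloud/spark-metrics/master/maven-repo/releases",
      "spark.jars.packages": "com.banzaicloud:spark-metrics_2.11:2.3-3.0.1,io.prometheus:simpleclient:0.8.1"
    }
  }'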

Thoughts?

Thanks, Drew

Drewster727 commented 4 years ago

Also -- just tested dropping the following jars on each node and in HDFS:

hdfs dfs -ls /opt/prometheus/jars/
Found 6 items
-rw-r--r--   3 hdfs hdfs     112558 2020-01-28 17:17 /opt/prometheus/jars/metrics-core-3.1.2.jar
-rw-r--r--   3 hdfs hdfs     105245 2020-01-28 17:17 /opt/prometheus/jars/metrics-core-4.1.2.jar
-rw-r--r--   3 hdfs hdfs       5823 2020-01-28 17:17 /opt/prometheus/jars/simpleclient_common-0.8.1.jar
-rw-r--r--   3 hdfs hdfs      16319 2020-01-28 17:17 /opt/prometheus/jars/simpleclient_dropwizard-0.8.1.jar
-rw-r--r--   3 hdfs hdfs       9335 2020-01-28 17:17 /opt/prometheus/jars/simpleclient_pushgateway-0.8.1.jar
-rw-r--r--   3 hdfs hdfs     135208 2020-01-28 17:17 /opt/prometheus/jars/spark-metrics_2.11-2.3-3.0.1.jar

I told my livy/yarn job to look for these via the files setting.

Am I missing any there?

Drewster727 commented 4 years ago

I seem to be able to get further if I tell yarn/livy that it can look for jars via extraClassPath (/opt/prometheus/jars). Now it's throwing some snakeyaml dependency errors. Maybe I need to copy more jars out there? At this point I'm not sure I can rely on this... which is too bad :(
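
(If the snakeyaml errors are missing-class errors, the likely gap is org.yaml:snakeyaml, which the JMX collector's YAML config parsing needs -- the jar list in the eventual fix below includes snakeyaml-1.16.jar. One way to fetch it to the local repository:)

mvn dependency:get -DgroupId=org.yaml -DartifactId=snakeyaml -Dversion=1.16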

stoader commented 4 years ago

Which is odd, considering the driver has access and I'm providing the jar repositories and packages:

The only reason I can imagine, which others have reported as well, is that executors are slow to download the jars (compared to the driver), so the jars are not there by the time the executor initialises its metrics system (and it thus throws java.lang.ClassNotFoundException: org.apache.spark.banzaicloud.metrics.sink.PrometheusSink). The only solution to that is to copy all the necessary jars upfront to the nodes where the executors will run, before the executors are started.

You can download the spark-metrics jar and all its dependencies to a temp directory using the following steps:

  1. mvn dependency:get -DgroupId=com.banzaicloud -DartifactId=spark-metrics_2.11 -Dversion=2.3-2.1.0
  2. mkdir temp
  3. mvn dependency:copy-dependencies -f ~/.m2/repository/com/banzaicloud/spark-metrics_2.11/2.3-2.1.0/spark-metrics_2.11-2.3-2.1.0.pom -DoutputDirectory=$(pwd)/temp

Drewster727 commented 4 years ago

@stoader I did finally get it to work. I had to drop these jars into a specific directory on each node in the cluster, then tell livy/yarn to look there:

collector-0.12.0.jar
metrics-core-4.1.2.jar
simpleclient-0.8.1.jar
simpleclient_common-0.8.1.jar
simpleclient_dropwizard-0.8.1.jar
simpleclient_pushgateway-0.8.1.jar
snakeyaml-1.16.jar
spark-metrics_2.11-2.3-3.0.1.jar
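
(A minimal sketch of the "look there" part, assuming that same directory exists on every node; these are plain Spark conf keys, which livy passes through in its conf map:)

spark.driver.extraClassPath=/opt/prometheus/jars/*
spark.executor.extraClassPath=/opt/prometheus/jars/*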

My issue with this is that I have to manually maintain these jars on each node. Does anyone know how to get yarn to look in an HDFS path for these?

stoader commented 4 years ago

Did you also try dropping these jars into the same path where the standard Spark jars live (spark/jars)? That path is on Spark's class path.

I'm not sure if there is a way to tell Yarn to download the jars required by Spark executors from HDFS.
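
(One possibly relevant knob, not verified in this thread: on YARN, spark.yarn.jars accepts hdfs:// globs. Note that setting it replaces the default jar list, so the stock Spark jars must be included too -- both paths below are illustrative:)

spark.yarn.jars=local:/usr/hdp/current/spark2-client/jars/*,hdfs:///opt/prometheus/jars/*.jar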

Drewster727 commented 4 years ago

@stoader I did not try dropping the jars into the same path where the standard spark jars live. However, that would still require me to drop the jars on every node in the cluster, so I'm not sure I would gain anything there. Bummer on the HDFS part... that would be super handy, but I understand that's a yarn/spark issue.

Drewster727 commented 4 years ago

I'm closing this for now. Got it working by dropping jars on the nodes as outlined.

prakashatul1 commented 4 years ago

If you are running in cluster mode, you will get executor metrics from the other nodes where executors might be running... this worked for me.