CODAIT / spark-bench

Benchmark Suite for Apache Spark
https://codait.github.io/spark-bench/
Apache License 2.0

Understanding benchmark-output #178

Closed harshavardhana closed 5 years ago

harshavardhana commented 6 years ago

Spark-Bench version (version number, tag, or git commit hash)

2.3.0_0.4.0-RELEASE

Details of your cluster setup (Spark version, Standalone/Yarn/Local/Etc)

$ ls ../install
hadoop-3.1.0  hadoop-3.1.0.tar.gz  spark-2.3.1-bin-without-hadoop  spark-2.3.1-bin-without-hadoop.tgz

Scala version on your cluster

Not sure

Your exact configuration file (with system details anonymized for security)

$ cat minio-csvs.conf 
spark-bench = {
  spark-submit-config = [{
    suites-parallel = false
    workload-suites = [
      {
        descr = "Run SQL queries over s3a"
        benchmark-output = "console"
        parallel = true
        repeat = 10
        workloads = [
          {
            name = "sql"
            input = ["s3a://csvs/1.csv", "s3a://csvs/2.csv", "s3a://csvs/3.csv", "s3a://csvs/4.csv"]
            query = ["select * from input", "select * from input", "select * from input", "select * from input"]
            cache = false
          }
        ]
      }
    ]
  }]
}

Relevant stacktrace

Not an error

Description of your problem and any other relevant info

As the configuration above shows, I have set benchmark-output to console. I do get output, but I cannot work out how to interpret it. I am trying to determine how long it takes to download data from remote object storage and run queries against it. I have searched the source code and documentation, but I cannot find a clear description of what each output field means.

+----+-------------+-------------+---+-----+--------+-------------------+----------+--------+------+---------+----------------+-------------+--------------------+-----------------+--------------------+--------------------+-----------------+-----------------------+--------------------+--------------------+--------------------+
|name|    timestamp|total_Runtime|run|cache|saveTime|           queryStr|  loadTime|saveMode|output|queryTime|           input|numPartitions|   spark.driver.host|spark.driver.port|          spark.jars|      spark.app.name|spark.executor.id|spark.submit.deployMode|        spark.master|        spark.app.id|         description|
+----+-------------+-------------+---+-----+--------+-------------------+----------+--------+------+---------+----------------+-------------+--------------------+-----------------+--------------------+--------------------+-----------------+-----------------------+--------------------+--------------------+--------------------+
| sql|1538719721173|   5717293543|  0|false|       0|select * from input|5711993602|   error|      |  5299941|s3a://csvs/1.csv|             |ip-172-31-69-152....|            34689|file:/home/centos...|com.ibm.sparktc.s...|           driver|                 client|spark://172.31.69...|app-2018100506082...|Run SQL queries o...|
| sql|1538719721770|   6107413075|  0|false|       0|select * from input|6102674729|   error|      |  4738346|s3a://csvs/1.csv|             |ip-172-31-69-152....|            34689|file:/home/centos...|com.ibm.sparktc.s...|           driver|                 client|spark://172.31.69...|app-2018100506082...|Run SQL queries o...|
| sql|1538719721343|   5388230678|  0|false|       0|select * from input|5383225466|   error|      |  5005212|s3a://csvs/1.csv|             |ip-172-31-69-152....|            34689|file:/home/centos...|com.ibm.sparktc.s...|           driver|                 client|spark://172.31.69...|app-2018100506082...|Run SQL queries o...|
| sql|1538719719049|   5084348634|  0|false|       0|select * from input|5077074518|   error|      |  7274116|s3a://csvs/1.csv|             |ip-172-31-69-152....|            34689|file:/home/centos...|com.ibm.sparktc.s...|           driver|                 client|spark://172.31.69...|app-2018100506082...|Run SQL queries o...|
| sql|1538719722539|   5704801743|  0|false|       0|select * from input|5699921791|   error|      |  4879952|s3a://csvs/2.csv|             |ip-172-31-69-152....|            34689|file:/home/centos...|com.ibm.sparktc.s...|           driver|                 client|spark://172.31.69...|app-2018100506082...|Run SQL queries o...|
| sql|1538719722693|   6399647686|  0|false|       0|select * from input|6395172555|   error|      |  4475131|s3a://csvs/2.csv|             |ip-172-31-69-152....|            34689|file:/home/centos...|com.ibm.sparktc.s...|           driver|                 client|spark://172.31.69...|app-2018100506082...|Run SQL queries o...|
| sql|1538719719738|   4767888690|  0|false|       0|select * from input|4762033882|   error|      |  5854808|s3a://csvs/2.csv|             |ip-172-31-69-152....|            34689|file:/home/centos...|com.ibm.sparktc.s...|           driver|                 client|spark://172.31.69...|app-2018100506082...|Run SQL queries o...|
| sql|1538719720244|   3565054213|  0|false|       0|select * from input|3390200040|   error|      |174854173|s3a://csvs/2.csv|             |ip-172-31-69-152....|            34689|file:/home/centos...|com.ibm.sparktc.s...|           driver|                 client|spark://172.31.69...|app-2018100506082...|Run SQL queries o...|
| sql|1538719720282|   6406844961|  0|false|       0|select * from input|6400737569|   error|      |  6107392|s3a://csvs/3.csv|             |ip-172-31-69-152....|            34689|file:/home/centos...|com.ibm.sparktc.s...|           driver|                 client|spark://172.31.69...|app-2018100506082...|Run SQL queries o...|
| sql|1538719721486|   6291858458|  0|false|       0|select * from input|6287060762|   error|      |  4797696|s3a://csvs/3.csv|             |ip-172-31-69-152....|            34689|file:/home/centos...|com.ibm.sparktc.s...|           driver|                 client|spark://172.31.69...|app-2018100506082...|Run SQL queries o...|
| sql|1538719718956|   7063084324|  0|false|       0|select * from input|7056428183|   error|      |  6656141|s3a://csvs/3.csv|             |ip-172-31-69-152....|            34689|file:/home/centos...|com.ibm.sparktc.s...|           driver|                 client|spark://172.31.69...|app-2018100506082...|Run SQL queries o...|
| sql|1538719718042|   7365855398|  0|false|       0|select * from input|7358431781|   error|      |  7423617|s3a://csvs/3.csv|             |ip-172-31-69-152....|            34689|file:/home/centos...|com.ibm.sparktc.s...|           driver|                 client|spark://172.31.69...|app-2018100506082...|Run SQL queries o...|
| sql|1538719721559|   6418892521|  0|false|       0|select * from input|6414061025|   error|      |  4831496|s3a://csvs/4.csv|             |ip-172-31-69-152....|            34689|file:/home/centos...|com.ibm.sparktc.s...|           driver|                 client|spark://172.31.69...|app-2018100506082...|Run SQL queries o...|
| sql|1538719720334|   7209013412|  0|false|       0|select * from input|7201856680|   error|      |  7156732|s3a://csvs/4.csv|             |ip-172-31-69-152....|            34689|file:/home/centos...|com.ibm.sparktc.s...|           driver|                 client|spark://172.31.69...|app-2018100506082...|Run SQL queries o...|
| sql|1538719717793|   7607773942|  0|false|       0|select * from input|7600718613|   error|      |  7055329|s3a://csvs/4.csv|             |ip-172-31-69-152....|            34689|file:/home/centos...|com.ibm.sparktc.s...|           driver|                 client|spark://172.31.69...|app-2018100506082...|Run SQL queries o...|
| sql|1538719718895|   5399619365|  0|false|       0|select * from input|5393791095|   error|      |  5828270|s3a://csvs/4.csv|             |ip-172-31-69-152....|            34689|file:/home/centos...|com.ibm.sparktc.s...|           driver|                 client|spark://172.31.69...|app-2018100506082...|Run SQL queries o...|
| sql|1538719734284|   6257903150|  1|false|       0|select * from input|6254028225|   error|      |  3874925|s3a://csvs/1.csv|             |ip-172-31-69-152....|            34689|file:/home/centos...|com.ibm.sparktc.s...|           driver|                 client|spark://172.31.69...|app-2018100506082...|Run SQL queries o...|
| sql|1538719735752|   2302247909|  1|false|       0|select * from input|2297868646|   error|      |  4379263|s3a://csvs/1.csv|             |ip-172-31-69-152....|            34689|file:/home/centos...|com.ibm.sparktc.s...|           driver|                 client|spark://172.31.69...|app-2018100506082...|Run SQL queries o...|
| sql|1538719733371|   5950960336|  1|false|       0|select * from input|5946366723|   error|      |  4593613|s3a://csvs/1.csv|             |ip-172-31-69-152....|            34689|file:/home/centos...|com.ibm.sparktc.s...|           driver|                 client|spark://172.31.69...|app-2018100506082...|Run SQL queries o...|
| sql|1538719733328|   4773555907|  1|false|       0|select * from input|4769022288|   error|      |  4533619|s3a://csvs/1.csv|             |ip-172-31-69-152....|            34689|file:/home/centos...|com.ibm.sparktc.s...|           driver|                 client|spark://172.31.69...|app-2018100506082...|Run SQL queries o...|
+----+-------------+-------------+---+-----+--------+-------------------+----------+--------+------+---------+----------------+-------------+--------------------+-----------------+--------------------+--------------------+-----------------+-----------------------+--------------------+--------------------+--------------------+
only showing top 20 rows

Could you help me understand these query results, or at least point me towards documentation that provides more details?

harshavardhana commented 6 years ago

anyone?

harshavardhana commented 6 years ago

anyone? no one?

ecurtin commented 6 years ago

Hi @harshavardhana. Getting an exact benchmark on the time it takes to fetch data from object storage and run queries is actually a difficult problem because jobs in Spark are lazily evaluated and evaluated in parallel. So the data will not be fetched until it is required in the execution graph, and nodes will fetch and process in parallel.

The best approach I have found for this within spark-bench is still an approximation: run something like a SELECT COUNT(*). The measurement will still include the processing time needed to do the count, but the hope is that this is minimal compared to the I/O time. This would also not be a fair way to compare I/O in Spark against a different framework where I/O and CPU are more clearly separated because they are not lazily evaluated.
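The lazy-evaluation point and the count-style action can be illustrated with a plain-Python analogy. This is only a sketch using generators, not Spark code: `fetch_records` and `timed_ns` are hypothetical names standing in for a remote read and a timer. It shows that building the pipeline fetches nothing, and that the time measured around the count covers both the fetch and the counting.

```python
import time

fetch_count = 0  # counts how many records were actually "fetched"

def fetch_records(n):
    """Hypothetical stand-in for a remote read; yields records lazily."""
    global fetch_count
    for i in range(n):
        fetch_count += 1
        yield i

def timed_ns(fn):
    """Run fn() and return (result, elapsed nanoseconds)."""
    start = time.perf_counter_ns()
    result = fn()
    return result, time.perf_counter_ns() - start

# Building the pipeline does no I/O: nothing has been fetched yet.
pipeline, build_ns = timed_ns(lambda: (r for r in fetch_records(1000)))
fetched_after_build = fetch_count

# A count-style action forces the whole pipeline; the measured time
# covers both the fetch and the (cheap) counting, and cannot separate them.
row_count, action_ns = timed_ns(lambda: sum(1 for _ in pipeline))
fetched_after_action = fetch_count

print("fetched after build: ", fetched_after_build)
print("fetched after action:", fetched_after_action)
print("rows counted:        ", row_count)
```

The analogy is deliberately single-threaded; in Spark the fetch and processing are also spread across executors, which makes the two even harder to separate.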

It's also possible to use operating-system level utilities to do I/O measurements outside of the context of spark-bench. That is outside my experience so I can't provide anything further than the suggestion.

harshavardhana commented 5 years ago

@ecurtin do you have any more information on how to interpret the final numbers printed? For example, what do total_runtime and the many other columns signify here? Is it just the time the query took?

Are there any tools to humanize the output?

eddytruyen commented 3 years ago

@harshavardhana Did you ever find out what the time units of total runtime and query time are? I guess this is CPU time in nanoseconds.

ecurtin commented 3 years ago

Hey y'all.

timestamp is the unix time in milliseconds when the workload started.

Individual action times (e.g., load from cache) are measured in nanoseconds, and total_runtime, which is the sum of the individual actions, is also in nanoseconds.
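To make those units concrete, here is a small Python sketch that converts the first row of the console output from the original post into human-readable values. The numbers are copied from that row; the helper names `ms_to_utc` and `ns_to_seconds` are made up for this example and are not part of spark-bench.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the first row of the console output in the original post.
row = {
    "timestamp": 1538719721173,   # Unix time, milliseconds
    "total_Runtime": 5717293543,  # nanoseconds
    "loadTime": 5711993602,       # nanoseconds
    "queryTime": 5299941,         # nanoseconds
    "saveTime": 0,                # nanoseconds
}

def ms_to_utc(ms):
    """Convert Unix milliseconds to a timezone-aware UTC datetime."""
    whole_seconds = datetime.fromtimestamp(ms // 1000, tz=timezone.utc)
    return whole_seconds + timedelta(milliseconds=ms % 1000)

def ns_to_seconds(ns):
    """Convert nanoseconds to seconds."""
    return ns / 1_000_000_000

started = ms_to_utc(row["timestamp"])
parts_sum = row["loadTime"] + row["queryTime"] + row["saveTime"]

print("started:", started.isoformat())
for col in ("total_Runtime", "loadTime", "queryTime", "saveTime"):
    print(f"{col}: {ns_to_seconds(row[col]):.3f} s")
print("parts add up to total:", parts_sum == row["total_Runtime"])
```

For this row, loadTime is about 5.71 s and queryTime about 5.3 ms, and loadTime + queryTime + saveTime equals total_Runtime exactly.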

Here's the time function: GeneralFunctions.scala#L46

Here's the time function in use in SparkPi, which has no finer-grained actions to add up, so total_runtime covers everything: SparkPi.scala#L63
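For readers who don't want to open the source, the pattern described above can be sketched in Python. This is an illustration only, assuming the time function wraps a block of work and reports elapsed nanoseconds, as the description suggests; it is not the spark-bench Scala code, and `timed` and the three stand-in steps are hypothetical.

```python
import time

def timed(block):
    """Run block() and return (elapsed_ns, result).

    Hypothetical helper for illustration; not the spark-bench implementation.
    """
    start = time.perf_counter_ns()
    result = block()
    return time.perf_counter_ns() - start, result

# Hypothetical stand-ins for a workload's load, query, and save steps.
load_ns, data = timed(lambda: list(range(100_000)))
query_ns, total = timed(lambda: sum(data))
save_ns, _ = timed(lambda: None)

# The total is the sum of the individually timed actions, in nanoseconds.
total_runtime_ns = load_ns + query_ns + save_ns

print(f"load:  {load_ns} ns")
print(f"query: {query_ns} ns")
print(f"save:  {save_ns} ns")
print(f"total: {total_runtime_ns} ns")
```

Note that this measures wall-clock elapsed time around each step, not CPU time.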