harshavardhana closed this issue 5 years ago
anyone?
anyone? no one?
Hi @harshavardhana. Getting an exact benchmark of the time it takes to fetch data from object storage and run queries is actually a difficult problem, because Spark jobs are evaluated lazily and in parallel. The data will not be fetched until it is required in the execution graph, and nodes will fetch and process in parallel.
The best approximation I have found within spark-bench is to run something like a SELECT COUNT(*). The measurement still includes the processing time for the count itself; the hope is that it is minimal compared to the I/O time. It would also not be a fair way to compare I/O in Spark against a framework that is not lazily evaluated, where I/O and CPU time are more clearly separated.
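To illustrate why the count is only an approximation, here is a minimal plain-Python analogy, not Spark code. A generator stands in for a lazily evaluated dataset, and `fake_fetch` and its per-record delay are hypothetical stand-ins for reading from object storage. Building the pipeline costs almost nothing; the time only shows up when the counting action forces the fetch.

```python
import time


def fake_fetch(n, delay_s=0.001):
    """Hypothetical stand-in for lazily reading n records from object storage."""
    for i in range(n):
        time.sleep(delay_s)  # simulated I/O latency per record
        yield i


# "Transformation": creating the generator does no fetching at all.
start = time.perf_counter_ns()
pipeline = fake_fetch(50)
build_ns = time.perf_counter_ns() - start

# "Action": counting forces every record to be fetched, so this timing
# contains the simulated I/O plus the (small) cost of counting.
start = time.perf_counter_ns()
count = sum(1 for _ in pipeline)
count_ns = time.perf_counter_ns() - start

print(f"build: {build_ns} ns, count: {count_ns} ns, records: {count}")
```

The timing of the count cannot be split into "I/O" and "counting" from the outside, which is the same limitation described above.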
It's also possible to use operating-system level utilities to do I/O measurements outside of the context of spark-bench. That is outside my experience so I can't provide anything further than the suggestion.
@ecurtin do you have any more information about how to interpret the final numbers printed? For example, there is total_runtime and many other columns: what do they signify here? Just the time the query took?
are there any tools to humanize the output?
@harshavardhana Did you ever find out what the time units of total runtime and query time are? I guess this is CPU time in nanoseconds.
Hey y'all.
timestamp is the Unix time in milliseconds when the workload started.
Individual action times (e.g. load from cache) are measured in nanoseconds, and total_runtime, which is the sum of the individual action times, is also in nanoseconds.
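As for humanizing the output, I'm not aware of a built-in tool, but given the units above a small post-processing helper is enough. This is a rough sketch in Python; `humanize_ns`, `humanize_timestamp_ms`, and the sample values are hypothetical, and only the field names timestamp and total_runtime come from the actual output.

```python
from datetime import datetime, timezone


def humanize_ns(ns):
    """Render a nanosecond duration using the largest unit that fits."""
    for unit, factor in (("s", 1_000_000_000), ("ms", 1_000_000), ("us", 1_000)):
        if ns >= factor:
            return f"{ns / factor:.3f} {unit}"
    return f"{ns} ns"


def humanize_timestamp_ms(ms):
    """Render a Unix timestamp in milliseconds as an ISO 8601 UTC string."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).isoformat()


# Made-up sample values, for illustration only.
row = {"timestamp": 1528000000000, "total_runtime": 4_200_000_000}
print(humanize_timestamp_ms(row["timestamp"]), humanize_ns(row["total_runtime"]))
```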
Here's the time function: GeneralFunctions.scala#L46
Here's the time function in use in SparkPi, which doesn't have any more granularly defined actions that can be added up, so total_runtime is everything: SparkPi.scala#L63
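For anyone who doesn't want to read the Scala, the general pattern looks roughly like the Python sketch below. This is an analogy under my own assumptions, not a port of the spark-bench code: `timed` and the "load" and "query" actions are hypothetical names. Each action is wrapped in a nanosecond timer, and total_runtime is the sum of the per-action times.

```python
import time


def timed(block):
    """Run block() and return (elapsed_nanoseconds, result)."""
    start = time.perf_counter_ns()
    result = block()
    return time.perf_counter_ns() - start, result


# Two hypothetical actions, each timed separately.
load_ns, data = timed(lambda: list(range(100_000)))
query_ns, total = timed(lambda: sum(data))

# total_runtime is the sum of the individual action times, in nanoseconds.
row = {"load": load_ns, "query": query_ns, "total_runtime": load_ns + query_ns}
print(row)
```

A workload with a single timed block, as in the SparkPi case, would simply report that one measurement as total_runtime.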
Spark-Bench version (version number, tag, or git commit hash)
2.3.0_0.4.0-RELEASE
Details of your cluster setup (Spark version, Standalone/Yarn/Local/Etc)
Scala version on your cluster
Not sure
Your exact configuration file (with system details anonymized for security)
Relevant stacktrace
Not an error
Description of your problem and any other relevant info
As you can see, I have configured benchmark-output to go to the console, and I do get values, but I am unable to understand the output or how to interpret it. I am trying to ascertain how much time it takes to download data from a remote object storage and run queries. I have searched through the source code and documentation, but I cannot find any good indication of what each output field means.
Could you help me understand these query results, or at least point me towards documentation that provides more details?