Arash-Afshar opened this issue 6 years ago
I tracked it down to this line: https://github.com/CODAIT/spark-bench/blob/be31655ecd8eac5f1b7141cbc5bd6ea640ae0ddc/utils/src/main/scala/com/ibm/sparktc/sparkbench/utils/SparkFuncs.scala#L52
When calling the graph data generator, the output is a .txt file, but the function defined at that line does not recognize txt as a valid extension.
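For illustration, a minimal sketch of an extension-based format check in the spirit of `SparkFuncs.verifyFormatOrThrow` (the object and method names and the set of accepted formats here are assumptions, not the actual spark-bench code): adding a `"txt"` case is what would let the graph generator's output path pass validation.

```scala
// Hypothetical sketch of an extension-based save-format check.
// The real SparkFuncs implementation may differ; names are illustrative.
object FormatCheck {
  def formatFromPath(path: String): Option[String] =
    path.split('.').lastOption.map(_.toLowerCase) match {
      case Some("csv")     => Some("csv")
      case Some("parquet") => Some("parquet")
      // The missing case: accept *.txt, as required by graph-data-generator.
      case Some("txt")     => Some("text")
      // Anything else would trigger "Unrecognized or unspecified save format".
      case _               => None
    }
}
```

With a check like this, `formatFromPath("hdfs:///one-thousand-vertex-graph.txt")` would resolve to a text format instead of falling through to the error branch.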
I don't think it supports text output. You could try changing the output file suffix to .csv.
That would not work. The documentation for the graph data generator states that the output should be *.txt: https://codait.github.io/spark-bench/workloads/data-generator-graph/
I also tried a non-txt extension, and it failed with a different error message telling me to choose txt.
This could be fixed by this pull request: https://github.com/CODAIT/spark-bench/pull/180
Spark-Bench version (version number, tag, or git commit hash):
spark-bench_2.3.0_0.4.0-RELEASE

Details of your cluster setup (Spark version, Standalone/Yarn/Local/Etc):
Spark 2.2.0, Yarn

Scala version on your cluster:

Your exact configuration file (with system details anonymized for security):
spark-bench = {
  spark-submit-config = [{
    spark-args = {
      master = "yarn"
      executor-memory = 5G
      num-executors = 5
    }
    workload-suites = [
      {
        descr = "Graph Gen"
        benchmark-output = "console"
        workloads = [
          {
            name = "graph-data-generator"
            vertices = 1000
            output = "hdfs:///one-thousand-vertex-graph.txt"
          }
        ]
      }
    ]
  }]
}
Relevant stacktrace:
18/04/30 22:21:00 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (**:40656) with ID 1
18/04/30 22:21:00 INFO storage.BlockManagerMasterEndpoint: Registering block manager **:40021 with 2.5 GB RAM, BlockManagerId(1, *****, 40021, None)
18/04/30 22:21:15 INFO cluster.YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms)
Exception in thread "main" java.lang.Exception: Unrecognized or unspecified save format. Please check the file extension or add a file format to your arguments: Some(hdfs:///one-thousand-vertex-graph.txt)
	at com.ibm.sparktc.sparkbench.utils.SparkFuncs$.verifyFormatOrThrow(SparkFuncs.scala:92)
	at com.ibm.sparktc.sparkbench.utils.SparkFuncs$.verifyOutput(SparkFuncs.scala:35)
	at com.ibm.sparktc.sparkbench.workload.Workload$class.run(Workload.scala:49)
	at com.ibm.sparktc.sparkbench.datageneration.GraphDataGen.run(GraphDataGen.scala:90)
	at com.ibm.sparktc.sparkbench.workload.SuiteKickoff$$anonfun$com$ibm$sparktc$sparkbench$workload$SuiteKickoff$$runSerially$1.apply(SuiteKickoff.scala:98)
	at com.ibm.sparktc.sparkbench.workload.SuiteKickoff$$anonfun$com$ibm$sparktc$sparkbench$workload$SuiteKickoff$$runSerially$1.apply(SuiteKickoff.scala:98)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.immutable.List.map(List.scala:285)
	at com.ibm.sparktc.sparkbench.workload.SuiteKickoff$.com$ibm$sparktc$sparkbench$workload$SuiteKickoff$$runSerially(SuiteKickoff.scala:98)
	at com.ibm.sparktc.sparkbench.workload.SuiteKickoff$$anonfun$2.apply(SuiteKickoff.scala:72)
	at com.ibm.sparktc.sparkbench.workload.SuiteKickoff$$anonfun$2.apply(SuiteKickoff.scala:67)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.immutable.Range.foreach(Range.scala:160)
	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
	at com.ibm.sparktc.sparkbench.workload.SuiteKickoff$.run(SuiteKickoff.scala:67)
	at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$$anonfun$com$ibm$sparktc$sparkbench$workload$MultipleSuiteKickoff$$runSuitesSerially$1.apply(MultipleSuiteKickoff.scala:38)
	at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$$anonfun$com$ibm$sparktc$sparkbench$workload$MultipleSuiteKickoff$$runSuitesSerially$1.apply(MultipleSuiteKickoff.scala:38)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$.com$ibm$sparktc$sparkbench$workload$MultipleSuiteKickoff$$runSuitesSerially(MultipleSuiteKickoff.scala:38)
	at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$$anonfun$run$1.apply(MultipleSuiteKickoff.scala:28)
	at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$$anonfun$run$1.apply(MultipleSuiteKickoff.scala:25)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at com.ibm.sparktc.sparkbench.workload.MultipleSuiteKickoff$.run(MultipleSuiteKickoff.scala:25)
	at com.ibm.sparktc.sparkbench.cli.CLIKickoff$.main(CLIKickoff.scala:30)
	at com.ibm.sparktc.sparkbench.cli.CLIKickoff.main(CLIKickoff.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
18/04/30 22:21:15 INFO spark.SparkContext: Invoking stop() from shutdown hook
Description of your problem and any other relevant info:
Despite using "hdfs:///one-thousand-vertex-graph.txt" as the output, it fails with an unrecognized output format error (see the stacktrace above).