databricks / spark-sql-perf

Quantile discretizer benchmark #135

Closed: WeichenXu123 closed this 6 years ago

WeichenXu123 commented 6 years ago

What this PR does

Adds a QuantileDiscretizer benchmark.

lu-wang-dl commented 6 years ago

I ran the tests successfully using the small test YAML config. This looks fine to me.

WeichenXu123 commented 6 years ago

@jkbradley Updated.

I reformatted the code in MLParams.copy so that each param is on its own line. This should help avoid merge conflicts and make them easier to resolve when they do happen.

jkbradley commented 6 years ago

The changes look fine except that the test won't run with the updated config file. Please test it! Strangely, the test does not seem to pick up the relativeError default value (and the number of buckets is not specified).

WeichenXu123 commented 6 years ago

Here is the log excerpt for QuantileDiscretizer:

[info] Running execution com.databricks.spark.sql.perf.mllib.feature.QuantileDiscretizer iteration: 1, StandardRun=true
[error] 18/05/02 16:59:28 INFO MLPipelineStageBenchmarkable: com.databricks.spark.sql.perf.mllib.MLPipelineStageBenchmarkable@590f3ff2: benchmark
[error] 18/05/02 16:59:28 INFO MLPipelineStageBenchmarkable: com.databricks.spark.sql.perf.mllib.MLPipelineStageBenchmarkable@590f3ff2 beforeBenchmark
[error] 18/05/02 16:59:28 INFO BlockManagerInfo: Removed broadcast_466_piece0 on 127.0.0.1:55023 in memory (size: 3.7 KB, free: 2004.6 MB)
[error] 18/05/02 16:59:28 INFO BlockManager: Removing RDD 1208
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned RDD 1208
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned accumulator 9596
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned accumulator 9588
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned accumulator 9598
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned accumulator 9583
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned accumulator 9652
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned accumulator 9586
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned accumulator 9657
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned accumulator 9664
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned accumulator 9584
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned shuffle 138
[error] 18/05/02 16:59:28 INFO BlockManager: Removing RDD 1192
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned RDD 1192
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned accumulator 9651
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned accumulator 9655
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned accumulator 9656
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned accumulator 9653
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned accumulator 9592
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned accumulator 9713
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned accumulator 9590
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned shuffle 139
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned accumulator 9593
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned accumulator 9587
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned accumulator 9661
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned accumulator 9594
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned accumulator 9597
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned accumulator 9591
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned accumulator 9589
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned accumulator 9654
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned accumulator 9662
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned accumulator 9650
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned accumulator 9649
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned accumulator 9660
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned accumulator 9595
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned accumulator 9659
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned accumulator 9585
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned accumulator 9663
[error] 18/05/02 16:59:28 INFO ContextCleaner: Cleaned accumulator 9658
[error] 18/05/02 16:59:29 INFO SparkContext: Starting job: count at MLPipelineStageBenchmarkable.scala:35
[error] 18/05/02 16:59:29 INFO DAGScheduler: Registering RDD 1231 (count at MLPipelineStageBenchmarkable.scala:35)
[error] 18/05/02 16:59:29 INFO DAGScheduler: Got job 237 (count at MLPipelineStageBenchmarkable.scala:35) with 1 output partitions
[error] 18/05/02 16:59:29 INFO DAGScheduler: Final stage: ResultStage 553 (count at MLPipelineStageBenchmarkable.scala:35)
[error] 18/05/02 16:59:29 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 552)
[error] 18/05/02 16:59:29 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 552)
[error] 18/05/02 16:59:29 INFO DAGScheduler: Submitting ShuffleMapStage 552 (MapPartitionsRDD[1231] at count at MLPipelineStageBenchmarkable.scala:35), which has no missing parents
[error] 18/05/02 16:59:29 INFO MemoryStore: Block broadcast_468 stored as values in memory (estimated size 26.5 KB, free 2004.6 MB)
[error] 18/05/02 16:59:29 INFO MemoryStore: Block broadcast_468_piece0 stored as bytes in memory (estimated size 10.1 KB, free 2004.6 MB)
[error] 18/05/02 16:59:29 INFO BlockManagerInfo: Added broadcast_468_piece0 in memory on 127.0.0.1:55023 (size: 10.1 KB, free: 2004.6 MB)
[error] 18/05/02 16:59:29 INFO SparkContext: Created broadcast 468 from broadcast at DAGScheduler.scala:1006
[error] 18/05/02 16:59:29 INFO DAGScheduler: Submitting 3 missing tasks from ShuffleMapStage 552 (MapPartitionsRDD[1231] at count at MLPipelineStageBenchmarkable.scala:35) (first 15 tasks are for partitions Vector(0, 1, 2))
[error] 18/05/02 16:59:29 INFO TaskSchedulerImpl: Adding task set 552.0 with 3 tasks
[error] 18/05/02 16:59:29 INFO TaskSetManager: Starting task 0.0 in stage 552.0 (TID 1004, localhost, executor driver, partition 0, PROCESS_LOCAL, 4950 bytes)
[error] 18/05/02 16:59:29 INFO TaskSetManager: Starting task 1.0 in stage 552.0 (TID 1005, localhost, executor driver, partition 1, PROCESS_LOCAL, 4950 bytes)
[error] 18/05/02 16:59:29 INFO Executor: Running task 1.0 in stage 552.0 (TID 1005)
[error] 18/05/02 16:59:29 INFO Executor: Running task 0.0 in stage 552.0 (TID 1004)
[error] 18/05/02 16:59:29 INFO MemoryStore: Block rdd_1228_1 stored as values in memory (estimated size 520.0 B, free 2004.6 MB)
[error] 18/05/02 16:59:29 INFO MemoryStore: Block rdd_1228_0 stored as values in memory (estimated size 520.0 B, free 2004.6 MB)
[error] 18/05/02 16:59:29 INFO BlockManagerInfo: Added rdd_1228_0 in memory on 127.0.0.1:55023 (size: 520.0 B, free: 2004.6 MB)
[error] 18/05/02 16:59:29 INFO BlockManagerInfo: Added rdd_1228_1 in memory on 127.0.0.1:55023 (size: 520.0 B, free: 2004.6 MB)
[error] 18/05/02 16:59:29 INFO Executor: Finished task 0.0 in stage 552.0 (TID 1004). 2527 bytes result sent to driver
[error] 18/05/02 16:59:29 INFO Executor: Finished task 1.0 in stage 552.0 (TID 1005). 2527 bytes result sent to driver
[error] 18/05/02 16:59:29 INFO TaskSetManager: Starting task 2.0 in stage 552.0 (TID 1006, localhost, executor driver, partition 2, PROCESS_LOCAL, 4950 bytes)
[error] 18/05/02 16:59:29 INFO Executor: Running task 2.0 in stage 552.0 (TID 1006)
[error] 18/05/02 16:59:29 INFO TaskSetManager: Finished task 0.0 in stage 552.0 (TID 1004) in 8 ms on localhost (executor driver) (1/3)
[error] 18/05/02 16:59:29 INFO TaskSetManager: Finished task 1.0 in stage 552.0 (TID 1005) in 8 ms on localhost (executor driver) (2/3)
[error] 18/05/02 16:59:29 INFO MemoryStore: Block rdd_1228_2 stored as values in memory (estimated size 528.0 B, free 2004.6 MB)
[error] 18/05/02 16:59:29 INFO BlockManagerInfo: Added rdd_1228_2 in memory on 127.0.0.1:55023 (size: 528.0 B, free: 2004.6 MB)
[error] 18/05/02 16:59:29 INFO Executor: Finished task 2.0 in stage 552.0 (TID 1006). 2527 bytes result sent to driver
[error] 18/05/02 16:59:29 INFO TaskSetManager: Finished task 2.0 in stage 552.0 (TID 1006) in 8 ms on localhost (executor driver) (3/3)
[error] 18/05/02 16:59:29 INFO TaskSchedulerImpl: Removed TaskSet 552.0, whose tasks have all completed, from pool 
[error] 18/05/02 16:59:29 INFO DAGScheduler: ShuffleMapStage 552 (count at MLPipelineStageBenchmarkable.scala:35) finished in 0.015 s
[error] 18/05/02 16:59:29 INFO DAGScheduler: looking for newly runnable stages
[error] 18/05/02 16:59:29 INFO DAGScheduler: running: Set()
[error] 18/05/02 16:59:29 INFO DAGScheduler: waiting: Set(ResultStage 553)
[error] 18/05/02 16:59:29 INFO DAGScheduler: failed: Set()
[error] 18/05/02 16:59:29 INFO DAGScheduler: Submitting ResultStage 553 (MapPartitionsRDD[1234] at count at MLPipelineStageBenchmarkable.scala:35), which has no missing parents
[error] 18/05/02 16:59:29 INFO MemoryStore: Block broadcast_469 stored as values in memory (estimated size 7.0 KB, free 2004.6 MB)
[error] 18/05/02 16:59:29 INFO MemoryStore: Block broadcast_469_piece0 stored as bytes in memory (estimated size 3.7 KB, free 2004.6 MB)
[error] 18/05/02 16:59:29 INFO BlockManagerInfo: Added broadcast_469_piece0 in memory on 127.0.0.1:55023 (size: 3.7 KB, free: 2004.6 MB)
[error] 18/05/02 16:59:29 INFO SparkContext: Created broadcast 469 from broadcast at DAGScheduler.scala:1006
[error] 18/05/02 16:59:29 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 553 (MapPartitionsRDD[1234] at count at MLPipelineStageBenchmarkable.scala:35) (first 15 tasks are for partitions Vector(0))
[error] 18/05/02 16:59:29 INFO TaskSchedulerImpl: Adding task set 553.0 with 1 tasks
[error] 18/05/02 16:59:29 INFO TaskSetManager: Starting task 0.0 in stage 553.0 (TID 1007, localhost, executor driver, partition 0, ANY, 4726 bytes)
[error] 18/05/02 16:59:29 INFO Executor: Running task 0.0 in stage 553.0 (TID 1007)
[error] 18/05/02 16:59:29 INFO ShuffleBlockFetcherIterator: Getting 3 non-empty blocks out of 3 blocks
[error] 18/05/02 16:59:29 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
[error] 18/05/02 16:59:29 INFO Executor: Finished task 0.0 in stage 553.0 (TID 1007). 1538 bytes result sent to driver
[error] 18/05/02 16:59:29 INFO TaskSetManager: Finished task 0.0 in stage 553.0 (TID 1007) in 10 ms on localhost (executor driver) (1/1)
[error] 18/05/02 16:59:29 INFO TaskSchedulerImpl: Removed TaskSet 553.0, whose tasks have all completed, from pool 
[error] 18/05/02 16:59:29 INFO DAGScheduler: ResultStage 553 (count at MLPipelineStageBenchmarkable.scala:35) finished in 0.010 s
[error] 18/05/02 16:59:29 INFO DAGScheduler: Job 237 finished: count at MLPipelineStageBenchmarkable.scala:35, took 0.033735 s
[error] 18/05/02 16:59:29 INFO SparkContext: Starting job: count at MLPipelineStageBenchmarkable.scala:38
[error] 18/05/02 16:59:29 INFO DAGScheduler: Registering RDD 1247 (count at MLPipelineStageBenchmarkable.scala:38)
[error] 18/05/02 16:59:29 INFO DAGScheduler: Got job 238 (count at MLPipelineStageBenchmarkable.scala:38) with 1 output partitions
[error] 18/05/02 16:59:29 INFO DAGScheduler: Final stage: ResultStage 555 (count at MLPipelineStageBenchmarkable.scala:38)
[error] 18/05/02 16:59:29 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 554)
[error] 18/05/02 16:59:29 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 554)
[error] 18/05/02 16:59:29 INFO DAGScheduler: Submitting ShuffleMapStage 554 (MapPartitionsRDD[1247] at count at MLPipelineStageBenchmarkable.scala:38), which has no missing parents
[error] 18/05/02 16:59:29 INFO MemoryStore: Block broadcast_470 stored as values in memory (estimated size 26.5 KB, free 2004.5 MB)
[error] 18/05/02 16:59:29 INFO MemoryStore: Block broadcast_470_piece0 stored as bytes in memory (estimated size 10.1 KB, free 2004.5 MB)
[error] 18/05/02 16:59:29 INFO BlockManagerInfo: Added broadcast_470_piece0 in memory on 127.0.0.1:55023 (size: 10.1 KB, free: 2004.6 MB)
[error] 18/05/02 16:59:29 INFO SparkContext: Created broadcast 470 from broadcast at DAGScheduler.scala:1006
[error] 18/05/02 16:59:29 INFO DAGScheduler: Submitting 3 missing tasks from ShuffleMapStage 554 (MapPartitionsRDD[1247] at count at MLPipelineStageBenchmarkable.scala:38) (first 15 tasks are for partitions Vector(0, 1, 2))
[error] 18/05/02 16:59:29 INFO TaskSchedulerImpl: Adding task set 554.0 with 3 tasks
[error] 18/05/02 16:59:29 INFO TaskSetManager: Starting task 0.0 in stage 554.0 (TID 1008, localhost, executor driver, partition 0, PROCESS_LOCAL, 4950 bytes)
[error] 18/05/02 16:59:29 INFO TaskSetManager: Starting task 1.0 in stage 554.0 (TID 1009, localhost, executor driver, partition 1, PROCESS_LOCAL, 4950 bytes)
[error] 18/05/02 16:59:29 INFO Executor: Running task 0.0 in stage 554.0 (TID 1008)
[error] 18/05/02 16:59:29 INFO Executor: Running task 1.0 in stage 554.0 (TID 1009)
[error] 18/05/02 16:59:29 INFO MemoryStore: Block rdd_1244_0 stored as values in memory (estimated size 520.0 B, free 2004.5 MB)
[error] 18/05/02 16:59:29 INFO BlockManagerInfo: Added rdd_1244_0 in memory on 127.0.0.1:55023 (size: 520.0 B, free: 2004.6 MB)
[error] 18/05/02 16:59:29 INFO MemoryStore: Block rdd_1244_1 stored as values in memory (estimated size 520.0 B, free 2004.5 MB)
[error] 18/05/02 16:59:29 INFO BlockManagerInfo: Added rdd_1244_1 in memory on 127.0.0.1:55023 (size: 520.0 B, free: 2004.6 MB)
[error] 18/05/02 16:59:29 INFO Executor: Finished task 0.0 in stage 554.0 (TID 1008). 2527 bytes result sent to driver
[error] 18/05/02 16:59:29 INFO Executor: Finished task 1.0 in stage 554.0 (TID 1009). 2527 bytes result sent to driver
[error] 18/05/02 16:59:29 INFO TaskSetManager: Starting task 2.0 in stage 554.0 (TID 1010, localhost, executor driver, partition 2, PROCESS_LOCAL, 4950 bytes)
[error] 18/05/02 16:59:29 INFO Executor: Running task 2.0 in stage 554.0 (TID 1010)
[error] 18/05/02 16:59:29 INFO TaskSetManager: Finished task 0.0 in stage 554.0 (TID 1008) in 6 ms on localhost (executor driver) (1/3)
[error] 18/05/02 16:59:29 INFO TaskSetManager: Finished task 1.0 in stage 554.0 (TID 1009) in 6 ms on localhost (executor driver) (2/3)
[error] 18/05/02 16:59:29 INFO MemoryStore: Block rdd_1244_2 stored as values in memory (estimated size 528.0 B, free 2004.5 MB)
[error] 18/05/02 16:59:29 INFO BlockManagerInfo: Added rdd_1244_2 in memory on 127.0.0.1:55023 (size: 528.0 B, free: 2004.6 MB)
[error] 18/05/02 16:59:29 INFO Executor: Finished task 2.0 in stage 554.0 (TID 1010). 2527 bytes result sent to driver
[error] 18/05/02 16:59:29 INFO TaskSetManager: Finished task 2.0 in stage 554.0 (TID 1010) in 6 ms on localhost (executor driver) (3/3)
[error] 18/05/02 16:59:29 INFO TaskSchedulerImpl: Removed TaskSet 554.0, whose tasks have all completed, from pool 
[error] 18/05/02 16:59:29 INFO DAGScheduler: ShuffleMapStage 554 (count at MLPipelineStageBenchmarkable.scala:38) finished in 0.012 s
[error] 18/05/02 16:59:29 INFO DAGScheduler: looking for newly runnable stages
[error] 18/05/02 16:59:29 INFO DAGScheduler: running: Set()
[error] 18/05/02 16:59:29 INFO DAGScheduler: waiting: Set(ResultStage 555)
[error] 18/05/02 16:59:29 INFO DAGScheduler: failed: Set()
[error] 18/05/02 16:59:29 INFO DAGScheduler: Submitting ResultStage 555 (MapPartitionsRDD[1250] at count at MLPipelineStageBenchmarkable.scala:38), which has no missing parents
[error] 18/05/02 16:59:29 INFO MemoryStore: Block broadcast_471 stored as values in memory (estimated size 7.0 KB, free 2004.5 MB)
[error] 18/05/02 16:59:29 INFO MemoryStore: Block broadcast_471_piece0 stored as bytes in memory (estimated size 3.7 KB, free 2004.5 MB)
[error] 18/05/02 16:59:29 INFO BlockManagerInfo: Added broadcast_471_piece0 in memory on 127.0.0.1:55023 (size: 3.7 KB, free: 2004.6 MB)
[error] 18/05/02 16:59:29 INFO SparkContext: Created broadcast 471 from broadcast at DAGScheduler.scala:1006
[error] 18/05/02 16:59:29 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 555 (MapPartitionsRDD[1250] at count at MLPipelineStageBenchmarkable.scala:38) (first 15 tasks are for partitions Vector(0))
[error] 18/05/02 16:59:29 INFO TaskSchedulerImpl: Adding task set 555.0 with 1 tasks
[error] 18/05/02 16:59:29 INFO TaskSetManager: Starting task 0.0 in stage 555.0 (TID 1011, localhost, executor driver, partition 0, ANY, 4726 bytes)
[error] 18/05/02 16:59:29 INFO Executor: Running task 0.0 in stage 555.0 (TID 1011)
[error] 18/05/02 16:59:29 INFO ShuffleBlockFetcherIterator: Getting 3 non-empty blocks out of 3 blocks
[error] 18/05/02 16:59:29 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
[error] 18/05/02 16:59:29 INFO Executor: Finished task 0.0 in stage 555.0 (TID 1011). 1538 bytes result sent to driver
[error] 18/05/02 16:59:29 INFO TaskSetManager: Finished task 0.0 in stage 555.0 (TID 1011) in 9 ms on localhost (executor driver) (1/1)
[error] 18/05/02 16:59:29 INFO TaskSchedulerImpl: Removed TaskSet 555.0, whose tasks have all completed, from pool 
[error] 18/05/02 16:59:29 INFO DAGScheduler: ResultStage 555 (count at MLPipelineStageBenchmarkable.scala:38) finished in 0.009 s
[error] 18/05/02 16:59:29 INFO DAGScheduler: Job 238 finished: count at MLPipelineStageBenchmarkable.scala:38, took 0.029755 s
[error] 18/05/02 16:59:29 INFO MLPipelineStageBenchmarkable: com.databricks.spark.sql.perf.mllib.MLPipelineStageBenchmarkable@590f3ff2: train: trainingSet=StructType(StructField(inputCol,DoubleType,false))
[error] 18/05/02 16:59:29 INFO SparkContext: Starting job: approxQuantile at QuantileDiscretizer.scala:151
[error] 18/05/02 16:59:29 INFO DAGScheduler: Got job 239 (approxQuantile at QuantileDiscretizer.scala:151) with 3 output partitions
[error] 18/05/02 16:59:29 INFO DAGScheduler: Final stage: ResultStage 556 (approxQuantile at QuantileDiscretizer.scala:151)
[error] 18/05/02 16:59:29 INFO DAGScheduler: Parents of final stage: List()
[error] 18/05/02 16:59:29 INFO DAGScheduler: Missing parents: List()
[error] 18/05/02 16:59:29 INFO DAGScheduler: Submitting ResultStage 556 (MapPartitionsRDD[1253] at approxQuantile at QuantileDiscretizer.scala:151), which has no missing parents
[error] 18/05/02 16:59:29 INFO MemoryStore: Block broadcast_472 stored as values in memory (estimated size 23.7 KB, free 2004.5 MB)
[error] 18/05/02 16:59:29 INFO MemoryStore: Block broadcast_472_piece0 stored as bytes in memory (estimated size 9.3 KB, free 2004.5 MB)
[error] 18/05/02 16:59:29 INFO BlockManagerInfo: Added broadcast_472_piece0 in memory on 127.0.0.1:55023 (size: 9.3 KB, free: 2004.6 MB)
[error] 18/05/02 16:59:29 INFO SparkContext: Created broadcast 472 from broadcast at DAGScheduler.scala:1006
[error] 18/05/02 16:59:29 INFO DAGScheduler: Submitting 3 missing tasks from ResultStage 556 (MapPartitionsRDD[1253] at approxQuantile at QuantileDiscretizer.scala:151) (first 15 tasks are for partitions Vector(0, 1, 2))
[error] 18/05/02 16:59:29 INFO TaskSchedulerImpl: Adding task set 556.0 with 3 tasks
[error] 18/05/02 16:59:29 INFO TaskSetManager: Starting task 0.0 in stage 556.0 (TID 1012, localhost, executor driver, partition 0, PROCESS_LOCAL, 4961 bytes)
[error] 18/05/02 16:59:29 INFO TaskSetManager: Starting task 1.0 in stage 556.0 (TID 1013, localhost, executor driver, partition 1, PROCESS_LOCAL, 4961 bytes)
[error] 18/05/02 16:59:29 INFO Executor: Running task 0.0 in stage 556.0 (TID 1012)
[error] 18/05/02 16:59:29 INFO Executor: Running task 1.0 in stage 556.0 (TID 1013)
[error] 18/05/02 16:59:29 INFO BlockManager: Found block rdd_1244_0 locally
[error] 18/05/02 16:59:29 INFO BlockManager: Found block rdd_1244_1 locally
[error] 18/05/02 16:59:29 INFO Executor: Finished task 1.0 in stage 556.0 (TID 1013). 2504 bytes result sent to driver
[error] 18/05/02 16:59:29 INFO Executor: Finished task 0.0 in stage 556.0 (TID 1012). 2504 bytes result sent to driver
[error] 18/05/02 16:59:29 INFO TaskSetManager: Starting task 2.0 in stage 556.0 (TID 1014, localhost, executor driver, partition 2, PROCESS_LOCAL, 4961 bytes)
[error] 18/05/02 16:59:29 INFO Executor: Running task 2.0 in stage 556.0 (TID 1014)
[error] 18/05/02 16:59:29 INFO TaskSetManager: Finished task 1.0 in stage 556.0 (TID 1013) in 4 ms on localhost (executor driver) (1/3)
[error] 18/05/02 16:59:29 INFO TaskSetManager: Finished task 0.0 in stage 556.0 (TID 1012) in 4 ms on localhost (executor driver) (2/3)
[error] 18/05/02 16:59:29 INFO BlockManager: Found block rdd_1244_2 locally
[error] 18/05/02 16:59:29 INFO Executor: Finished task 2.0 in stage 556.0 (TID 1014). 2517 bytes result sent to driver
[error] 18/05/02 16:59:29 INFO TaskSetManager: Finished task 2.0 in stage 556.0 (TID 1014) in 4 ms on localhost (executor driver) (3/3)
[error] 18/05/02 16:59:29 INFO TaskSchedulerImpl: Removed TaskSet 556.0, whose tasks have all completed, from pool 
[error] 18/05/02 16:59:29 INFO DAGScheduler: ResultStage 556 (approxQuantile at QuantileDiscretizer.scala:151) finished in 0.007 s
[error] 18/05/02 16:59:29 INFO DAGScheduler: Job 239 finished: approxQuantile at QuantileDiscretizer.scala:151, took 0.011729 s
[error] 18/05/02 16:59:29 INFO MLPipelineStageBenchmarkable: model: quantileDiscretizer_f09114de4da0
[error] 18/05/02 16:59:29 INFO MLPipelineStageBenchmarkable: com.databricks.spark.sql.perf.mllib.MLPipelineStageBenchmarkable@590f3ff2 doBenchmark: Trained model in 0.044 s, Scored training dataset in 0.0 s, test dataset in 0.0 s
[error] 18/05/02 16:59:29 INFO MapPartitionsRDD: Removing RDD 1228 from persistence list
[error] 18/05/02 16:59:29 INFO BlockManager: Removing RDD 1228
[error] 18/05/02 16:59:29 INFO MapPartitionsRDD: Removing RDD 1244 from persistence list
[error] 18/05/02 16:59:29 INFO BlockManager: Removing RDD 1244
[error] 18/05/02 16:59:29 INFO BlockManagerInfo: Removed broadcast_469_piece0 on 127.0.0.1:55023 in memory (size: 3.7 KB, free: 2004.6 MB)
[error] 18/05/02 16:59:29 INFO BlockManagerInfo: Removed broadcast_468_piece0 on 127.0.0.1:55023 in memory (size: 10.1 KB, free: 2004.6 MB)
[error] 18/05/02 16:59:29 INFO BlockManagerInfo: Removed broadcast_472_piece0 on 127.0.0.1:55023 in memory (size: 9.3 KB, free: 2004.6 MB)
[error] 18/05/02 16:59:29 INFO BlockManagerInfo: Removed broadcast_471_piece0 on 127.0.0.1:55023 in memory (size: 3.7 KB, free: 2004.6 MB)
[info] Execution time: 0.044s
[error] 18/05/02 16:59:29 INFO BlockManagerInfo: Removed broadcast_470_piece0 on 127.0.0.1:55023 in memory (size: 10.1 KB, free: 2004.6 MB)

WeichenXu123 commented 6 years ago

"the test does not seem to pick up the relativeError default value"

I don't think that is the case. Even if relativeError is not set explicitly in the YAML config file, the default value is still used; it just isn't printed in the log.
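
To illustrate the distinction between what is logged and what is actually used, here is a minimal Python sketch. It is not the spark-sql-perf code: the function names and the dictionary layout are hypothetical. The only facts assumed from Spark are the documented QuantileDiscretizer defaults (numBuckets = 2, relativeError = 0.001). The idea is that a harness which logs only explicitly configured params will not show relativeError, while the estimator still falls back to its own default.

```python
from typing import Any, Dict

# Stand-in for the estimator's built-in defaults (values as documented for
# Spark's QuantileDiscretizer; the dict itself is a hypothetical construct).
ESTIMATOR_DEFAULTS: Dict[str, Any] = {"numBuckets": 2, "relativeError": 0.001}


def effective_params(yaml_params: Dict[str, Any]) -> Dict[str, Any]:
    """Overlay the explicitly configured params on top of the defaults."""
    unknown = set(yaml_params) - set(ESTIMATOR_DEFAULTS)
    if unknown:
        raise KeyError(f"unsupported params: {sorted(unknown)}")
    merged = dict(ESTIMATOR_DEFAULTS)
    merged.update(yaml_params)
    return merged


def logged_params(yaml_params: Dict[str, Any]) -> Dict[str, Any]:
    """What a harness that logs only explicit settings would print."""
    return dict(yaml_params)


# Neither numBuckets nor relativeError is set in this hypothetical config.
config: Dict[str, Any] = {}
print("logged:   ", logged_params(config))
print("effective:", effective_params(config))
```

With an empty config nothing is logged, yet the effective params still contain both defaults, which matches the behavior described in the comment above.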

jkbradley commented 6 years ago

LGTM (Except that your other PR conflicted with this one)

WeichenXu123 commented 6 years ago

@jkbradley I would like to write a script that generates the MLResult class (in a separate PR). The copy method currently requires listing every param name twice, so it is easy to miss one and introduce an error.

jkbradley commented 6 years ago

@WeichenXu123 I agree that the MLResult class is becoming unwieldy. I wonder if using a Map or something would be good. E.g., the MLResult class could be backed with a simple Map. To constrain the possible keys, we could have a list of supportedParams (the single source of truth) which is checked whenever a Map entry is read or written. The Map could also support defaults. What do you think?

This PR LGTM and it ran for me locally. Thanks! Merging with master
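
The Map-backed idea suggested above could look roughly like the following Python sketch. This is an assumption-laden illustration, not the MLResult or MLParams implementation: the class name, the param names, and the default shown are all hypothetical. It keeps one list of supported keys as the single source of truth, checks that list on every read and write, supports defaults, and lets copy work without naming each param.

```python
from typing import Any, Dict


class MapBackedParams:
    """Param container backed by a plain dict (hypothetical sketch)."""

    # Single source of truth for the allowed keys (illustrative names).
    SUPPORTED_PARAMS = frozenset({"numExamples", "numBuckets", "relativeError"})
    # Optional defaults for a subset of the supported keys (illustrative value).
    DEFAULTS: Dict[str, Any] = {"relativeError": 0.001}

    def __init__(self, **values: Any) -> None:
        self._values: Dict[str, Any] = {}
        for name, value in values.items():
            self.set(name, value)

    @classmethod
    def _check(cls, name: str) -> None:
        # Reject any key that is not in the supported list.
        if name not in cls.SUPPORTED_PARAMS:
            raise KeyError(f"unsupported param: {name!r}")

    def set(self, name: str, value: Any) -> None:
        self._check(name)
        self._values[name] = value

    def get(self, name: str) -> Any:
        self._check(name)
        if name in self._values:
            return self._values[name]
        # Fall back to the default, or None when the param has no default.
        return self.DEFAULTS.get(name)

    def copy(self, **overrides: Any) -> "MapBackedParams":
        # No param is named here, so adding a new param cannot break copy.
        merged = dict(self._values)
        merged.update(overrides)
        return type(self)(**merged)


params = MapBackedParams(numBuckets=10)
updated = params.copy(numExamples=100)
print(updated.get("numBuckets"), updated.get("numExamples"), updated.get("relativeError"))
```

Because copy only merges dictionaries, a misspelled or forgotten param name surfaces as a KeyError at the point of use instead of being silently dropped.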