avulanov / ann-benchmark

Benchmarks of artificial neural network library for Spark MLlib
Apache License 2.0

ann benchmark sampling logic #2

Closed weidezhang closed 9 years ago

weidezhang commented 9 years ago

Hi,

I'm reading the Spark version of your ann-benchmark. In the following code, shouldn't the sampling be done separately for every node? It seems you sample only once, so every node shares the same sample data.

val sample = train.sample(true, 1.0 / i, 11L).collect
val parallelData = dataPartitions.flatMap(x => sample)

avulanov commented 9 years ago

Hi, @weidezhang. The code samples 1/i of the data and creates a dataset where each node has the same 1/i of data. This is to guarantee that data is distributed evenly across all i nodes. It is important for measuring the throughput.
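
For illustration, the replication described above can be sketched without Spark, using plain Scala collections. This is only a rough sketch under stated assumptions: `sampleWithReplacement` and `replicate` are hypothetical helpers, not part of ann-benchmark or Spark, and the sampler draws an exact number of records, whereas Spark's `sample` with replacement returns only approximately the requested fraction. It also assumes `dataPartitions` in the snippet above holds one element per node, which the thread does not show.

```scala
import scala.util.Random

object ReplicatedSampleSketch {
  // Hypothetical stand-in for RDD.sample(withReplacement = true, fraction, seed):
  // draws round(fraction * n) records uniformly at random, with replacement.
  def sampleWithReplacement[T](data: Vector[T], fraction: Double, seed: Long): Vector[T] = {
    val rng = new Random(seed)
    val n = math.round(data.size * fraction).toInt
    Vector.fill(n)(data(rng.nextInt(data.size)))
  }

  // Hypothetical stand-in for dataPartitions.flatMap(x => sample):
  // every one of the `nodes` partitions receives the same copy of the sample.
  def replicate[T](sample: Vector[T], nodes: Int): Vector[Vector[T]] =
    Vector.fill(nodes)(sample)

  def main(args: Array[String]): Unit = {
    val train = (1 to 1000).toVector // toy training set (assumption)
    val i = 4                        // number of nodes (assumption)
    val sample = sampleWithReplacement(train, 1.0 / i, 11L)
    val partitions = replicate(sample, i)
    // Every partition holds the same number of records, so the load is even.
    partitions.zipWithIndex.foreach { case (p, idx) =>
      println(s"partition $idx: ${p.size} records")
    }
  }
}
```

Because every partition holds an identical copy, each node processes the same number of records, and the total stays close to the size of the original dataset as `i` grows.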

weidezhang commented 9 years ago

Thanks, Alexander. That makes sense.