TimelyDataflow / timely-dataflow

A modular implementation of timely dataflow in Rust

BigData series #19

kali opened 8 years ago

kali commented 8 years ago

Hey,

I wanted to let you know there is an early version of the part 5 of my big data series here: https://github.com/kali/kali.github.io/blob/master/_drafts/Embrace_the_glow_cloud.md

The results graphs are here: https://github.com/kali/kali.github.io/blob/master/assets/2016-02-11-m3xl.png https://github.com/kali/kali.github.io/blob/master/assets/2016-02-11-m32xl.png https://github.com/kali/kali.github.io/blob/master/assets/2016-02-11-c38xl.png

I am a bit puzzled that the huge c3.8xl instances perform so badly in the networked configuration (probing shows the network is fairly good, 6 or 7 GB/s, whether or not the instances are started in the same "placement group").

I will also investigate whether my use of ansible to start the processes skews their start times enough to prevent getting under the 7 or 8 second limit. I'll try to find some trick to start all instances more simultaneously.

Anyway. If you want to shed some wisdom, feel free to.

frankmcsherry commented 8 years ago

Cool!

I'm not really sure about the c3.8xl instances. I wouldn't be surprised if you came back and said "turns out XYZ is bad" where XYZ is my fault, but nothing leaps out at me mentally. Am I reading the data right that they are alarmingly fast on single-node? It looks like 20s or so, right? Or are the 0, 1 measurements something I don't understand?

The others seem to scale kind-of well, though, right? Up to some point. If you have the option, one great chart to plot is the y-axis = $s = time * rate plot, so folks can see where the right "price point" is. Or maybe just: "for how long is the line flat-ish", if there isn't super-linear scaling.
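For concreteness, a minimal sketch of how that y-axis value could be computed per data point, assuming simple per-instance hourly billing (the rate below is just an illustrative number, not a quoted AWS price):

```rust
/// Dollars spent on one run: instances * hourly rate * hours.
/// Both this helper and the example rate are illustrative assumptions.
fn run_cost_dollars(instances: u32, hourly_rate: f64, runtime_secs: f64) -> f64 {
    instances as f64 * hourly_rate * (runtime_secs / 3600.0)
}

fn main() {
    // e.g. 4 instances at an assumed $0.266/hour finishing in 75 seconds
    println!("${:.4}", run_cost_dollars(4, 0.266, 75.0));
}
```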

The execute closure shouldn't (I think) start running until all point-to-point network connections are established, so if you do a Rust style timer in there, you could get a better sense for whether the 7-8s is start-up skew, or if it just takes that long. I think there are probably scaling issues yet to solve with timely, so it wouldn't be outrageous for it not to scale indefinitely.
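For instance, a minimal sketch of such a timer, assuming the execute_from_args entry point and with the dataflow body elided:

```rust
use std::time::Instant;

fn main() {
    // The closure only runs once the point-to-point connections are
    // established (as described above), so the elapsed time printed here
    // excludes process-launch skew and connection setup.
    timely::execute_from_args(std::env::args(), |worker| {
        let start = Instant::now();
        let index = worker.index();

        // ... build the dataflow and step the worker to completion here ...

        println!("worker {} spent {:?} inside the closure", index, start.elapsed());
    })
    .unwrap();
}
```

Comparing the per-worker elapsed time against the wall-clock time measured from outside (e.g. by ansible) should show how much of the 7-8s is launch skew versus actual work.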

But, very neat that you are doing this!

Just as a warning, a PR just landed for timely that tweaks a few unary interfaces. It fixes up (we think) the issue you had early on with being allowed to mis-use the notifications and such, but at the same time it might break some code (it should break your old code). Andrea Lattuada (utaal on github) is on call to help fix breakage, but I'm happy to help out too if you see issues (or you could not re-pull before you finish your measurements).

frankmcsherry commented 8 years ago

One possible idea about the c3.8xl: how many workers are you starting up for each? It is great to leave a few cores free for managing the network connections; as all workers busy-wait, if there are 32 worker cores producing data that the networking threads fall behind on, that could be a cause of problems. Just guessing here, though. :)

kali commented 8 years ago

Yes, I think you're reading the plot right. The c3.8xl instances are fast standalone, but they are huge beasts with 36 cores, while the m3.xl only has 4... So 16s compared to the 75s of the m3.xl is not that impressive. Disk speed does not scale up in the same proportion, though, so they may actually be disk-bound; that should disappear as we scale out, since each instance gets less and less data to load.

I'll try to plot time*rate, that's very relevant indeed.

I'll try to record a "started" timestamp somewhere to measure the skew.

I have used 2*num_cpus as the number of workers everywhere so far. Early measurements on the laptop did not show the parameter had much of an impact, but I have not tested it in distributed mode. I'll add that to the list :)

Thanks !

frankmcsherry commented 8 years ago

I'd give num_cpus - 2 a try, just to leave a few cores open for networking send/recv threads. At least, try something like that in your mix. :)
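A small sketch of that heuristic, assuming the resulting worker count is then handed to each process via the -w flag that execute_from_args parses:

```rust
// Uses the `num_cpus` crate, which the 2 * num_cpus setting above
// presumably refers to.
fn main() {
    let cores = num_cpus::get();
    // Leave a couple of cores free for timely's network send/recv threads,
    // but never drop below one worker.
    let workers = cores.saturating_sub(2).max(1);
    println!("suggested worker count: {}", workers);
}
```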

ms705 commented 8 years ago

Neat results, thanks for doing this exploration! :)

One possible hypothesis for the c3 instance badness (completely speculative): Amazon sells these instances as "compute-optimized" and may thus assume that you use them for largely compute-bound workloads with local I/O. Hence, their network traffic may be de-prioritized, or their network bandwidth reduced compared to other instance types. (Also, at least a while ago, the standard network bandwidth on EC2 was 1 GBit/s, not 10 GBit/s.)

You could investigate this by running a couple of simple iperf and ping measurements between pairs of c3 and m3 instances and comparing the results.

(Another possibility is that the number of workers on each c3 instance is significantly larger than on the m3 instances, if I understand correctly, which may lead to the network connection being oversubscribed. I think we had some similar issues when trying timely on 24+-core machines while doing the PageRank experiments a while ago.)

One thing that helps debug timely's network and I/O performance is to run the collectl utility while running the query, and having a look at its output at a 1s or 0.1s sampling frequency. We used this method to produce the machine utilization time series plots in the PageRank experiments; there's a handy-if-hacky script here that produces these plots. The script takes space-separated collectl output as its input (I think).

Minor comment on the graphs (though I realize that they're in draft form): it'd be good to have y-axis and x-axis labels to indicate what's shown. I also think that you can get away with connecting the data points by a line (even though the x-axis metric isn't continuous) for readability. Alternatively, you could use a bar chart to make things a bit clearer.

kali commented 8 years ago

Thanks for the feedback. I have been able to try a few things already.

frankmcsherry commented 8 years ago

Neat! This is great to learn. I wonder if the worker count tweaks affect the m-class machines too? Probably less if so, but it could be non-trivial. It might open up a bit more scaling, too (maybe; who knows ;))

kali commented 8 years ago

I have not documented this in the blog post yet: for the 4-core m3.xlarge, the right number of workers is somewhere between 2 (for a large cluster) and 3. Adjusting it makes a big difference, so it's worth doing, but finding a rule of thumb will be tricky: it depends on the cluster size, which we expected, but it turns out the query variant in my exercise has an important impact too.

frankmcsherry commented 8 years ago

Heyo. So, over in https://github.com/frankmcsherry/timely-dataflow/issues/27 we are chatting about a generic aggregation operator, of the sort that you have gotten stuck writing each time. :)

If you have a chance, it would be great to get your thoughts on whether it looks like it would fit your requirements (modulo wanting to swap in replacements for HashMap and the aggregation infrastructure you've been building up).
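Not the API proposed in #27, but as a rough illustration of the per-worker group-and-fold pattern such an operator would encapsulate (all names here are made up for the sketch):

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Group records by key, then fold each group with a user-supplied function.
fn aggregate<K, V, A, F>(records: Vec<(K, V)>, init: A, fold: F) -> HashMap<K, A>
where
    K: Hash + Eq,
    A: Clone,
    F: Fn(&mut A, V),
{
    let mut groups: HashMap<K, A> = HashMap::new();
    for (key, val) in records {
        let acc = groups.entry(key).or_insert_with(|| init.clone());
        fold(acc, val);
    }
    groups
}

fn main() {
    // Example: count records per key.
    let counts = aggregate(vec![("a", 1), ("b", 1), ("a", 1)], 0u64, |acc, _| *acc += 1);
    assert_eq!(counts[&"a"], 2);
}
```

In the operator itself the map would be kept per epoch and flushed on notification, and HashMap could be swapped for whatever replacement ends up being used.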

frankmcsherry commented 5 years ago

Closing due to elapsed time, and likely resolution (at least, different issues).