Raw Data? - GitHub issues

jlouis commented 12 years ago

Eric, I see you have a branch which is all about producing raw data metrics from the system. I would really like to get hold of those data, but I'll wait until you have settled on a format. The reason is that I want to plot stuff in R first rather than doing statistics on it. But I see you are changing the format so I can't really get to work until you have decided upon the format in question :)

And out of curiosity, what are you using LevelDB for here?

ericmoritz commented 12 years ago

I used LevelDB for collecting the event data. I've written a script that converts the binary Erlang terms into CSV rows. The resulting data is around 3 gigs. Plotting in R was the sole reason why I wrote it. I'll get you a link to the raw data today or tomorrow. It takes a minute or two to generate and there are 10 result sets.

ericmoritz commented 12 years ago

I'm generating it now. I'm going to lunch with the family. I'll put it up on Dropbox sometime today.

jlouis commented 12 years ago

Awesome! I like the work done here.

ericmoritz commented 12 years ago

I'm uploading it to Dropbox now. The CSV file compressed really well. It's only a ~500 MB bz2.

ericmoritz commented 12 years ago

I am rerunning the node ws test with the new clustering. I'm going to add that to the bz2 with the old one. Please hold.

ericmoritz commented 12 years ago

I'm not including that new benchmark; the new instance's performance characteristics are different. Here is the description of the fields.

timestamp, "erlang-cowboy", client_id, event_id, event

Facepalm. The second column should be "type". You'll have to fix that. You probably don't want to wait for me to regenerate the data, and it's pretty trivial to rename a column in R.

Anyhow, these are the definitions:

timestamp - the time at which the event occurred
"erlang-cowboy" (aka type) - the server type. This header is going to be different for each file.
client_id - a string that identifies the client making the request
event_id - a string that matches similar events
event - the type of event, one of "ws_init", "ws_onevent", "send_message", "recv_message" and "'EXIT'"

For the "'EXIT'" event, I used the crash reason as the event_id. It is still unique for each crash event; it simply serves a dual purpose for crash events. That will help you differentiate what kind of crash it is. I suppose I could have made the event type "{'EXIT', connection_timeout}", but what's done is done.
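As a rough illustration of the layout described above, here is a minimal Python sketch that reads an event file, ignores the mislabeled header in favor of the intended column names, and tallies crash events by the reason stored in event_id. The function names and the sample rows are made up for the example, and it assumes the event column holds the literal text 'EXIT' (single quotes included) for crashes; the real files may differ.

```python
import csv
import io
from collections import Counter

# Intended column names; the second one replaces the mislabeled header.
FIELDS = ["timestamp", "type", "client_id", "event_id", "event"]


def read_events(handle):
    """Yield one dict per event row, discarding the file's own header line."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header, whose second column is mislabeled
    for row in reader:
        yield dict(zip(FIELDS, row))


def crash_reasons(events):
    """Count crash events by reason (stored in event_id for 'EXIT' rows)."""
    # Assumption: crashes appear as the literal string 'EXIT' with quotes.
    return Counter(e["event_id"] for e in events if e["event"] == "'EXIT'")


# Made-up rows for illustration only; not taken from the real dump.
SAMPLE = """timestamp,erlang-cowboy,client_id,event_id,event
1,erlang-cowboy,c1,e1,ws_init
2,erlang-cowboy,c2,connection_timeout,'EXIT'
"""

events = list(read_events(io.StringIO(SAMPLE)))
reasons = crash_reasons(events)
print(reasons)
```

For a real file, pass `open(path, newline="")` instead of the `io.StringIO` wrapper.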

I started writing an R script using precompiled CSV tables based on the event data. You can find that script in priv/. It's probably pretty terrible R code.

The compiled data, which is only 124M, is here:

https://www.dropbox.com/s/e9ygu4z8oyr9hla/compiled-data-20120616.tar.bz2

I've precomputed the connection times and message latencies, as well as summed up all the counts. That may save you some time. It is also what I was using for the R script (which is probably useless to you).
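For anyone who would rather derive latencies from the event dump than use the precomputed tables, one possible approach is sketched below in Python. It is not necessarily how the compiled tables were built. It assumes that a "send_message" and its "recv_message" share the same event_id, that timestamps are numeric, and that rows arrive in chronological order; none of these is confirmed in the thread. The function name and the sample events are hypothetical.

```python
def message_latencies(events):
    """Return {event_id: recv_timestamp - send_timestamp} for matched pairs."""
    sent = {}       # event_id -> timestamp of the pending send_message
    latencies = {}  # event_id -> elapsed time until the matching recv_message
    for e in events:
        ts = float(e["timestamp"])  # assumption: timestamps parse as numbers
        if e["event"] == "send_message":
            sent[e["event_id"]] = ts
        elif e["event"] == "recv_message" and e["event_id"] in sent:
            latencies[e["event_id"]] = ts - sent.pop(e["event_id"])
    return latencies


# Made-up events for illustration only.
sample_events = [
    {"timestamp": "10.0", "event_id": "m1", "event": "send_message"},
    {"timestamp": "10.5", "event_id": "m1", "event": "recv_message"},
    {"timestamp": "11.0", "event_id": "m2", "event": "send_message"},
]

latencies = message_latencies(sample_events)
print(latencies)
```

Sends without a matching receive (like `m2` in the sample) are simply left out of the result.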

Here's a link to the event dump; it's 515M:

https://www.dropbox.com/s/sk9ilysejmnk3dn/results-20120616.tar.bz2

It would be awesome if you sent a pull request with your R scripts when you're done.

Also, throw out the Ruby data. The server crashed quickly with a seg fault, but I didn't want to debug the cause while I was in the middle of running the benchmark. I wouldn't feel right if you included it with the rest, as it was probably a bad library rather than a performance issue.

Have fun.

Eric.

ericmoritz commented 12 years ago

Also, each table is broken up by server type but because the type is in each row, you can easily cat the files together if you pop off the header. I figured it was easier to cat a file rather than split up one giant file.
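The "cat the files together if you pop off the header" step could look roughly like the Python sketch below. It keeps the header from the first file only and appends the data rows of every file. The function name and the temporary file names are made up for the example. Note that for the event dump the first file's header still carries the mislabeled second column, so it would need renaming afterwards.

```python
import os
import tempfile


def concat_tables(paths, out_path):
    """Concatenate CSV files, keeping only the first file's header line."""
    header_written = False
    with open(out_path, "w", newline="") as out:
        for path in paths:
            with open(path, newline="") as src:
                header = src.readline()  # pop the header off every file
                if not header_written:
                    out.write(header)    # but keep it once, from the first
                    header_written = True
                for line in src:
                    out.write(line)


# Demo with two tiny made-up tables written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "server-a.csv")
    b = os.path.join(tmp, "server-b.csv")
    out = os.path.join(tmp, "all.csv")
    with open(a, "w", newline="") as f:
        f.write("timestamp,type\n1,server-a\n")
    with open(b, "w", newline="") as f:
        f.write("timestamp,type\n2,server-b\n")
    concat_tables([a, b], out)
    with open(out, newline="") as f:
        combined = f.read()

print(combined)
```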

jlouis commented 12 years ago

Awesome work! I'll grab it during the day and then have a look at it later when I have some time. I'll probably fork the repo here for the R graphs and stuff.

ericmoritz commented 12 years ago

Thanks, I appreciate it.

ericmoritz / wsdemo

Raw Data? #21