distributed-system-analysis / pbench

A benchmarking and performance analysis framework
http://distributed-system-analysis.github.io/pbench/
GNU General Public License v3.0

pidstat: make default collection interval larger. #700

Open ndokos opened 7 years ago

ndokos commented 7 years ago

@ekuric ran into problems when post-processing pidstat data after a long run: the pidstat post-processing script grew to 32GB of memory and the system started OOM-killing things.

mrsiano commented 5 years ago

@ekuric @ndokos @atheurer guys, it looks like the post-processing script will hold a large amount of memory when parsing a large result set. In our case, if pidstat-stdout.txt is 32GB, the script populates the $pidstats var [1] with the same amount of data and eventually sends it to the gen_data func [2]; up until that point the script's memory footprint is equivalent to the pidstat file, since it only changes the format.

Another alternative we can consider: call gen_data inside the while $line loop, and, more importantly, load just a portion of the file rather than the entire file into memory.

[1] https://github.com/distributed-system-analysis/pbench/blob/19fa2ad408214df97fab06da1d03aea020590299/agent/tool-scripts/postprocess/pidstat-postprocess#L31

[2] https://github.com/distributed-system-analysis/pbench/blob/19fa2ad408214df97fab06da1d03aea020590299/agent/tool-scripts/postprocess/pidstat-postprocess#L183

mrsiano commented 5 years ago

@atheurer @ekuric is it possible to build the csv file in stages, something like:

my $lastchunk = '';   # holds any partial line left over from the previous chunk
my $buffer = '';
open(my $fh, '<', $filename) or die "Can't open `$filename': $!";
while (sysread($fh, $buffer, 4096)) {
    # do the parsing on this chunk..
    ...
    # do the gen_data in portions.
    gen_data(\%pidstat, \%graph_type, \%graph_threshold, $dir);
}
close($fh);
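
A minimal line-oriented variant of the same idea, as a sketch only: it assumes pidstat's usual report format, where sample intervals are separated by blank lines, and flush_interval is a hypothetical stand-in for the per-interval gen_data call.

#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: in the real script this would build the rows for one
# interval and hand them to gen_data(), so memory stays bounded to a single
# sample interval rather than the whole file.
sub flush_interval {
    my ($interval) = @_;
    # e.g. gen_data($interval, \%graph_type, \%graph_threshold, $dir);
}

my $filename = shift or die "usage: $0 <pidstat-stdout.txt>\n";
open(my $fh, '<', $filename) or die "Can't open `$filename': $!";

my %pidstat;
while (my $line = <$fh>) {
    chomp $line;
    if ($line =~ /^\s*$/) {
        # pidstat separates sample intervals with blank lines; flush here.
        flush_interval(\%pidstat) if %pidstat;
        %pidstat = ();
        next;
    }
    # ...parse $line into %pidstat, as pidstat-postprocess does today...
}
flush_interval(\%pidstat) if %pidstat;
close($fh);
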
portante commented 5 years ago

It seems like gen_data needs to take some kind of on-disk intermediate form from each post-processing script in the short term to address the memory consumption. Having it take in a Perl variable containing all of the data in memory is not really scalable.

Also, we need to consider NOT doing this kind of post-processing in the agent at all. Having the agents just collect the data and any necessary metadata would make the agents much leaner and less memory- and CPU-intensive.

The post-processing step can then be done server side, with all the right context, so that it can be handled efficiently.

Having the agent be responsible for this processing means that we have to update the agent to get the data processing right. If we keep the agents simple, then we can fix data processing problems like this on the server side without having to require updates to the agent.

ekuric commented 5 years ago

The post-processing step can then be done server side with all the right context so that it can be handled efficiently.

I think it would be even better if post-processing had the option to be moved to the server side.

atheurer commented 5 years ago

Server side works if you are certain you always have a server in your infra (which we do, but others may not). I am not against server side, but for the short term it should be really easy to make gen_data use a lot less memory, most likely by calling it for each graph instead of passing a hash of all the graphs' data.

atheurer commented 5 years ago

I will look at the gen-data memory optimization when I am back from PTO next week.

portante commented 5 years ago

Don't we always want to keep the agents dead simple?

So if we have a simple agent that just collects and does not process data, can't we provide to those that don't have a server side solution a way to do what the server would do locally so that we don't make the agent complicated for all?

portante commented 5 years ago

[From an email from @mrsiano]

Regarding PR #1015, I've tried to come up with some POC which will help us to improve post-processing response time and space complexity.

After some discussion with @atheurer, we should improve this post-processing from another point of view. I've created a small POC to introduce another chunking solution; in that approach we chunk each and every pidstat command interval.

This approach was initially introduced for the jmap tool.

It has very low overhead, and the initial output can be a CSV.

Then all we need to care about in the post-processing stage is just rotating the command data from rows to columns.
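
A rough sketch of that rotation, assuming the chunked output is simple "timestamp,pid,value" rows (the field layout here is illustrative, not the format from the PR):

#!/usr/bin/perl
use strict;
use warnings;

# Illustrative only: pivot "timestamp,pid,value" rows into a CSV with one
# column per PID, i.e. the rows-to-columns rotation described above.
my (%by_ts, %pids);
while (my $line = <STDIN>) {
    chomp $line;
    my ($ts, $pid, $value) = split /,/, $line;
    next unless defined $value;
    $by_ts{$ts}{$pid} = $value;
    $pids{$pid} = 1;
}

my @pids = sort { $a <=> $b } keys %pids;
print join(',', 'timestamp', @pids), "\n";
for my $ts (sort { $a <=> $b } keys %by_ts) {
    print join(',', $ts, map { $by_ts{$ts}{$_} // '' } @pids), "\n";
}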

PR for poc pidstat new gen

I think about using different methods on top of this change:

  1. consume and parse pidstat (it hardly does any parsing at the tool-script level)
  2. do the parsing in post-processing, but chunk the data per metric on the agent side as introduced at https://gist.github.com/mrsiano/cc01500b9bc0ff7465ac9445ace52699
  3. same as bullet 2, but send the pidstat data to some parser server (Java / Python); I already have something to go with from another project, a stdin consumer that dumps results directly into a Prometheus instance, since every CSV's primary key is the timestamp.

The time complexity I introduced is quite bad; we can get better execution time, and the best case is to run in O(n) across all timestamps (one per pidstat command execution).

Single pidstat + parsing will cost ~1.6 sec.

Eventually, all we need to do is connect the html page with the relevant csv, and set the right threshold (some post-process effort).

All of the above does not need any large-file parsing, just ongoing appending / parsing in very reliable chunks (note: again, a chunk means one pidstat cmd execution).

portante commented 5 years ago

[From an email from @atheurer]

@mrsiano, I have been playing with some optimizations during pidstat execution. There are two:

  1. pidstat broken out into per-PID-files at runtime
  2. within these files, stats that do not change in value are not written to the file again -- if every single value is the same, even the commas are not written for that line

These are done in a Perl script whose STDIN is the STDOUT from pidstat. Here is an example of a per-PID file:

1547645012,0.00,0.00,0.00,0.00,0.00,3,0.00,0.00,13216,812,0.00,0.00,0.00,0.00,0,1.94
1547645013,,,,,,1,,,,,,,,,,4.00
1547645014,,,,,,0,,,,,,,,,,3.00
1547645015,,,,,,1,,,,,,,,,,2.00
1547645016,,,,,,0,,,,,,,,,,3.00
1547645017,,,,,,1,,,,,,,,,,5.00
1547645018,,,,,,0,,,,,,,,,,3.00
1547645019,,,,,,,,,,,,,,,,5.00
1547645020,,,,,,,,,,,,,,,,3.00
1547645021,,,,,,,,,,,,,,,,4.00
1547645022,,,,,,,,,,,,,,,,3.00
1547645023,,,,,,1,,,,,,,,,,
1547645024,,,,,,0,,,,,,,,,,
1547645025,,,,,,4,,,,,,,,,,5.00
1547645026,,,,,,0,,,,,,,,,,3.00
1547645027
1547645028,,,,,,,,,,,,,,,,2.00
1547645029,,,,,,1,,,,,,,,,,3.00
1547645030,,,,,,0,,,,,,,,,,
1547645031
1547645032
1547645033,,,,,,1,,,,,,,,,,
1547645034,,,,,,0,,,,,,,,,,

This reduces the space needed for pidstat by over 90%.
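
For reference, a minimal sketch of such a refactor filter. Assumptions: each data line on STDIN looks like "timestamp pid v1 v2 ...", the file names are made up here, and the real script may differ. It keeps one open filehandle per PID and blanks out values that did not change since the previous sample.

#!/usr/bin/perl
use strict;
use warnings;

# Illustrative sketch: STDIN is pidstat's STDOUT, each data line assumed to
# be "timestamp pid v1 v2 ...". One CSV file per PID, filehandles cached so
# files are not reopened for every sample, and values identical to the
# previous sample are written as blanks.
my $dir = shift // '.';
my (%fh, %last);

while (my $line = <STDIN>) {
    chomp $line;
    my ($ts, $pid, @values) = split ' ', $line;
    next unless defined $pid && $pid =~ /^\d+$/ && @values;

    if (!$fh{$pid}) {
        open($fh{$pid}, '>', "$dir/pidstat-$pid.csv")
            or die "Can't open $dir/pidstat-$pid.csv: $!";
    }

    my @out;
    for my $i (0 .. $#values) {
        my $prev = $last{$pid}[$i];
        # Emit the value only when it differs from the previous sample.
        push @out, (defined $prev && $prev eq $values[$i]) ? '' : $values[$i];
    }
    $last{$pid} = [@values];

    my $row = join(',', @out);
    if ($row =~ /^,*$/) {
        # Nothing changed at all: drop the commas too, as in the bare
        # "1547645027" line in the example above.
        print { $fh{$pid} } "$ts\n";
    } else {
        print { $fh{$pid} } "$ts,$row\n";
    }
}
close($_) for values %fh;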

We'll need to update the post-processing to interpret this and process 1 pid at a time, and also fix GenData.pm and the graphing scripts to use the file format we talked about (1 data series per line).

@portante, since we have a "begin" and "end" in the CDM metric docs, and we'll know if a stat did not change for X time, this will significantly reduce the number of docs for pidstat. We should use the same approach for any other metrics we have: where contiguous samples have the same value, we submit 1 metric with a "begin" and "end" to match.
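
A small sketch of that collapsing idea (illustrative only; the record layout below is not the actual CDM document format): contiguous samples with the same value become one record carrying a begin and end timestamp.

#!/usr/bin/perl
use strict;
use warnings;

# Illustrative run-length collapse: consecutive samples with the same value
# are merged into a single record with "begin" and "end" timestamps.
sub collapse_samples {
    my @records;
    for my $sample (@_) {               # each sample is [timestamp, value]
        my ($ts, $val) = @$sample;
        if (@records && $records[-1]{value} eq $val) {
            $records[-1]{end} = $ts;    # extend the current run
        } else {
            push @records, { begin => $ts, end => $ts, value => $val };
        }
    }
    return @records;
}

# Four samples collapse into two records.
my @recs = collapse_samples([1547645012, '3.00'], [1547645013, '3.00'],
                            [1547645014, '3.00'], [1547645015, '5.00']);
printf "begin=%d end=%d value=%s\n", @{$_}{qw(begin end value)} for @recs;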

portante commented 5 years ago

[From an email from @mrsiano]

Hi @atheurer,

That's exactly what I introduced, but broken out into a file per metric. I did some tests, and it works for me too.

We can take this to a more advanced stage for the long term, per what I mentioned in bullet 3.

atheurer commented 5 years ago

Don't we always want to keep the agents dead simple?

So if we have a simple agent that just collects and does not process data, can't we provide to those that don't have a server side solution a way to do what the server would do locally so that we don't make the agent complicated for all?

I think part of the problem here is that the data size for native pidstat output is so large and inefficient that at least some minor refactoring of the data at collection time is warranted. So all I would like to accomplish is getting the data organized by PID, not combined, and eliminating the redundant data (subsequent samples with the same value). I do -not- want to do any post-processing that is intended for graphing during the pidstat execution. A simple pipe of pidstat to an output-refactor script is all I want.

I would prefer we not fork+exec pidstat for every single sample collection; this seems inefficient from a CPU perspective. Similarly, I would prefer we not open and close a bunch of files for each sample.

Regarding @mrsiano's point number 3, we do not want any special external parser server for this. We want all post-processing contained within pbench-agent (today) and pbench-server (future). Our datastore is CSV on a web server (today) and Elasticsearch (future).

atheurer commented 5 years ago

[From an email from @mrsiano]

Hi @atheurer,

That's exactly what I introduced, but broken out into a file per metric. I did some tests, and it works for me too.

Is your output per-PID-per-metric, or just per-metric? I chose per-PID because the chunks are then much, much smaller than per-metric, and so the memory needed in the post-processing will be very small.

Are you eliminating the redundant data?

portante commented 5 years ago

@atheurer, the approach to process pidstat data into separate .csv files per-PID seems like the best approach to take. It avoids super large datasets, and fits with the way the metrics are going to be stored long term, all associated per-PID, since that is the identifier of the entity from which the metrics are gathered.

The one question about pidstat data which we might want to consider is what happens when PIDs get re-used over time? How will that affect this approach?

mrsiano commented 5 years ago

@atheurer in your example you capture only the PID without the command. We might want to do something like this: https://gist.github.com/mrsiano/cc01500b9bc0ff7465ac9445ace52699#file-pidstat-toolsctip-chunking-sh-L17

Some small formatting, which runs in essentially zero time, gives us a stdout file in good shape for post-processing. Also, it looks like this is what we are doing today (see the column names): https://github.com/distributed-system-analysis/pbench/blob/master/agent/tool-scripts/postprocess/gold/pidstat/csv/cpu_usage_percent_cpu.csv
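
If we go the per-PID-file route, one way to fold the command in (purely illustrative; the real column naming is whatever the gold CSV above uses) is to make the command part of the per-PID key:

#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical example: include the command in the per-PID identifier so the
# post-processed file/column names read like "<command>-<pid>" rather than a
# bare PID (the values below are placeholders).
my ($pid, $command, $dir) = (4321, 'qemu-kvm', '.');
(my $safe = $command) =~ s/[^\w.-]/_/g;     # keep the name filesystem-safe
my $key = sprintf '%s-%d', $safe, $pid;
print "$dir/pidstat-$key.csv\n";            # ./pidstat-qemu-kvm-4321.csv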