problem with memory counting method ?

mvesin commented 3 years ago

Hi all, thanks for this great tool.

I observed strange results using experiment-impact-tracker: rapl_power_draw_absolute comes out lower than rapl_estimated_attributable_power_draw, although the attributable share should never exceed the absolute draw.

I suspect a problem in the attributable memory counting method:

the memory attributed to a process is computed as USS + PSS,
while it seems to me that it should be PSS only, since PSS already includes USS (see https://man7.org/linux/man-pages/man8/smem.8.html).

Can someone confirm / correct this statement ?
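To make the double counting concrete, here is a toy sketch of the definition in the smem man page: PSS is the process's unshared memory (USS) plus its proportional share of each shared mapping. The helper name `pss` and the byte counts are hypothetical, chosen only for illustration.

```python
def pss(private_bytes, shared_mappings):
    """Proportional set size: private bytes plus an even share of each shared mapping.

    shared_mappings is an iterable of (size_in_bytes, number_of_sharing_processes).
    """
    return private_bytes + sum(size / sharers for size, sharers in shared_mappings)


# Hypothetical process: 600 bytes private, plus one 400-byte mapping shared by 4 processes.
uss = 600
proportional = pss(uss, [(400, 4)])  # 600 + 400 / 4 = 700.0
double_counted = uss + proportional  # 1300.0: the 600 private bytes are counted twice

print(proportional, double_counted)
```

Since PSS is never smaller than USS, adding the two always overstates the process's footprint by exactly its USS.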

>>> from experiment_impact_tracker.compute_tracker import ImpactTracker
>>> import os
# add a breakpoint in code with : import pdb; pdb.set_trace()
>>> import experiment_impact_tracker.cpu.intel
>>> experiment_impact_tracker.cpu.intel.get_intel_power([42954])
{'rapl_power_draw_absolute': 51.95493564262022, 'rapl_estimated_attributable_power_draw': 57.03725903291969, 'cpu_time_seconds': {42954: OrderedDict([('user', 2599.63), ('system', 444.91), ('children_user', 0.17), ('children_system', 4.66), ('iowait', 0.0)])}, 'average_relative_cpu_utilization': 0.9966920738858607, 'absolute_cpu_utilization': 2.055019728531582, 'relative_mem_usage': 1.470720815077694, 'absolute_mem_usage': 13372609536.0, 'absolute_mem_percent_usage': 0.1341320368485436, 'mem_info_per_process': {42954: OrderedDict([('rss', 6764167168), ('vms', 46901260288), ('shared', 1675501568), ('text', 2330624), ('lib', 0), ('data', 42889551872), ('dirty', 0), ('uss', 6659334144), ('pss', 6713275392), ('swap', 0)])}}
>>>

Shouldn't relative_mem_usage be <= 1? Tracing further:

(Pdb) p system_wide_mem_percent
1.4931055744647492
(Pdb) p total_physical_memory
svmem(total=99697356800, available=90187993088, percent=9.5, used=7801208832, free=372105216, active=6481403904, inactive=90540580864, buffers=123908096, cached=91400134656, shared=1138413568, slab=670351360)
(Pdb) p mem_info_per_process
{42954: OrderedDict([('rss', 7177113600), ('vms', 46932361216), ('shared', 1675501568), ('text', 2330624), ('lib', 0), ('data', 42920652800), ('dirty', 0), ('uss', 7072268288), ('pss', 7126215680), ('swap', 0)])}

I patched this locally and the results now look much closer to what I expected (attributable power draw slightly under absolute power draw).
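The local patch itself is not shown, but the traced values can be checked by hand. The sketch below assumes the ratio is the attributed memory divided by the memory in use system-wide, taken as `total - available`; that denominator is an inference from the numbers in the pdb trace, not something confirmed from the library source.

```python
# Values copied from the pdb trace above.
total = 99697356800
available = 90187993088
uss = 7072268288
pss = 7126215680

# Assumption: "memory in use" is total - available (inferred from the trace, not from the code).
in_use = total - available

double_counted_ratio = (uss + pss) / in_use  # about 1.493, matches system_wide_mem_percent
pss_only_ratio = pss / in_use                # about 0.749, back under 1

print(double_counted_ratio, pss_only_ratio)
```

Under that assumption, USS + PSS reproduces the traced 1.4931 and PSS alone gives a ratio below 1, consistent with the attributable power draw dropping back under the absolute draw.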

Tested on :

Dell R740, dual Xeon Silver 4110, 3x Nvidia Tesla T4, 96GB, CentOS 7.6
Dell T640, dual Xeon Silver 4215, 4x Nvidia Quadro RTX6000, CentOS 7.6
In both cases: git clone of the master branch.

Breakend commented 3 years ago

This is totally correct. Apologies for the error. We've added a check in #42 which prevents this from happening silently again and fixed the USS + PSS logic to only rely on PSS if the system supports PSS.
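A minimal sketch of what "only rely on PSS if the system supports PSS" could look like, operating on a dict shaped like the mem_info_per_process entries above. The function name `attributable_memory_bytes` is hypothetical, and falling back to USS when PSS is missing is an assumption; the actual change in #42 may differ.

```python
def attributable_memory_bytes(mem_info):
    """Pick the bytes to attribute to one process from a psutil-style mapping.

    Uses PSS alone when it is reported, because PSS already contains USS.
    Assumption: when PSS is unavailable, fall back to USS.
    """
    pss = mem_info.get("pss")
    if pss is not None:
        return pss
    return mem_info["uss"]


# Entry taken from the pdb trace in this thread (only the relevant keys).
traced = {"rss": 7177113600, "uss": 7072268288, "pss": 7126215680}
print(attributable_memory_bytes(traced))

# Hypothetical platform that reports USS but not PSS.
print(attributable_memory_bytes({"rss": 7177113600, "uss": 7072268288}))
```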

Thanks so much for raising this issue.

Chris-Peterson444 commented 2 years ago

Hi, I'm getting a similar error where total_intel_power < total_attributable_power.

Trace as requested:

Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/chrisp44/.local/lib/python3.8/site-packages/experiment_impact_tracker-0.1.9-py3.8.egg/experiment_impact_tracker/utils.py", line 68, in process_func
    raise e
  File "/home/chrisp44/.local/lib/python3.8/site-packages/experiment_impact_tracker-0.1.9-py3.8.egg/experiment_impact_tracker/utils.py", line 62, in process_func
    ret = func(q, args, kwargs)
  File "/home/chrisp44/.local/lib/python3.8/site-packages/experiment_impact_tracker-0.1.9-py3.8.egg/experiment_impact_tracker/compute_tracker.py", line 161, in launch_power_monitor
    _sample_and_log_power(log_dir, initial_info, logger=logger)
  File "/home/chrisp44/.local/lib/python3.8/site-packages/experiment_impact_tracker-0.1.9-py3.8.egg/experiment_impact_tracker/compute_tracker.py", line 108, in _sample_and_log_power
    results = header["routing"]["function"](
  File "/home/chrisp44/.local/lib/python3.8/site-packages/experiment_impact_tracker-0.1.9-py3.8.egg/experiment_impact_tracker/cpu/intel.py", line 88, in get_intel_power
    return get_rapl_power(pid_list, logger, kwargs)
  File "/home/chrisp44/.local/lib/python3.8/site-packages/experiment_impact_tracker-0.1.9-py3.8.egg/experiment_impact_tracker/cpu/intel.py", line 564, in get_rapl_power
    raise ValueError(
ValueError: For some reason the total intel estimated power is less than the attributable power. This means there is an error in computing the attribution. Please re-open https://github.com/Breakend/experiment-impact-tracker/issues/38 and add the trace for this warning.

This seems to be due to the fact that cpu_percent ends up being slightly greater than 1 (in my case it's usually around 1.005).

My quick local fix is to cap power_credit_cpu at 1.0, although that doesn't address the potential over-estimation problem. The error only seems to happen when there's not much else running on the machine.
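The cap described above can be sketched as a simple clamp. The helper name `cap_credit` is hypothetical, and the lower bound of 0.0 is an extra assumption not mentioned in the thread; as noted, this only hides the overshoot rather than correcting its cause.

```python
def cap_credit(credit, upper=1.0):
    """Clamp a power-credit fraction into [0.0, upper].

    Workaround only: a value such as 1.005 is truncated to 1.0, so any
    over-estimation that produced it is masked, not fixed.
    """
    return min(max(credit, 0.0), upper)


power_credit_cpu = cap_credit(1.005)  # slightly above 1, as observed in this thread
print(power_credit_cpu)
```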
