awslabs / damo

DAMON user-space tool
https://damonitor.github.io/
GNU General Public License v2.0

The heatmap is too dark #88

Closed Yitrus closed 4 months ago

Yitrus commented 6 months ago

Hi, I have sampled approximately 20 gigabytes of physical addresses. I have attempted to modify the sampling range and time intervals, but the resulting heat maps are mostly black, with the highest access count being 1. I would like to inquire if there are any techniques or recommendations for optimizing the combination of these parameters. Your expertise on this matter would be greatly appreciated. Thank you!

sj-aws commented 6 months ago

Hi Yitrus,

We don't have formal guidelines or techniques for such tuning at the moment, though we would like to provide them in the future. Let me share my humble suggestions.

I think there could be three possible reasons for the dark heatmap.

  1. Your workload is not actually making the accesses

If your workload is not accessing memory as much as you expect, the dark heatmap might simply be accurate. You could test whether this is the case by investigating your workload's source code further, or by trying an artificial workload that has a clear access pattern, such as masim, which the 'getting started' section of this repo's README uses.

  2. Your workload has only a small working set

The current implementation of DAMON checks accesses via the page table's accessed bit. The CPU might not set the bit if the access can be completed without accessing the page table. For example, if the access was served by a TLB hit, the page table might not need to be accessed. Hence, if your workload has only a small working set and can therefore complete almost every access without touching the page table, DAMON may not see the accesses, which results in the dark heatmap. You could check whether this is the case by increasing the working set of your workload, or by using an artificial workload with a known access pattern.

  3. Your workload accesses the memory slowly relative to your parameters

If neither of the above is the case, your aggregation interval might be too small. Suppose your workload has a 5 GiB memory region and repeatedly accesses every byte of it, at a speed of 1 second per 1 GiB. For example,

byte regions[5][1024*1024*1024];

while (1) {
    for (i = 0; i < 5; i++) {
        access_all_bytes_once(regions[i]);   /* this takes 1 second */
    }
}

If your sampling interval is one second and your aggregation interval is five seconds, the heatmap will show the 5 GiB region as having access count 1. E.g.,

11111
11111
11111
11111

If your sampling interval is one second and your aggregation interval is ten seconds, the access count may increase. E.g.,

22222
22222

If your aggregation interval is less than four seconds, the heatmap will show many regions of access count zero.

You could check whether your aggregation interval is set improperly by testing multiple aggregation intervals and seeing whether the heatmap changes.

Yitrus commented 6 months ago

Hi SeongJae,

Thank you for your valuable advice; your explanation was very clear. I followed your suggestion to increase the aggregation interval, and as a result, the heatmap appears less dark than before.

I took the opportunity to compare the heatmaps generated from physical and virtual address sampling using masim. As expected, I observed some similarities. I selected a System RAM region (corresponding to my NUMA node 0) of approximately 20G from /proc/iomem. However, when sampling using the --regions option, the vertical axis of the resulting heatmap covers only 10G.

May I inquire about the reason behind this? Is it because roughly half of the addresses were not accessed, resulting in the heatmap displaying only a portion of the region? Or are there some pages that cannot be accessed, leading to only half of the data being represented in the heatmap? The occurrence of roughly half might be coincidental, and I plan to conduct further experiments to explore this phenomenon.

sj-aws commented 6 months ago

Hi Yitrus,

> the heatmap appears less dark than before

Glad to hear that my humble comments were somewhat useful.

> the vertical coordinate of the resulting heatmap is only 10G.

Interesting... Could you please share detailed reproduction steps of the issue including the specific commands you used? Also, could you please run damo status at least once while the monitoring/recording is ongoing, and share the output?