awslabs / damo

DAMON user-space tool
https://damonitor.github.io/
GNU General Public License v2.0

Can we make "damo status" faster? #79

Closed · honggyukim closed this issue 3 months ago

honggyukim commented 1 year ago

I sometimes want to monitor the output of damo show, but I feel it is quite slow.

From my experience, it takes around 10 seconds, but I'm just wondering if it's possible to make it faster.

$ time sudo ./damo status
      ...
real    0m10.172s
user    0m0.001s
sys     0m0.007s
honggyukim commented 1 year ago

I see that the reason was that I increased the number of regions from 10 to 100 or 1000.

However, even after rolling back to 10 regions, it still takes more than 5 seconds. I still feel it'd be helpful if it could be faster.

honggyukim commented 1 year ago

In addition, I see many "tried regions" even though they are going to be filtered out by the cgroup filter.

Can we also hide such filtered-out regions from "tried regions" in the damo show output?

sj-aws commented 1 year ago

Hi Honggyu, thank you for this report.

Expected high overhead mechanism

damo show uses the DAMON sysfs interface's DAMOS tried regions feature[1]. In detail, damo show ensures there is at least one monitoring DAMON scheme (a DAMON scheme having stat as the action and [min, max] as all of its access pattern ranges) per context (if a context doesn't have one, it installs a monitoring scheme), and asks DAMON to expose the detailed information of the monitored regions via the DAMOS tried regions feature. The DAMOS tried regions feature exposes the information by creating, per tried region, one directory and four files holding the information. Hence, the kernel-side operation is assumed to impose high overhead and take a long time when the number of tried regions is large, since it has to create a large number of directories and files.
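
For reference, the per-region information is exposed roughly like the sketch below (paths as described in the DAMON sysfs documentation; the exact layout can differ slightly across kernel versions):

/sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0/tried_regions/
    0/                 # one directory per tried region
        start          # start address of the region
        end            # end address of the region
        nr_accesses    # observed access frequency
        age            # how long the current access pattern has lasted
    1/
        ...
    N/                 # thousands of regions mean thousands of directory/file creations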

User level overhead control

Users can control the overhead using the access-pattern-based damo show result filtering options, including --access_rate, --age, and --sz_region. damo passes the information to DAMON so that DAMON doesn't spend unnecessary time creating files for regions the user has no interest in. Therefore, good use of the options will allow you to minimize the overhead and get the damo show output faster. For example, you could ask damo show to show only hot regions, or regions in a specific range of hotness.

Also, note that the --total_sz_only option of damo show avoids having DAMON create all the directories and files, from kernel v6.6-rc1 on. If you are interested in only the total size of regions of a specific access pattern (e.g., the total size of regions that have not been accessed for more than 5 minutes), you could use it to get the information faster.
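
As a rough illustration of both points (the option names are the ones mentioned above; the exact value formats may differ across damo versions, so treat these as sketches rather than exact invocations):

# show only regions that have seen at least some access
$ sudo ./damo show --access_rate 5% 100%

# only the total size of regions not accessed for more than 5 minutes
$ sudo ./damo show --access_rate 0% 0% --age 5m max --total_sz_only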

DAMON level optimization ideas

We have a few kernel-level optimization ideas for better use of the feature, though. Two of those are for allowing users to navigate DAMON monitoring results as they do with online map services like Google Maps.

The first idea is to let users know how many tried regions exist at the moment, so that users can avoid using the feature when the number is too large, or modify the target access pattern of damo show so that only a small number of regions will be captured.

The second idea is to let users set the resolution of the information. That is, users will be able to set the total number of regions whose information will be exposed via the feature. Then, if the user-defined number is smaller than the number of real tried regions, DAMON will collapse some of the regions for the report. As a result, the quality of the information will be degraded, but the number of tried-regions directories/files to create will also be reduced.

Using the two features, damo users will be able to control the resolution and the specific area of the monitoring results to show, similar to how Google Maps-like products show a low-resolution overall picture and then let you zoom in/out to the region of interest. This may take some time, though.

DAMO level faster solution

The tried regions feature is not the only way to get the monitoring results. DAMON also provides tracepoints, which don't require creation of the files. Maybe we can think about adding a new option to damo show that lets users ask damo show to use the DAMON tracepoints instead of the tried regions feature. It could be slower than damo show under a small number of regions since it would need to enable/capture/disable the tracepoints, though, especially since the current damo implementation uses perf internally. I expect it would take about 3-5 seconds in general, but it wouldn't increase too much, unlike damo show under a large number of regions.
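
As a sketch of that direction (assuming the damon:damon_aggregated tracepoint that DAMON already provides; this is not a damo option, just an illustration of the idea):

# capture DAMON's aggregated monitoring results system-wide for a few seconds
$ sudo perf record -a -e damon:damon_aggregated -- sleep 3

# dump the captured per-region samples
$ sudo perf script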

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/admin-guide/mm/damon/usage.rst?h=v6.6-rc2#n422

sj-aws commented 1 year ago

Can we also hide such filtered-out regions from "tried regions" in the damo show output?

cgroup and backing-content type based DAMOS filters work in page granularity, while DAMON regions are defined as address ranges, so such hiding would impose significant overhead. The address range and monitoring target based DAMOS filters patchset[1] may give you the details.

So, making such a feature would be possible, but I cannot think of an efficient implementation of it at the moment. So I'd recommend looking for other options.

[1] https://lore.kernel.org/damon/20230802214312.110532-1-sj@kernel.org/

honggyukim commented 11 months ago

Hi SeongJae,

I'm sorry for the late response again. I had read your detailed explanation, but took a bit of time to digest all the comments.

  1. Expected high overhead mechanism

I now understand that the high overhead is expected when bringing up the information through tried regions via sysfs.

  2. User level overhead control

Yeah, that would be another good option.

  3. DAMON level optimization ideas

The first idea is to let users know how many tried regions exist at the moment, so that users can avoid using the feature when the number is too large, or modify the target access pattern of damo show so that only a small number of regions will be captured.

That would be a good idea. I sometimes wanted to know only the number of tried_regions for the given DAMOS action.

The second idea is to let users set the resolution of the information. That is, users will be able to set the total number of regions whose information will be exposed via the feature. Then, if the user-defined number is smaller than the number of real tried regions, DAMON will collapse some of the regions for the report. As a result, the quality of the information will be degraded, but the number of tried-regions directories/files to create will also be reduced.

That would also be good, but I don't have a clear idea how to properly collapse the information.

This may take some time, though.

Sure. I didn't expect it to be supported in the near future.

  4. DAMO level faster solution

Having tracepoints will also be a good option.

cgroup and backing-content type based DAMOS filters work in page granularity, while DAMON regions are defined as address ranges, so such hiding would impose significant overhead.

Thanks. I get that there is no way to filter out before scanning each page inside the regions.

The address range and monitoring target based DAMOS filters patchset[1] may give you the details.

I remember this was implemented based on my request at https://github.com/awslabs/damo/issues/65#issuecomment-1656379106.

Thanks very much for your help and explanation as always.

sjp38 commented 11 months ago

Hi Honggyu,

Thank you very much for your valuable feedback as always.

That would be a good idea. I sometimes wanted to know only the number of tried_regions for the given DAMOS action. [...] That would also be good, but I don't have a clear idea how to properly collapse the information.

I'll prioritize the number of tried regions and resolution-based collapsing implementations.

The resolution-based collapsing would be somewhat similar to damo report heats. We split the region by the user-specified resolution, and merge regions in each cell.
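
For reference, the heatmap flow that this collapsing would resemble works on recorded results, roughly like below (a minimal sketch; resolution-related options may differ by damo version):

# record monitoring results for a target command, then render them as a
# fixed-resolution heatmap of time x address cells, similar to the proposed collapsing
$ sudo ./damo record "sleep 30"
$ sudo ./damo report heats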

Having tracepoints will also be a good option.

The DAMOS tried regions tracepoint is also now implemented, and the patches are merged in the mm tree. The support in damo record is also implemented (https://github.com/awslabs/damo/commit/d5814668). There is no automated test for that yet, though.

I remember this was implemented based on my request at https://github.com/awslabs/damo/issues/65#issuecomment-1656379106.

You're correct. It's in the mm tree. Hopefully, it will be merged into Linux v6.7.

honggyukim commented 11 months ago

Hi SeongJae,

I replied three weeks after your answer, but you replied right after my comment. :)

I'll prioritize the number of tried regions and resolution-based collapsing implementations.

I would like to say that you don't have to take this request too seriously. It's just my wish list, not a seriously important request, to be honest.

I actually need a more serious and important feature in DAMON, but I need to talk to my colleagues first.

Besides that, damo is getting more and more important to our project, so I feel grateful for your persistent work and support for this useful project.

sjp38 commented 11 months ago

No problem at all. Please feel free to ask for new features and prioritize your requests as needed. We want this tool to be somewhat useful for real users like you :)

Couldn't be happier than hearing that you think it is somewhat useful.

honggyukim commented 11 months ago

We want this tool to be somewhat useful for real users like you :)

Thanks. Happy to hear that! :)

honggyukim commented 11 months ago

I noticed that I mixed up the usage of damo show and damo status. I feel like it'd be useful to see the current DAMON settings without updating tried_regions.

We may be able to provide a simple and quick mode of damo status that doesn't write commit or update_schemes_tried_regions to the kdamond state file; then it'd be really quick. It could be provided as an additional option.
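
For context, a rough sketch of the difference (sysfs paths as in the DAMON documentation, assuming a single kdamond; actual paths depend on the setup): the slow path asks the kernel to refresh a snapshot by writing to the kdamond's state file and then waits for it, while a parameters-only status could just read the already-existing files.

# slow path: request a tried-regions update and wait for the snapshot
$ echo update_schemes_tried_regions | sudo tee /sys/kernel/mm/damon/admin/kdamonds/0/state

# quick path: only read existing parameter files, e.g. the sampling interval
$ cat /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/monitoring_attrs/intervals/sample_us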

sj-aws commented 11 months ago

Nice idea, agreed to all the points. I will work on it.

sj-aws commented 11 months ago

Hi Honggyu,

I think your interest is only in the DAMOS statistics, correct? I implemented the --damos_stats option of damo status[1]. It updates only the scheme statistics and shows them. Would it cover your needs?

[1] https://github.com/awslabs/damo/commit/4916f6433313bb7fb45f658611dcfbb36fb8ee29

honggyukim commented 11 months ago

Hi SeongJae,

Thanks for the update.

I think your interest is only in the DAMOS statistics, correct? I implemented the --damos_stats option of damo status

I actually wanted to have the status output something like this without statistics and tried regions.

$ ./damo status
kdamond 0
    state: on, pid: 969
    context 0
        ops: paddr
        target 0
            pid: 0
            region [4,294,967,296, 18,253,611,007) (13.000 GiB)
        intervals: sample 100 ms, aggr 2 s, update 20 s
        nr_regions: [100, 10,000]
        scheme 0
            action: pageout per aggr interval
            target access pattern
                sz: [4.000 KiB, max]
                nr_accesses: [0 samples, 0 samples]
                age: [5 aggr_intervals, 9,223,372,036,854 aggr_intervals]
            quotas
                0 ns / 0 ns per max
                priority: sz 0 %, nr_accesses 0 %, age 0 %
            watermarks
                metric none, interval 0 ns
                0 %, 0 %, 0 %

But if it doesn't take much time to get the statistics, then I'm fine with showing them together. I also think that the main time-consuming part is getting the tried regions info.

Besides that, I also like the output of --damos_stats, and it would be useful when monitoring the status. So please keep the option. Thanks.

$ ./damo status --damos_stats
nr_tried: 5,258
sz_tried: 650.000 GiB
nr_applied: 2
sz_applied: 68.000 KiB
qt_exceeds: 0
honggyukim commented 11 months ago

By the way, I actually got an error when running damo status as follows.

$ ./damo status
Traceback (most recent call last):
  File "/home/root/damo/./damo", line 116, in <module>
    main()
  File "/home/root/damo/./damo", line 113, in main
    subcmd.execute(args)
  File "/home/root/damo/_damo_subcmds.py", line 31, in execute
    self.module.main(args)
  File "/home/root/damo/damo_status.py", line 137, in main
    update_tried_regions=(args.damos_stat == None))
AttributeError: 'Namespace' object has no attribute 'damos_stat'. Did you mean: 'damos_stats'?

I got the previous sane result back after reverting the following bad commit.

a905107f6eac6af33badf5f53161b68570621d6e is the first bad commit
commit a905107f6eac6af33badf5f53161b68570621d6e
Author: SeongJae Park <sj38.park@gmail.com>
Date:   Sat Nov 4 21:49:25 2023 +0000

    damo_status: Remove --damos_stat and --damos_stat_field options

    Remove the options in favor of --damos_stats and --damos_stat_fields.
    Hopefully there is no user of the options, so no grace period is needed.
    Will restore if someone complains.

    Signed-off-by: SeongJae Park <sj38.park@gmail.com>

 damo_status.py | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)
sj-aws commented 11 months ago

I actually wanted to have the status output something like this without statistics and tried regions.

Oh, ok. Making yet another option for the purpose is also no problem. I'll implement it soon.

I actually got an error when running damo status as follows.

Ah, nice catch. Coincidentally, I also just found and fixed it[1].

[1] https://github.com/awslabs/damo/commit/f4b382e8d6815760c90caa541addb38feba3eab5

sj-aws commented 10 months ago

Hello Honggyu,

In short, your slow damo show might not be due to the DAMOS applied regions creation overhead, but due to the long aggregation interval of your setup.

The detail is like this: damo show asks the DAMON sysfs interface to update the DAMOS applied regions directory. Because the DAMON sysfs interface does not keep the DAMOS applied regions information up to date at all times, it has to wait until a DAMON snapshot is ready and DAMOS therefore applies its actions to the regions. By default, a DAMON snapshot becomes ready once per aggregation interval, due to the sampling-based monitoring mechanism. Hence, if you set a long aggregation interval, the DAMON sysfs interface has to wait a long time before it can even start creating the applied regions directory. Your time output in the first comment of this thread, which shows nearly zero system time, also fits this theory.

To overcome similar issues, we implemented the DAMOS apply interval[1] and its prerequisite patchsets. The DAMON sysfs interface still waits for one aggregation interval in that case, though. Updating it to finish as soon as the applied regions directory creation is done is on our TODO list. I don't think we need to prioritize it at the moment, though, since your issue would be fixed by the above changes. If not, please let us know.

[1] https://lore.kernel.org/damon/20230916020945.47296-1-sj@kernel.org/

honggyukim commented 10 months ago

Hi SeongJae,

Thanks for your help!

your slow damo show might not be due to the DAMOS applied regions creation overhead, but due to the long aggregation interval of your setup.

You're right. I'm able to see the difference across different interval setups. But it looks like it's not related to DAMOS, since I can see the difference even with the default stat action.

Here are the examples. I just ran damo start with the default setup, which uses a 5ms sampling interval and a 100ms aggregation interval, as follows.

$ ./damo start

$ time ./damo status
kdamond 0
    state: on, pid: 1048
    context 0
        ops: paddr
        target 0
            pid: 0
            region [4,294,967,296, 9,663,676,415) (5.000 GiB)
        intervals: sample 5 ms, aggr 100 ms, update 1 s
        nr_regions: [10, 1,000]

real    0m0.423s
user    0m0.095s
sys     0m0.018s

In this case, the damo status ran really fast.

However, if I increase the intervals, I can see that damo status takes much more time.

# Stop the previous damo start
$ ./damo stop

# Start damo with 20 times longer intervals
$ ./damo start --monitoring_intervals 100ms 2s 20s

$ time ./damo status
kdamond 0
    state: on, pid: 1086
    context 0
        ops: paddr
        target 0
            pid: 0
            region [4,294,967,296, 9,663,676,415) (5.000 GiB)
        intervals: sample 100 ms, aggr 2 s, update 20 s
        nr_regions: [10, 1,000]

real    0m4.373s
user    0m0.088s
sys     0m0.022s

This time, damo status takes more than 4 seconds.

If this is also related to DAMOS, because stat is also one of the DAMOS actions, then I think the waiting time can be shorter when running damo status without updating the tried regions and the DAMOS stat info, as we already discussed:

I actually wanted to have the status output something like this without statistics and tried regions.

Oh, ok. Making yet another option for the purpose is also no problem. I'll implement it soon.

Simply using --damos_stats also makes the execution faster. The following shows the difference clearly.

$ ./damo stop

$ ./damo start --monitoring_intervals 100ms 2s 20s

$ time ./damo status --damos_stats

real    0m0.518s
user    0m0.088s
sys     0m0.015s

$ time ./damo status
kdamond 0
    state: on, pid: 1196
    context 0
        ops: paddr
        target 0
            pid: 0
            region [4,294,967,296, 9,663,676,415) (5.000 GiB)
        intervals: sample 100 ms, aggr 2 s, update 20 s
        nr_regions: [10, 1,000]

real    0m5.920s
user    0m0.083s
sys     0m0.016s

Thanks very much for your help.

honggyukim commented 10 months ago

I can clearly see that the bottleneck is the sysfs write inside _damon_sysfs.update_schemes_tried_regions.

This is the result from the uftrace tool, which I mentioned previously; the trace was recorded simply as follows.

$ uftrace record ./damo status

The following is the output that I got by running uftrace tui (screenshot omitted).

honggyukim commented 10 months ago

The uftrace replay output is shown below.

$ uftrace replay -t 10ms
# DURATION     TID     FUNCTION
            [  1247] | __main__.<module>() {
  18.424 ms [  1247] |   importlib._bootstrap._find_and_load();
            [  1247] |   importlib._bootstrap._find_and_load() {
            [  1247] |     damo_adjust.<module>() {
            [  1247] |       _damon_result.<module>() {
  26.960 ms [  1247] |         _damo_fmt_str.<module>();
  11.091 ms [  1247] |         _damon.<module>();
  69.217 ms [  1247] |       } /* _damon_result.<module> */
  69.873 ms [  1247] |     } /* damo_adjust.<module> */
  70.358 ms [  1247] |   } /* importlib._bootstrap._find_and_load */
            [  1247] |   importlib._bootstrap._find_and_load() {
  12.133 ms [  1247] |     damo_report.<module>();
  12.552 ms [  1247] |   } /* importlib._bootstrap._find_and_load */
            [  1247] |   main() {
            [  1247] |     _damo_subcmds.add_parser() {
  12.240 ms [  1247] |       damo_report.set_argparser();
  13.995 ms [  1247] |     } /* _damo_subcmds.add_parser */
            [  1247] |     _damo_subcmds.execute() {
            [  1247] |       damo_status.main() {
            [  1247] |         _damon.update_read_kdamonds() {
            [  1247] |           _damon.update_schemes_status() {
            [  1247] |             _damon.update_schemes_stats() {
            [  1247] |               _damon_sysfs.update_schemes_stats() {
            [  1247] |                 _damo_fs.write_file() {
 416.006 ms [  1247] |                   TextIOWrapper.__exit__();
 416.061 ms [  1247] |                 } /* _damo_fs.write_file */
 416.148 ms [  1247] |               } /* _damon_sysfs.update_schemes_stats */
 416.155 ms [  1247] |             } /* _damon.update_schemes_stats */
            [  1247] |             _damon.update_schemes_tried_regions() {
            [  1247] |               _damon_sysfs.update_schemes_tried_regions() {
            [  1247] |                 _damo_fs.write_file() {
   4.159  s [  1247] |                   TextIOWrapper.__exit__();
   4.159  s [  1247] |                 } /* _damo_fs.write_file */
   4.159  s [  1247] |               } /* _damon_sysfs.update_schemes_tried_regions */
   4.159  s [  1247] |             } /* _damon.update_schemes_tried_regions */
   4.576  s [  1247] |           } /* _damon.update_schemes_status */
   4.581  s [  1247] |         } /* _damon.update_read_kdamonds */
   4.583  s [  1247] |       } /* damo_status.main */
   4.583  s [  1247] |     } /* _damo_subcmds.execute */
   4.668  s [  1247] |   } /* main */
   4.780  s [  1247] | } /* __main__.<module> */

This is the trace result with a time filter, which discards small functions that take under 10ms.

sjp38 commented 10 months ago

Thank you for the update and the awesome uftrace output. Yes, even without a scheme, the write would take time. Maybe we could optimize the DAMON sysfs interface for this corner case. I'll take a look soon.

sj-aws commented 10 months ago

Hi Honggyu, just implemented an option[1] for this case. It shows the detailed kdamond status without scheme stats and tried regions. For example:

$ sudo ./damo start --damos_action stat
$ sudo ./damo status --damon_params
kdamond 0
    state: on, pid: 45564
    context 0
        ops: paddr
        target 0
            pid: 0
            region [4,294,967,296, 136,292,859,903) (122.933 GiB)
        intervals: sample 5 ms, aggr 100 ms, update 1 s
        nr_regions: [10, 1,000]
        scheme 0
            action: stat per aggr interval
            target access pattern
                sz: [0 B, max]
                nr_accesses: [0 samples, 3,689,348,814,741,910,528 samples]
                age: [0 aggr_intervals, 184,467,440,737,095 aggr_intervals]
            quotas
                0 ns / 0 ns per max
                priority: sz 0 %, nr_accesses 0 %, age 0 %
            watermarks
                metric none, interval 0 ns
                0 %, 0 %, 0 %

[1] https://github.com/awslabs/damo/commit/4521f95ee90c04d24eed7702c6033c29d1077970

honggyukim commented 10 months ago

Hi Honggyu, just implemented an option[1] for this case. It shows the detailed kdamond status without scheme stats and tried regions.

Hi SeongJae, I've found that the new --damon_params option makes damo status much faster. Thanks!

sj-aws commented 3 months ago

Hi Honggyu, unfortunately I've lost some of the context of this issue. Are the issues all resolved? Or are you waiting for any answers or implementations from my side?

honggyukim commented 3 months ago

Sorry for the late response. damo status --damon_params is much faster, so we can close this issue.