StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Communication and memory access overhead #1561

Open luo-junpeng opened 1 year ago

luo-junpeng commented 1 year ago

In addition to the core computing function, when a task is executed across multiple nodes there are communication overheads between nodes as well as memory access overheads. How can these two overheads be calculated and measured?

lightsighter commented 1 year ago

There are no such things as "multi-node tasks" in Legion. Each task runs on a single processor on one node somewhere. Legion will perform communication as necessary to maintain the sequential semantics of the application. The best way to measure that needed communication is with the Legion Profiler.

luo-junpeng commented 1 year ago

@lightsighter Yes, but how do you calculate the communication overhead when tasks on different nodes need to communicate? Legion seems to use MPI for cluster communication.

luo-junpeng commented 1 year ago

@lightsighter Legion Profiler only seems to show a task's start time, end time, and execution time; I need to see the task's communication time.

rohany commented 1 year ago

In general, tasks themselves do not perform any communication, as each task runs on a single processor. Communication done by the runtime system occurs outside of tasks, and the Legion Profiler records all communication performed between memories. If you want to profile the task itself, i.e. things like cache hit rate or something, you should use a profiler like Nsight systems.

elliottslaughter commented 1 year ago

It sounds like @luo-junpeng would be interested in viewing the channels in Legion Prof. By "channel", I mean a communication pathway used to perform a copy of data from one instance of a region to another. (Channels also record fills of regions and certain partitioning operations.)

As @rohany noted, copies are distinct from tasks. Legion issues them before a task to ensure the task's data is valid. However, copies can be done asynchronously with other copies and tasks. Therefore, you can't really meaningfully calculate an "overhead" for communication in Legion. The only way to do that is to modify your program: e.g., you could make all your regions contain exactly one element (and see how much faster it runs that way). But these are at best approximations, and various factors can confound your measurements.

Also, just for posterity, Legion uses GASNet (https://gasnet.lbl.gov/) for communication in the vast majority of setups.

luo-junpeng commented 1 year ago

@elliottslaughter Yeah, legion_prof records copies of data from one instance of a region to another, including the start time and stop time. I'm now subtracting the start time from the stop time. Does the difference between the two values indicate the actual data replication time of the copy operation, or does it also include the communication time and memory access time during the copy operation?

```cpp
copy_infos.emplace_back(CopyInfo());
CopyInfo &info = copy_infos.back();
info.op_id = prof_info->op_id;
info.size = usage.size;
info.create = timeline.create_time;
info.ready = timeline.ready_time;
info.start = timeline.start_time;
// use complete_time instead of end_time to include async work
info.stop = timeline.complete_time;
```

elliottslaughter commented 1 year ago

This is a question for someone who works on Realm, maybe @eddy16112.

The tricky part is that Legion (Realm) may split a copy up into multiple pieces, and each piece may go over multiple hops (especially on older machines where the GPUs aren't directly connected, Legion has to do something like GPU -> CPU -> other CPU -> GPU).

I believe that the start and end should capture the entire end-to-end copy process, but given the complexity that can be hidden inside, it may not always mean what you think.

eddy16112 commented 1 year ago

Usually, end_time - start_time is the actual time spent moving the entire data. Legion Prof uses complete_time - start_time, which is fine. create_time is the timestamp when Legion issues the copy, and ready_time is the timestamp when its preconditions are satisfied.
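This timestamp breakdown can be sketched in a few lines (plain Python with invented timestamp values; the field names mirror Realm's copy timeline as described above, but everything else here is hypothetical):

```python
# Decompose a profiled copy's timeline into phases, following the
# create/ready/start/complete timestamps Realm reports for a copy.
# The numeric values below are invented for illustration.

def copy_phases(create, ready, start, complete):
    """Return the duration of each phase of one copy's timeline."""
    return {
        "queued": ready - create,      # waiting for preconditions
        "waiting": start - ready,      # ready but not yet executing
        "transfer": complete - start,  # actual data movement ("total time")
    }

phases = copy_phases(create=100, ready=150, start=180, complete=480)
print(phases)  # the "transfer" entry is the data-movement cost
```

The "transfer" duration is the complete_time - start_time quantity that Legion Prof reports as the copy's total time.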

luo-junpeng commented 1 year ago

Yeah, @eddy16112, the actual time spent moving the entire data is what I want. Thank you for your help. @elliottslaughter @eddy16112 @lightsighter @rohany By the way, I'd also like to know the overhead time for accessing memory during data movement. I noticed that there is a LegionProfInstance::process_mem_desc function in the legion_profiling.cc file:

```cpp
void LegionProfInstance::process_mem_desc(const Memory &m)
{
  if (m == Memory::NO_MEMORY)
    return;
  if (std::binary_search(mem_ids.begin(), mem_ids.end(), m.id))
    return;
  mem_ids.push_back(m.id);
  std::sort(mem_ids.begin(), mem_ids.end());

  mem_desc_infos.emplace_back(MemDesc());
  MemDesc &info = mem_desc_infos.back();
  info.mem_id = m.id;
  info.kind  = m.kind();
  info.capacity = m.capacity();
  const size_t diff = sizeof(MemDesc);
  owner->update_footprint(diff, this);
  process_proc_mem_aff_desc(m);
}
```

eddy16112 commented 1 year ago

@luo-junpeng how do you define the overhead time for accessing memory? Regarding process_mem_desc, if I remember correctly, it is used to record the description of a memory, such as its size and kind (e.g., system memory, GPU framebuffer memory, etc.).

luo-junpeng commented 1 year ago

@eddy16112 It can be understood as two points:

  1. Memory access time from one IP address to another
  2. Time for memory -> cache -> CPU or memory -> CPU -> cache

I don't know whether this memory access overhead is recorded in legion_prof, or where it should be recorded if it is not.
eddy16112 commented 1 year ago

> @eddy16112 It can be understood as two points:
>
>   1. Memory access time from one IP address to another
>   2. Time for memory -> cache -> CPU or memory -> CPU -> cache
>
> I don't know whether this memory access overhead is recorded in legion_prof, or where it should be recorded if it is not.

I think your definition of overhead (memory access time from one IP address to another) is similar to the cost of data movement, which is end_time - start_time. Due to performance issues, we do not encourage people to use remotely accessible memory such as the global memory in GASNet, so most applications will map Legion logical regions into local memories, and Legion will take care of the data movement between memories. @lightsighter correct me if I am wrong.

lightsighter commented 1 year ago

Yes, that's accurate. In general, we encourage you to either move tasks to data or data to tasks so that they are colocated and you aren't doing small individual reads/writes as RDMA operations over the network. You can map instances into one of the "global" memories like the GASNet memory and use a GenericAccessor to do remote reads/writes over the network, but I think you'll find that to be fairly slow. In general, it's much better to do bulk data movement as it's more efficient for the network and ultimately results in better throughput.

luo-junpeng commented 1 year ago

@elliottslaughter I noticed that there is a copy message about the channel, but what does this "direct copy" mean? [screenshot]

According to my previous understanding, "copy" refers to data replication between nodes. The following figure shows the total time, which is obtained as end - start. Is this the actual data replication time? [screenshot]

elliottslaughter commented 1 year ago

An indirect copy is something else. That is what we usually call a "gather/scatter" copy, i.e., the source and destination locations are indirected through a third region of indices.

What you want to look for is num_hops. I don't see that in your screenshot, so maybe you're on a version of Legion before we fixed this. I believe a recent copy of master or control_replication should tell you how many hops the copy went through. In the case where num_hops=1 then there is only a direct copy and the timing should be accurate.
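One way to act on this advice when post-processing profiler data (a hypothetical sketch; the dicts stand in for copy records parsed from the logs, and `num_hops` is the field mentioned above):

```python
# Keep only single-hop ("direct") copies, whose start/stop interval
# reflects one transfer rather than a multi-hop staging path
# (e.g. GPU -> CPU -> other CPU -> GPU on older machines).
# These records are invented for illustration.
copies = [
    {"op_id": 1, "num_hops": 1, "start": 10, "stop": 40},
    {"op_id": 2, "num_hops": 3, "start": 12, "stop": 90},  # multi-hop: timing is end-to-end
]

direct = [c for c in copies if c["num_hops"] == 1]
for c in direct:
    print(c["op_id"], c["stop"] - c["start"])  # single-transfer time
```

For multi-hop copies the interval still captures the end-to-end movement, but it cannot be attributed to any one link.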

luo-junpeng commented 1 year ago

OK, thank you @elliottslaughter. Excuse me, I have one more question: as shown in the following figure, the system memory on node1 also has a total time indicator. Does this indicator refer to the time for accessing the system memory on node1? [screenshot]

lightsighter commented 1 year ago

> As shown in the following figure, the system memory on node1 also has a total time indicator. Does this indicator refer to the time for accessing the system memory on node1?

No, that is the lifetime of a particular instance of a logical region in that memory, from the point where it was created to the point where it was deleted. Multiple different tasks could (and likely did) use that instance during the course of its lifetime.
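The distinction can be made concrete with a toy sketch (all intervals invented for illustration): the instance's "total time" spans its creation to its deletion, while several task executions fall inside that window.

```python
# An instance's "total time" in the memory timeline is its lifetime,
# not access time: several tasks may use the same instance.
instance = {"created": 0, "deleted": 1000}
task_uses = [(100, 200), (250, 400), (700, 900)]  # (start, stop) per task

lifetime = instance["deleted"] - instance["created"]
busy = sum(stop - start for start, stop in task_uses)
print(lifetime, busy)  # lifetime (1000) far exceeds summed task use (450)
```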

luo-junpeng commented 1 year ago

OK, thank you @lightsighter. For memory access overhead, I want to measure the cache hit ratio (e.g., L1, L2, and L3) of the core computing phase of a task (excluding the runtime part). Which source code can solve my problem? Is there any third-party tool that can solve this problem?

lightsighter commented 1 year ago

You can actually ask Legion to do that through the mapping interface. If you ask for cache profiling requests when you map a task then Legion will report back the statistics for the execution of that task after the task is done running.

eddy16112 commented 1 year ago

@lightsighter I think the cache profiling with PAPI is not tested for years. I feel like it is no longer working.

lightsighter commented 1 year ago

Hmm, ok, well at least that is how it should work...

luo-junpeng commented 1 year ago

haha, I want to integrate cache profiling with PAPI into legion_prof and visualize it on the profiling web page. I observed the use of PAPI in runtime/realm/threads.cc, but I don't have a clue yet.

luo-junpeng commented 1 year ago

By the way, in the visualization of legion_prof, there is a utility indicator. What is the specific meaning of this?

elliottslaughter commented 1 year ago

That is the utility processor: Legion (by default) reserves one core for doing runtime analysis.

luo-junpeng commented 1 year ago

@eddy16112 @lightsighter I noticed that there was a message in realm/profiling.h. [screenshot] I used the L1DCachePerfCounters metric, [screenshot] but it doesn't work: the ProfilingResponse resp does not seem to receive an L1DCachePerfCounters response, so the if judgment in the following figure is false. [screenshot] How do I make the cache information take effect?

luo-junpeng commented 1 year ago

> @lightsighter I think the cache profiling with PAPI is not tested for years. I feel like it is no longer working.

By the way, in which underlying source code is cache profiling with PAPI? I only observed the use of PAPI in runtime/realm/threads.cc

lightsighter commented 1 year ago

The profiling should all be handled by Realm, including anything having to do with PAPI counters. If it is not returning meaningful results then we should create a separate issue for that, although I can't promise when it will be fixed.

luo-junpeng commented 1 year ago

I went through a layer-by-layer analysis of the profiling data, and I ended up here. [screenshot] But I don't know where the profiling_runtime_task parameter came from; that is, the source of the profiled data seems to be unknown. Only a dictionary is defined here, and no parameters are passed. [screenshot]

lightsighter commented 1 year ago

That is the callback that Realm invokes into Legion whenever a profiling response is ready. From there it can dispatch to many different places. I wouldn't worry too much about trying to debug this yourself. If you can make a small reproducer showing what you think should work, with profiling responses coming back to your mapper for PAPI counters, then we can try and debug that.

luo-junpeng commented 11 months ago

[/Unified-Legion/runtime/legion/legion_profiling.cc] As shown in the following figure, I added a ProfilingMeasurements::OperationMemoryUsage request to the add_task_request function. [screenshot] Then when I try to receive ProfilingMeasurements::OperationMemoryUsage via the response in the process_task function, I find it unsuccessful; as shown in the figure, the if judgment is false. [screenshot] What are the correct steps I should take?

lightsighter commented 11 months ago

I don't think Realm tasks will have memory usage because Realm doesn't have any concept of which tasks are associated with instances. The Realm::ProfilingMeasurements::OperationMemoryUsage will only provide a profiling response for data movement operations or dependent partitioning operations that actually consume memory.

luo-junpeng commented 11 months ago

@lightsighter, thank you. Yes, I'm just giving an example here. What I'm actually wondering is which measurements process_task (and of course others like process_copy, process_fill, process_meta) supports. Or, when I add a ProfilingMeasurement and receive it through a ProfilingResponse, do I need to change other source code between these two steps so that the ProfilingResponse can successfully receive the ProfilingMeasurement information?

lightsighter commented 11 months ago

I guess I'm confused about what you are trying to do. Are you trying to change the mapper to get profiling responses? In general you shouldn't (need to) be modifying runtime code like this. I would encourage you to try to use either the mapper interface or the offline profiler to answer queries about memory usage for tasks.