StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu

Legion: S3D poor weak scaling performance on Frontier #1699

Closed: syamajala closed this issue 1 month ago

syamajala commented 3 months ago

I've done another set of runs of S3D on Frontier and am seeing poor weak scaling performance: http://sapling2.stanford.edu/~seshu/s3d_ammonia/pressure_wave/weak_scaling.html

Legion prof archives are in the pwave directories: https://legion.stanford.edu/prof-viewer/?url=https://sapling.stanford.edu/~seshu/s3d_ammonia/pressure_wave/

4 nodes: https://legion.stanford.edu/prof-viewer/?url=https://sapling.stanford.edu/~seshu/s3d_ammonia/pressure_wave/pwave_x_4_ammonia/legion_prof/

2048 nodes: https://legion.stanford.edu/prof-viewer/?url=https://sapling.stanford.edu/~seshu/s3d_ammonia/pressure_wave/pwave_x_2048_ammonia/legion_prof/

In the 2048 node profile there seems to be a gap after the AwaitMPITaskEarly where it doesn't look like much is going on. See 293.58 - 298.82 seconds, for example.

lightsighter commented 3 months ago

There is blocking in your top-level task that is preventing the runtime from getting ahead. I suspect you haven't adjusted your mapper to correctly cope with the fixed frame code, but you could also be waiting on a future.

syamajala commented 3 months ago

I did not see anything in #1680 about needing to update the mapper? I will talk to @elliottslaughter.

rohany commented 3 months ago

Are you sure it's blocking? It doesn't look like that to me (or at least S3D is pushing out a full iteration and then stopping). On 4 nodes, it takes 300ms for all operations in the trace to make it through the mapping stage of the pipeline, while on 2048 nodes it takes 3 seconds for that to happen. While it would be nice for the application to be farther ahead, that still seems like a problem.

elliottslaughter commented 3 months ago

Before Mike fixed #1680, we were running about 2× the requested number of frames in advance. Now that Mike has fixed that bug, we should probably double our min_frames_to_schedule and max_outstanding_frames values, since the runtime is now much more accurate about following what we ask for.
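
For reference, a minimal, hypothetical sketch of where those values live on the mapper side. It assumes a mapper derived from the stock DefaultMapper (the real S3D mapper is its own class), and the numbers are just the 1/2 values mentioned later in this thread, doubled:

#include "mappers/default_mapper.h"

using namespace Legion;
using namespace Legion::Mapping;

// Hypothetical mapper sketch, not the actual S3D mapper: bump the frame
// run-ahead now that the runtime follows these limits exactly.
class FrameTuningMapper : public DefaultMapper {
public:
  using DefaultMapper::DefaultMapper;

  void configure_context(const MapperContext ctx, const Task &task,
                         ContextConfigOutput &output) override
  {
    DefaultMapper::configure_context(ctx, task, output);
    output.min_frames_to_schedule = 2;  // previously 1
    output.max_outstanding_frames = 4;  // previously 2
  }
};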

lightsighter commented 3 months ago

Are we sure that all the task launches in this program are index space task launches that span the whole machine? There are no individual task launches being done, right (unless they are for future operations)?

syamajala commented 3 months ago

Yes, every single task launch in S3D should have either __demand(__index_launch) or __demand(__constant_time_launch) on it.
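
For context, those Regent annotations ask the compiler to emit a single index space launch instead of a loop of individual launches. A rough sketch of the same distinction in Legion's C++ API (the task ID and launch bounds here are placeholders, not S3D code):

#include "legion.h"

using namespace Legion;

// Individual launches vs. one index space launch over all ranks. The index
// space form is what lets the runtime analyze and map the whole set of point
// tasks as a single operation.
void launch_sketch(Context ctx, Runtime *runtime, TaskID timestep_task_id,
                   int num_ranks)
{
  // Individual launch: analyzed one at a time by the parent task's node.
  TaskLauncher single(timestep_task_id, TaskArgument());
  runtime->execute_task(ctx, single);

  // Index space launch spanning all ranks: one operation for the runtime.
  Rect<1> bounds(0, num_ranks - 1);
  IndexTaskLauncher index(timestep_task_id, Domain(bounds), TaskArgument(),
                          ArgumentMap());
  runtime->execute_index_space(ctx, index);
}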

syamajala commented 3 months ago

I doubled the number of frames and it doesn't seem like it made much of a difference.

I only ran 2048 nodes: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_ammonia/pwave_x_2048_ammonia/legion_prof/

lightsighter commented 3 months ago

That profile does not seem like it wants to load for me.

All the index launches span the entire machine?

syamajala commented 3 months ago

Do we have some way in regent or the runtime to actually verify this?

elliottslaughter commented 3 months ago

Regent doesn't know anything about how big the machine is, and the static analysis is nontrivial.

Something like the LoggingWrapper would report the sizes (and mapping) of index launches. Note that there will be extreme performance degradation from running with it, so this is for debugging purposes only.
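
For reference, a hedged sketch of how the LoggingWrapper can be interposed at mapper registration time; the real setup would wrap the S3D mapper rather than the DefaultMapper used here:

#include <set>
#include "legion.h"
#include "mappers/default_mapper.h"
#include "mappers/logging_wrapper.h"

using namespace Legion;
using namespace Legion::Mapping;

// Wrap whatever mapper the application uses so every task launch (including
// the sizes and mappings of index launches) gets logged. Debugging only.
static void create_mappers(Machine machine, Runtime *runtime,
                           const std::set<Processor> &local_procs)
{
  for (const Processor &proc : local_procs)
    runtime->replace_default_mapper(
        new LoggingWrapper(
            new DefaultMapper(runtime->get_mapper_runtime(), machine, proc)),
        proc);
}

// Hooked up before starting the runtime with:
//   Runtime::add_registration_callback(create_mappers);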

syamajala commented 3 months ago

Well, I think the first thing I want to check is that there are no single task launches and that every task is in fact being index space launched.

syamajala commented 3 months ago

I guess there is one:

https://gitlab.com/legion_s3d/legion_s3d/-/blob/subranks/rhst/s3d.rg?ref_type=heads#L1471

https://gitlab.com/legion_s3d/legion_s3d/-/blob/subranks/rhst/mpi_tasks.rg?ref_type=heads#L86-93

elliottslaughter commented 3 months ago

@lightsighter you can manually load the profile with:

legion_prof --attach http://sapling.stanford.edu/~seshu/s3d_ammonia/pwave_x_2048_ammonia/legion_prof/

Why are you asking about the index launches being across the entire machine?

Seshu's links above go to a task that is called once per timestep to fetch the timestep information. I think we have arranged this to not actually block on MPI 90% of the time. Therefore, the vast majority of these cases should give Legion plenty of time to do the reduce/broadcast on the futures.

I believe the index launches themselves should be across the entire machine.
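
For anyone following along, the handoff/await pattern being described here is built on Legion's MPI handshake; a rough, illustrative sketch (the task names and bodies are not the actual S3D tasks):

#include <vector>
#include "legion.h"

using namespace Legion;

// Handshake handle shared between the Legion and MPI sides; in practice it
// would be created with Runtime::create_handshake() before starting Legion.
static MPILegionHandshake handshake;

void handoff_to_mpi_task(const Task *task,
                         const std::vector<PhysicalRegion> &regions,
                         Context ctx, Runtime *runtime)
{
  handshake.legion_handoff_to_mpi(); // let the Fortran/MPI side run
}

void await_mpi_task(const Task *task,
                    const std::vector<PhysicalRegion> &regions,
                    Context ctx, Runtime *runtime)
{
  // Blocks only if the MPI side has not already handed control back.
  handshake.legion_wait_on_mpi();
  // ... read the timestep information produced on the MPI side ...
}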

syamajala commented 3 months ago

It seems to be the complete_frame call that is causing the main task to block.

Here is a profile on 1 node with it commented out: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_ammonia/legion_prof.3/
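
For context, a frame-delimited main loop looks roughly like this in Legion's C++ API (a sketch only, not the S3D top-level task; the loop body is a placeholder, and FRAME_SIZE matches the 10-timesteps-per-frame setting used in these runs). The runtime limits how far the issuing task can run ahead in units of these frames, which is where blocking like this comes from:

#include "legion.h"

using namespace Legion;

// Sketch of a frame-delimited main loop.
void main_loop_sketch(Context ctx, Runtime *runtime, int num_steps)
{
  const int FRAME_SIZE = 10; // 1 frame == 10 timesteps in these runs
  for (int step = 0; step < num_steps; step++) {
    // ... index launches for one timestep would go here ...
    if ((step + 1) % FRAME_SIZE == 0)
      runtime->complete_frame(ctx); // frame boundary the mapper counts against
  }
}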

lightsighter commented 3 months ago

Ok, but that only happens every N iterations. The profile looked like it was blocking multiple times each iteration so something else has to be blocking as well.

syamajala commented 3 months ago

That was the only thing I changed.

lightsighter commented 3 months ago

And what happens if you switch back to non-frame execution?

syamajala commented 3 months ago

Still waiting on the 8192 node run, but I have results up to 4096 nodes.

No frames: http://sapling2.stanford.edu/~seshu/s3d_ammonia/pressure_wave/weak_scaling.html

No frames profile at 4 nodes: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_ammonia/pressure_wave/pwave_x_4_ammonia/legion_prof/

No frames profile at 2048 nodes: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_ammonia/pressure_wave/pwave_x_2048_ammonia/legion_prof/

With frames: http://sapling2.stanford.edu/~seshu/s3d_ammonia/pressure_wave/old/weak_scaling.html

With frames profile at 4 nodes: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_ammonia/pressure_wave/old/pwave_x_4_ammonia/legion_prof/

With frames profile at 2048 nodes: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_ammonia/pressure_wave/old/pwave_x_2048_ammonia/legion_prof/

lightsighter commented 3 months ago

I can't see the profiles. They're not loading. Are the permissions set correctly?

Is there a reason you ran them so large? I would expect to see the difference in waits even on a small number of nodes.

What happens if you grow the number of frames? Do you see the waits spread out?

syamajala commented 3 months ago

I am able to view the profiles. There are profiles for smaller node counts available in that directory as well in the pwave_x directories.

1 - 4096 nodes: http://sapling2.stanford.edu/~seshu/s3d_ammonia/pressure_wave/

You can try using attach as well: legion_prof --attach http://sapling.stanford.edu/~seshu/s3d_ammonia/pressure_wave/pwave_x_2048_ammonia/legion_prof/

The only reason I ran it so large is that we have hours to burn: the ALCC allocation expires at the end of June and we didn't use all of it.

I can try running again with frames and use more of them.

lightsighter commented 3 months ago

Something changed very dramatically in the four node runs with frames. The main task is not blocking at all in these runs. It is gone before we even start running anything, as if we unrolled the whole main task. That doesn't appear to be happening in the old version. What did you set the mapper frame runahead to be?

lightsighter commented 3 months ago

Also, just looking at these profiles, the copies look like they are taking longer in the new runs than in the old ones.

syamajala commented 3 months ago

In the original run with frames, min_frames_to_schedule was 1 and max_outstanding_frames was 2. In this case 1 frame is 10 timesteps.

I did try min_frames_to_schedule = 2 and max_outstanding_frames = 4 at some point, where 1 frame is 10 timesteps, but it did not look any different to me.

It looks like Frontier is down, so I can't do any runs today.

lightsighter commented 3 months ago

I don't see any difference on the Legion side of things at scale. The trace replays are happening and they are taking the same amount of time. There's very little runtime overhead. Whatever is not scaling, it is not Legion's fault.

syamajala commented 2 months ago

By playing with core and NIC bindings so that we use all 4 NICs, we were able to get a bump in performance in the 4 and 8 rank/node cases: http://sapling2.stanford.edu/~seshu/s3d_ammonia/pressure_wave_case_weak_scaling.html

We are still hitting OOM at 4096 nodes when running 4 ranks/node though.

syamajala commented 2 months ago

There are two things I've noticed in these profiles: http://sapling2.stanford.edu/~seshu/s3d_scaling/4ranks/

First, the gap between the timesteps seems to be growing as we scale, which must mean something is happening on the Fortran side?

Second, once we come back to Legion, there is a weird gap before the timestep starts executing leaf tasks, and it is getting bigger as we scale. As best I can tell there is nothing going on, but I would probably need to generate a profile for every node in the run to be sure.

You can see it in this profile from 349.3 - 350.5 seconds: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_scaling/4ranks/pwave_x_1024_ammonia/legion_prof/

syamajala commented 2 months ago

@lightsighter is there a way for me to instrument the Fortran side with Legion Prof so we can get a box in the profile and be 100% sure when the time is being spent on the Fortran side? It's very annoying right now having to guess and eyeball by looking at gaps in the profile. It does not need to be super detailed; something that works based on the handshake would probably be enough.

syamajala commented 2 months ago

I think this makes more sense to me now, because we changed how we make the scaling plot. If I go back to the old method, based on times from legion_prof, then with just the few profiles I took it looks much better to me. I would need to do the full set of runs with legion_prof to be sure.

lightsighter commented 2 months ago

is there a way for me to instrument the Fortran side with Legion Prof so we can get a box in the profile and be 100% sure when the time is being spent on the Fortran side? It's very annoying right now having to guess and eyeball by looking at gaps in the profile. It does not need to be super detailed; something that works based on the handshake would probably be enough.

I'm not sure if this is useful or not, but there are user-level profiling ranges now: https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/legion.h?ref_type=heads#L8531-8543 They do have to be associated with a Legion task, though, and they will appear like a function call inside that task when it is rendered by the profiler. That won't quite allow you to render a time range for an "external processor" (we would need extra support for that), but maybe it lets you see where a Legion task is waiting for something.
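
A rough usage sketch, with the caveat that the exact signatures should be checked against the legion.h lines linked above (whether a provenance string is passed, and on which call, is an assumption here):

#include "legion.h"

using namespace Legion;

// Hypothetical example: bracket a wait inside a single Legion task so the
// profiler renders it like a nested call within that task. Signatures are
// approximated; see the legion.h link above for the authoritative ones.
void some_task_body(Context ctx, Runtime *runtime)
{
  runtime->start_profiling_range(ctx);
  // ... the wait on MPI / the Fortran side happens here, in the same task ...
  runtime->stop_profiling_range(ctx, "fortran handshake");
}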

syamajala commented 2 months ago

I think I would need to call start_profiling_range in the handoff task and stop_profiling_range in the await task? So it doesn't seem like this would work if the calls need to be within the same task.

Also, start_profiling_range and stop_profiling_range don't appear to be available in the C API.

lightsighter commented 2 months ago

I think I would need to call start_profiling_range in the handoff task and stop_profiling_range in the await task? So it doesn't seem like this would work if the calls need to be within the same task.

Yes, they do need to be in the context of the same task currently. I don't have a way to draw arbitrary boxes in the profile right now. We'd probably need to discuss doing that in a Legion meeting since it would require more significant changes to the profiler.

Also, start_profiling_range and stop_profiling_range don't appear to be available in the C API.

Probably not right now, but you can add it.

syamajala commented 1 month ago

Here is what we get when we only look at Legion: http://sapling2.stanford.edu/~seshu/s3d_scaling/average_21to30_iteration.html

Based on the Legion profiles, here is what I believe to be the time spent on the Fortran side between handoffs: http://sapling2.stanford.edu/~seshu/s3d_scaling/fortran_handoff.html

Shortly after we gave up on Gordon Bell, I started porting the low pass filter to Regent. I finished the implementation and it was working in 2 out of the 3 test cases I had, but the last one was crashing. There is still some correctness bug somewhere, but if we can get that to work then we can avoid doing the Fortran handoff every 10 timesteps and only do a handoff for checkpointing.

I'm going to close this issue and open a separate one for the profiling request above.