syamajala closed 10 months ago
Make me a reproducer on sapling with hanging processes.
Please do this today, you're now blocking the control replication merge to master.
I'm building on sapling right now. Having some issues though.
What issues?
Missing symbols when I run: /scratch2/seshu/legion_s3d_nscbc/Ammonia_Cases/pwave_x_1_hept/s3d.x: symbol lookup error: /scratch2/seshu/legion_s3d_nscbc//build/hept/libsum_tasks.so: undefined symbol: hijackCudaRegisterFatBinary
Make sure you rebuild your full S3D. That symbol no longer exists in the Realm CUDA hijack.
This was a fresh checkout.
That suggests that you are building against one version of Legion and then dynamically loading a different one at runtime.
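One way to check for this kind of build/runtime mismatch is to ask the dynamic loader which libraries it actually resolves, and to look for a stale undefined reference in the shared object named in the error. The helpers below are a minimal sketch, not part of S3D or Legion: the function names are hypothetical, and the filter assumes the runtime is linked as shared libraries whose file names contain "legion" or "realm". They rely only on the standard ldd and GNU nm tools.

```shell
#!/bin/sh
# Hypothetical helper: print the Legion/Realm shared libraries that the
# dynamic loader would resolve for the given executable or shared object.
# Assumes the library file names contain "legion" or "realm".
resolved_legion_libs() {
    ldd "$1" | grep -i -E 'legion|realm'
}

# Hypothetical helper: succeed (exit status 0) if the given shared object
# still carries an undefined dynamic reference to the named symbol.
needs_symbol() {
    nm -D --undefined-only "$1" | grep -q -w "$2"
}
```

For example, running resolved_legion_libs on s3d.x shows whether the resolved paths point at the checkout that was just built, and needs_symbol with libsum_tasks.so and hijackCudaRegisterFatBinary shows whether that library was compiled against headers that still expect the old hijack symbol. If the paths are unexpected, LD_LIBRARY_PATH or an embedded rpath is a likely place to look.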
I just built without cuda instead. There are processes on c0001: 481927 and 481928.
For what it's worth, the hijack bits moved into the Regent bindings: https://gitlab.com/StanfordLegion/legion/-/blob/master/bindings/regent/regent_cudart_hijack.cc
This looks like you haven't updated the mapper to use the new interface for replicating tasks because the top-level task is not control replicated. You started just one top-level task on node 0 and as a result there is no shard on node 1 to synchronize with MPI.
I implemented replicate_task in the mapper, are more changes needed?
I can see select_task_options is marking the task as replicable, but it doesn't look like replicate_task is ever getting called?
Where is the implementation of your mapper?
I have not checked the changes in yet, so you will have to look at them on sapling here: /scratch2/seshu/legion_s3d_no_cuda/rhst/rhst_mapper.cc
I mostly just moved the parts that seemed relevant from map_replicate_task to replicate_task.
It looks like you're not calling Runtime::set_top_level_task_mapper_id to ensure your mapper gets called for mapping the top-level task.
How are IDs associated with mappers? I'm calling replace_default_mapper, and I only see that add_mapper takes both an ID and a mapper.
That should be fine. How do I run your code?
cd /scratch2/seshu/legion_s3d_no_cuda/Ammonia_Cases
salloc -N 1 -p cpu --exclusive
./ammonia_job.sh
You're setting options.map_locally = true, which disables replication because you're not allowed to map the shards of a replicated task locally. You would be seeing this warning if you weren't suppressing warnings from Legion. I recommend that you stop suppressing warnings from Legion so you can actually see messages like this.
Ok. After fixing that issue it looks like it's working. Thanks!
I have updated to the latest control_replication in S3D and changed the mapper to use replicate_task, but I'm seeing a freeze when running multiple ranks. 1 rank seems to work.
Here are stack traces from a 2 rank run:
I'm using this commit: