StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0
679 stars 145 forks source link

Realm: better error message when failed to create an accessor #1772

Open eddy16112 opened 1 week ago

eddy16112 commented 1 week ago

When creating an accessor for a region instance whose metadata is not valid, Realm will throw an assertion

const Realm::InstanceLayoutGeneric* Realm::RegionInstance::get_layout() const: Assertion `r_impl->metadata.is_valid() && "instance metadata must be valid before accesses are performed"' failed

However, the assertion does not provide useful information for debugging. There are several cases that could trigger the assertion:

  1. the instance has been already destroyed or redistricted. I am not sure if we want to / can distinguish them.
  2. the allocation of the instance is not ready.
  3. the client forget to call fetch_metadata for remote instances.

Thus, it will be helpful if we can provide a detailed error message/code to tell which case is causing the failure.

lightsighter commented 1 week ago

I suspect this is going to be one of those things that is really hard to give a good error message. On the owner node at least you can mark a RegionInstanceImpl as being "alive" or not to tell the difference. If you're on a remote node though you probably don't have any way of telling if it is "alive" or not or just hasn't been paged in correctly yet.

eddy16112 commented 1 week ago

I just realized that we do not use accessor for remote instance any more. The only memory support remote access was global memory from GASNet1, which is almost deprecated. Maybe the IPC memory?

lightsighter commented 1 week ago

I think I can still make a GenericAccessor on a remote node for an instance in the REGDMA_MEM and then the generic accessor will turn the read and write calls into gets and puts using RDMA.

eddy16112 commented 1 week ago

REGDMA_MEM is just a LocalCPUMemory, and its put_bytes/get_bytes is memcpy, so I do not think it support remote access using accessor. Correct me if I am wrong.

If the REGDMA_MEM supports GenericAccessor. There are some active messages that are used to update the state of metadata of remote instances., so at least we can try to protect the constructor to make sure the instance is alive. Protecting the future read/write could be difficult, we might be able to provide a function for people to query the remote state after they are done with read/write, but I am not sure if it worths a roundtrip message.

lightsighter commented 1 week ago

REGDMA_MEM is just a LocalCPUMemory, and its put_bytes/get_bytes is memcpy, so I do not think it support remote access using accessor.

Huh, that is surprising to me. Maybe that is a regression but I thought we could do puts/gets to instances in REGDMA_MEM from remote nodes. If we can't it seems like a feature we should have at some point (not now).

Protecting the future read/write could be difficult, we might be able to provide a function for people to query the remote state after they are done with read/write, but I am not sure if it worths a roundtrip message.

I would make sure that whatever checking we do incurs near-zero overhead (some branches are fine since they can be predicted away).

eddy16112 commented 1 week ago

@lightsighter I just tried GenericAccessor on REGDMA_MEM, and it did not work. Because the RemoteMemory such as GASNetEXRemoteMemory or UCPRemoteMemory has no implementation of put_bytes/get_bytes https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/gasnetex/gasnetex_module.cc#L82.

I will not put any effort on this issue for now, but I plan to leave it open just in case we need it in the future.

lightsighter commented 1 week ago

Because the RemoteMemory such as GASNetEXRemoteMemory or UCPRemoteMemory has no implementation of put_bytes/get_bytes https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/gasnetex/gasnetex_module.cc#L82.

Ah, that's ok. The plumbing is there and my interpretation of what should be allowed is reasonable, we just haven't implemented it in the networking backends.

I will not put any effort on this issue for now, but I plan to leave it open just in case we need it in the future.

Sounds good. The code paths are there, just unimplemented until we need them which is fine.

apryakhin commented 1 week ago

Huh, that is surprising to me. Maybe that is a regression but I thought we could do puts/gets to instances in REGDMA_MEM from remote nodes. If we can't it seems like a feature we should have at some point (not now)

I was surprised that we don't have it too. I came across that when writing several integration tests and wanting to validate the remote data written. That's not a strong enough justification only for the feature to be pushed but would certainly remove some of the boiler-plate code we have in our integration tests on none perf-critical paths.