elliottslaughter opened 2 months ago
@elliottslaughter can we get the output of the following log message? I think we need better error handling here. I really wish we would stop asserting / aborting on error handling...
if (ret != 0)
{
  const char *message = realm_strerror(errno);
  log_aio.fatal("Failed asynchronous IO read [%d]: %s", errno, message);
  abort();
}
@muraj You can see it in the linked CI run:
[0 - 7f6b033f4c80] 0.634381 {6}{aio}: Failed asynchronous IO read [0]: Success
The Realm code is incorrect. AIO does not set errno at all, nor is it correct to query errno after a failed return. I think what you really want to see is the value of ret in that API call:
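For context, the convention being referenced here (in a libaio-style interface such as io_submit, not necessarily the exact call Realm makes) is roughly that the error comes back in the return value and errno is left untouched. A minimal illustrative sketch, not the actual Realm call site:

#include <cstdio>
#include <cstring>
#include <libaio.h>

// Illustrative only: io_submit reports failure through its (negative) return
// value; errno may still hold a stale value from an unrelated call.
int submit_read(io_context_t aio_ctx, struct iocb *cb)
{
  struct iocb *cbs[1] = { cb };
  int ret = io_submit(aio_ctx, 1, cbs);  // >0: iocbs submitted, <0: negated errno
  if (ret < 0)
    fprintf(stderr, "io_submit failed [%d]: %s\n", -ret, strerror(-ret));
  return ret;
}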
I rebased on top of the newest MR 1440 and applied the following local patch:
diff --git a/runtime/realm/transfer/lowlevel_dma.cc b/runtime/realm/transfer/lowlevel_dma.cc
index 6d64c79be..79ab8c165 100644
--- a/runtime/realm/transfer/lowlevel_dma.cc
+++ b/runtime/realm/transfer/lowlevel_dma.cc
@@ -345,8 +345,8 @@ namespace Realm {
static_cast<void *>(&cb), ret);
if (ret != 0)
{
- const char *message = realm_strerror(errno);
- log_aio.fatal("Failed asynchronous IO read [%d]: %s", errno, message);
+ const char *message = realm_strerror(ret);
+ log_aio.fatal("Failed asynchronous IO read [%d]: %s", ret, message);
abort();
}
}
Now I'm getting:
[0 - 7fd113bfac40] 1.316147 {6}{aio}: Failed asynchronous IO read [9]: Bad file descriptor
The fix for the AIO error reporting is here: https://gitlab.com/StanfordLegion/legion/-/merge_requests/1456
For the record this still reproduces on Sapling with the instructions at the top of this issue.
Bad file descriptor is interesting. Is it possible to get an strace output that follows the file descriptor? I am wondering if this is related to the issue that Wei is working on with the IO bgworkers.
Edit: Disregard these logs, they mixed output from both ranks into one file.
I updated the test.sh script on Sapling to collect logs via strace. Here's an example of a failing run, test_1:
You're also welcome to look at the entire collection of output from other runs on Sapling at /scratch/eslaught/resilience-test-network/strace_output/test_*.
I'm not entirely sure what you're looking for in the logs, but one thing I noticed is that logs contain either:
./test_13/output.log:[0 - 7f3de78fac40] 1.053084 {6}{aio}: Failed asynchronous IO read [9]: Bad file descriptor
Or they contain:
./test_19/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffd8e07ed3c) = -1 EBADF (Bad file descriptor)
./test_19/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffe1cf1af9c) = -1 EBADF (Bad file descriptor)
But they never contain both at the same time. I'm not sure if this is just a file buffering issue (though I have tried to flush things as aggressively as possible) or if this is somehow meaningful.
Thanks, I'll take a closer look later tonight. I'm trying to trace through when this file descriptor is closed and why. I suspect there is a race between this async read operation and a close somewhere that is invalidating the file descriptor. But I wanted to confirm that first with the strace logs.
If I understand the log correctly, I guess the issue is from:
shutdown(15, SHUT_RDWR) = 0
close(15) = 0
epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7fff51a2e42c) = -1 EBADF (Bad file descriptor)
We close fd 15 and then call epoll_ctl on it.
Yeah, that's what I suspected. Now the question is, where is the close coming from, and why is it before the async read is complete?
The fd is created by FileMemory::attempt_register_external_resource (https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/transfer/lowlevel_disk.cc#L161) and closed by FileMemory::unregister_external_resource (https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/transfer/lowlevel_disk.cc#L217), which are called by applications. I am not sure if it is the application's fault that the file is closed too early.
@elliottslaughter I ran grep -r "Bad file descriptor" * under your logs:
test_0/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffe4ecfeefc) = -1 EBADF (Bad file descriptor)
test_0/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffcc2732fec) = -1 EBADF (Bad file descriptor)
test_1/output.log:[0 - 7fafd8123c40] 1.250129 {6}{aio}: Failed asynchronous IO read [9]: Bad file descriptor
test_10/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffc0fd8cfac) = -1 EBADF (Bad file descriptor)
test_10/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7fffbf98589c) = -1 EBADF (Bad file descriptor)
test_11/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffc0d9becec) = -1 EBADF (Bad file descriptor)
test_11/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffd8c136b9c) = -1 EBADF (Bad file descriptor)
test_12/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffe7a19ce6c) = -1 EBADF (Bad file descriptor)
test_12/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffdd0adc6fc) = -1 EBADF (Bad file descriptor)
test_13/output.log:[0 - 7f3de78fac40] 1.053084 {6}{aio}: Failed asynchronous IO read [9]: Bad file descriptor
test_14/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7fff0985e1ac) = -1 EBADF (Bad file descriptor)
test_14/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffce4b272dc) = -1 EBADF (Bad file descriptor)
test_15/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffea041c8bc) = -1 EBADF (Bad file descriptor)
test_15/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffd3e62fc4c) = -1 EBADF (Bad file descriptor)
test_16/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffc0c1d849c) = -1 EBADF (Bad file descriptor)
test_17/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffe78182f3c) = -1 EBADF (Bad file descriptor)
test_17/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffec178aefc) = -1 EBADF (Bad file descriptor)
test_18/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffc9d4c549c) = -1 EBADF (Bad file descriptor)
test_18/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7fff75baa5ac) = -1 EBADF (Bad file descriptor)
test_19/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffd8e07ed3c) = -1 EBADF (Bad file descriptor)
test_19/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffe1cf1af9c) = -1 EBADF (Bad file descriptor)
test_2/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffc7b30b29c) = -1 EBADF (Bad file descriptor)
test_2/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffdd8ea2d3c) = -1 EBADF (Bad file descriptor)
test_20/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7fff7b7bf0ec) = -1 EBADF (Bad file descriptor)
test_20/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffcddc33eec) = -1 EBADF (Bad file descriptor)
test_21/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffc3f5afd8c) = -1 EBADF (Bad file descriptor)
test_21/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffee46d995c) = -1 EBADF (Bad file descriptor)
test_22/output.log:[0 - 7f67c81b6c40] 0.894134 {6}{aio}: Failed asynchronous IO read [9]: Bad file descriptor
test_23/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffe093a915c) = -1 EBADF (Bad file descriptor)
test_23/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffd3062e8bc) = -1 EBADF (Bad file descriptor)
test_24/output.log:[0 - 7fd3f8b68c40] 1.211296 {6}{aio}: Failed asynchronous IO read [9]: Bad file descriptor
test_25/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7fffc962893c) = -1 EBADF (Bad file descriptor)
test_25/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffcbad72a1c) = -1 EBADF (Bad file descriptor)
test_26/output.log:[0 - 7fac88d92c40] 1.004523 {6}{aio}: Failed asynchronous IO read [9]: Bad file descriptor
test_27/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffe6819513c) = -1 EBADF (Bad file descriptor)
test_27/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffe860aa9bc) = -1 EBADF (Bad file descriptor)
test_28/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffe73bd5e3c) = -1 EBADF (Bad file descriptor)
test_28/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7fffdf562cec) = -1 EBADF (Bad file descriptor)
test_29/output.log:[0 - 7fef361a4c40] 1.049102 {6}{aio}: Failed asynchronous IO read [9]: Bad file descriptor
test_3/output.log:[0 - 7ff7ba6a9c40] 1.039958 {6}{aio}: Failed asynchronous IO read [9]: Bad file descriptor
test_30/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7fff96dbebbc) = -1 EBADF (Bad file descriptor)
test_30/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffded2ea7ec) = -1 EBADF (Bad file descriptor)
test_31/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffcceb3f77c) = -1 EBADF (Bad file descriptor)
test_31/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffd91a6085c) = -1 EBADF (Bad file descriptor)
test_32/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffc788ec00c) = -1 EBADF (Bad file descriptor)
test_32/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffdc7d994ac) = -1 EBADF (Bad file descriptor)
test_33/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7fffe61e9eac) = -1 EBADF (Bad file descriptor)
test_33/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffea0f2d8ac) = -1 EBADF (Bad file descriptor)
test_34/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffe5e9ee28c) = -1 EBADF (Bad file descriptor)
test_34/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffe3ceee46c) = -1 EBADF (Bad file descriptor)
test_35/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffc4cf3e17c) = -1 EBADF (Bad file descriptor)
test_35/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffd98f3b48c) = -1 EBADF (Bad file descriptor)
test_36/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffe5b0cb2fc) = -1 EBADF (Bad file descriptor)
test_36/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffc0b79cdcc) = -1 EBADF (Bad file descriptor)
test_37/output.log:[0 - 7f000c137c40] 0.934232 {6}{aio}: Failed asynchronous IO read [9]: Bad file descriptor
test_38/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffc13eaa9ac) = -1 EBADF (Bad file descriptor)
test_38/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffe4037fdec) = -1 EBADF (Bad file descriptor)
test_39/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffedd5d141c) = -1 EBADF (Bad file descriptor)
test_39/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffc21d662ac) = -1 EBADF (Bad file descriptor)
test_4/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7fff51a2e42c) = -1 EBADF (Bad file descriptor)
test_4/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffcaa5f5a2c) = -1 EBADF (Bad file descriptor)
test_40/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffdec07645c) = -1 EBADF (Bad file descriptor)
test_40/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7fff43db0afc) = -1 EBADF (Bad file descriptor)
test_5/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffd9334317c) = -1 EBADF (Bad file descriptor)
test_6/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7fff4d68014c) = -1 EBADF (Bad file descriptor)
test_6/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffc3d44572c) = -1 EBADF (Bad file descriptor)
test_7/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffeab01d4ec) = -1 EBADF (Bad file descriptor)
test_7/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7fff5538572c) = -1 EBADF (Bad file descriptor)
test_8/output.log:[0 - 7f935423bc40] 0.943846 {6}{aio}: Failed asynchronous IO read [9]: Bad file descriptor
test_9/output.log:[0 - 7fbf58f5ec40] 1.278778 {6}{aio}: Failed asynchronous IO read [9]: Bad file descriptor
I realized that the Failed asynchronous IO read [9]: Bad file descriptor in output.log does not match up with the epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffc13eaa9ac) = -1 EBADF (Bad file descriptor) in strace.log. For example, we see
test_0/strace.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffe4ecfeefc) = -1 EBADF (Bad file descriptor)
in test_0, but test_0 actually finished correctly. On the other hand, we see
test_1/output.log:[0 - 7fafd8123c40] 1.250129 {6}{aio}: Failed asynchronous IO read [9]: Bad file descriptor
in test_1, but no error in its strace.log.
Sorry, the files probably got mixed up because these were multi-node runs and I wasn't properly piping the output of each rank to a separate file. Please disregard the files above and look at the ones below, which are properly split per-rank:
test_0
hits:
test_0/output_0.log:[0 - 7ffaa42b0c40] 1.399569 {6}{aio}: Failed asynchronous IO read [9]: Bad file descriptor
Files: output_0.log, output_1.log, strace_0.log, strace_1.log
test_2
hits:
test_2/strace_0.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7ffd1b81f8bc) = -1 EBADF (Bad file descriptor)
test_2/strace_1.log:epoll_ctl(11, EPOLL_CTL_DEL, 15, 0x7fff0f34f86c) = -1 EBADF (Bad file descriptor)
Files: output_0.log output_1.log strace_0.log strace_1.log
Once again I note that jobs hit one or the other but not both.
You can look at output from all of the jobs on Sapling: /scratch/eslaught/resilience-test-network/strace_output_per_rank
If we never see both, it is hard to tell whether the "Bad file descriptor" from strace comes from the Realm AIO.
Could you please apply the patch below and regenerate the logs, or tell me how to pipe the log of each rank correctly?
diff --git a/runtime/realm/transfer/channel_disk.cc b/runtime/realm/transfer/channel_disk.cc
index 492598c8f..62a83ab03 100644
--- a/runtime/realm/transfer/channel_disk.cc
+++ b/runtime/realm/transfer/channel_disk.cc
@@ -39,6 +39,7 @@ static int fsync(int fd)
#endif
namespace Realm {
+ extern Logger log_disk;
FileXferDes::FileXferDes(uintptr_t _dma_op, Channel *_channel,
NodeID _launch_node, XferDesID _guid,
@@ -66,6 +67,7 @@ namespace Realm {
RegionInstanceImpl *impl = get_runtime()->get_instance_impl(inst);
file_info = static_cast<FileMemory::OpenFileInfo *>(impl->metadata.mem_specific);
+ log_disk.print("create FileXferDes, fd=%d", file_info->fd);
const int max_nr = 10; // FIXME
file_reqs = (FileRequest*) calloc(max_nr, sizeof(DiskRequest));
diff --git a/runtime/realm/transfer/lowlevel_disk.cc b/runtime/realm/transfer/lowlevel_disk.cc
index 6f78b42e5..90c2f175b 100644
--- a/runtime/realm/transfer/lowlevel_disk.cc
+++ b/runtime/realm/transfer/lowlevel_disk.cc
@@ -204,6 +204,7 @@ namespace Realm {
OpenFileInfo *info = new OpenFileInfo;
info->fd = fd;
info->offset = res->offset;
+ log_disk.print("attempt_register_external_resource inst=%llx, fd=%d", inst->me.id, info->fd);
inst->metadata.add_mem_specific(info);
return true;
@@ -219,6 +220,7 @@ namespace Realm {
OpenFileInfo *info = inst->metadata.find_mem_specific<OpenFileInfo>();
assert(info != 0);
int ret = close(info->fd);
+ log_disk.print("unregister_external_resource inst=%llx, fd=%d", inst->me.id, info->fd);
if(ret == -1) {
log_disk.warning() << "file failed to close cleanly, disk contents may be corrupted";
}
I would like to know whether the Failed asynchronous IO read error is printed before or after we close the file.
To be clear, the files in https://github.com/StanfordLegion/legion/issues/1753#issuecomment-2342574312 were generated with the output correctly piped per rank. I believe that seeing errors in either the output or the strace log, but never both, is a fundamental property of the problem we are looking at, not a mistake.
But I can certainly generate logs with the diff you suggested.
I believe that seeing errors in either the output or the strace log, but never both, is a fundamental property of the problem we are looking at, not a mistake.
I would like to understand why the error could show up in the strace but not in the output. Where does this error come from?
I put the log files you requested here:
/scratch/eslaught/resilience-test-network/strace_output_per_rank_with_extra_logging
E.g., in test_0/output_0.log you'll see:
[0 - 7f4963860c40] 1.171475 {3}{disk}: attempt_register_external_resource inst=4000000000800000, fd=22
Done!
[0 - 7f4960967c40] 1.288046 {3}{disk}: create FileXferDes, fd=22
[0 - 7f4963860c40] 1.324472 {3}{disk}: unregister_external_resource inst=4000000000800000, fd=22
[0 - 7f4960a6bc40] 1.333938 {6}{aio}: Failed asynchronous IO read [9]: Bad file descriptor
I would like to understand why the error could show up in the strace but not in the output. Where does this error come from?
I don't know. But one thing to keep in mind is that strace reports syscalls. When we call aio_read or whatever, that's a function in libc. Maybe it maps directly to a syscall, or maybe it doesn't. Perhaps the aio_read function performs additional internal syscalls, and in some scenarios it handles the error while in others it doesn't? I'm speculating, but depending on how complex that function is, it's at least possible.
Based on the new logs, it is almost certain that we close the file before the aio call. There could be two causes:
BTW, here is the stack trace of closing the file; I am not sure whether we should blame Legion, Realm, or the application.
#9 0x00007f8673736c5d in Realm::FileMemory::unregister_external_resource (this=0x55b2d6b9c310, inst=0x7f85f6795820) at /scratch2/wwu/resilience-test-network/legion/runtime/realm/transfer/lowlevel_disk.cc:228
#10 0x00007f8673736ec6 in Realm::FileMemory::release_storage_immediate (this=0x55b2d6b9c310, inst=0x7f85f6795820, poisoned=false, work_until=...) at /scratch2/wwu/resilience-test-network/legion/runtime/realm/transfer/lowlevel_disk.cc:260
#11 0x00007f8673b3ee7e in Realm::MemoryImpl::release_storage_deferrable (this=0x55b2d6b9c310, inst=0x7f85f6795820, precondition=...) at /scratch2/wwu/resilience-test-network/legion/runtime/realm/mem_impl.cc:196
#12 0x00007f8673a9d82e in Realm::RegionInstance::destroy (this=0x7f85f6796e20, wait_on=...) at /scratch2/wwu/resilience-test-network/legion/runtime/realm/inst_impl.cc:521
#13 0x00007f867680a7d1 in Legion::Internal::PhysicalManager::perform_deletion (this=0x7f85f6796c40, source=0, i_lock=0x7f85f4200100) at /scratch2/wwu/resilience-test-network/legion/runtime/legion/legion_instances.cc:3212
#14 0x00007f867680a5b7 in Legion::Internal::PhysicalManager::perform_deletion (this=0x7f85f6796c40, source=0, i_lock=0x0) at /scratch2/wwu/resilience-test-network/legion/runtime/legion/legion_instances.cc:3179
#15 0x00007f8676cb51da in Legion::Internal::MemoryManager::detach_external_instance (this=0x7f85f6796840, manager=0x7f85f6796c40) at /scratch2/wwu/resilience-test-network/legion/runtime/legion/runtime.cc:11096
#16 0x00007f867680ac9d in Legion::Internal::PhysicalManager::detach_external_instance (this=0x7f85f6796c40) at /scratch2/wwu/resilience-test-network/legion/runtime/legion/legion_instances.cc:3310
#17 0x00007f86769ae6e3 in Legion::Internal::ReplDetachOp::detach_external_instance (this=0x7f85fa911e40, manager=0x7f85f6796c40) at /scratch2/wwu/resilience-test-network/legion/runtime/legion/legion_replication.cc:7078
#18 0x00007f867688d954 in Legion::Internal::DetachOp::trigger_complete (this=0x7f85fa911e40, effects=...) at /scratch2/wwu/resilience-test-network/legion/runtime/legion/legion_ops.cc:21175
#19 0x00007f867683e749 in Legion::Internal::Operation::complete_mapping (this=0x7f85fa911e40, wait_on=...) at /scratch2/wwu/resilience-test-network/legion/runtime/legion/legion_ops.cc:2019
#20 0x00007f86766e2465 in Legion::Internal::InnerContext::process_deferred_mapped_queue (this=0x7f85f6789650, precondition=..., previous_fevent=..., performed=0) at /scratch2/wwu/resilience-test-network/legion/runtime/legion/legion_context.cc:8880
#21 0x00007f86766f033c in Legion::Internal::InnerContext::handle_deferred_mapped_queue (args=0x55b2da2d6dc0) at /scratch2/wwu/resilience-test-network/legion/runtime/legion/legion_context.cc:12086
#22 0x00007f8676d07984 in Legion::Internal::Runtime::legion_runtime_task (args=0x55b2da2d6dc0, arglen=36, userdata=0x55b2da2d27e0, userlen=8, p=...) at /scratch2/wwu/resilience-test-network/legion/runtime/legion/runtime.cc:32502
#23 0x00007f8673b7e7f0 in Realm::LocalTaskProcessor::execute_task (this=0x55b2d6abf6e0, func_id=4, task_args=...) at /scratch2/wwu/resilience-test-network/legion/runtime/realm/proc_impl.cc:1176
#24 0x00007f8673bfb5dc in Realm::Task::execute_on_processor (this=0x55b2da2d6c40, p=...) at /scratch2/wwu/resilience-test-network/legion/runtime/realm/tasks.cc:326
#25 0x00007f8673c00956 in Realm::UserThreadTaskScheduler::execute_task (this=0x55b2d6bf6250, task=0x55b2da2d6c40) at /scratch2/wwu/resilience-test-network/legion/runtime/realm/tasks.cc:1687
#26 0x00007f8673bfe6f1 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x55b2d6bf6250) at /scratch2/wwu/resilience-test-network/legion/runtime/realm/tasks.cc:1160
#27 0x00007f8673c06dce in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop> (obj=0x55b2d6bf6250) at /scratch2/wwu/resilience-test-network/legion/runtime/realm/threads.inl:97
Is it even possible for a Legion application to detach an instance too early? The only way to do that is to flip the execution order of tasks, and then the instance would be simply invalid and the program would probably hit a deterministic error in Legion before ever reaching an I/O failure.
To me, your stack trace suggests Legion is processing the detach op through the runtime analysis pipeline, so if this is somehow out of order, it would be a dependence analysis failure in Legion.
I enabled the logging that checks the completion of the aio (https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/transfer/lowlevel_dma.cc#L603).
In the first run, the log is:
[0 - 7f53f418bc40] 2.357283 {3}{disk}: create FileXferDes, fd=22
[0 - 7f53f418bc40] 2.365385 {3}{aio}: aio op completed: op=0x7f53a001cb70
[0 - 7f53f5d38c40] 2.409349 {3}{disk}: unregister_external_resource inst=4000000000800000, fd=22
but in the second run (the failed one), the log becomes:
[0 - 7fe01363ac40] 1.273747 {3}{disk}: create FileXferDes, fd=22
[0 - 7fe01a4acc40] 1.290690 {3}{disk}: unregister_external_resource inst=4000000000800000, fd=22
[0 - 7fe01363ac40] 1.302596 {3}{aio}: aio op completed: op=0x7fdfa401f9d0
Apparently, the detach op is issued too early.
I added more logs to check whether the FileXferDes is marked completed before the detach. In failing runs, we detach the file before the xd is marked completed. @lightsighter do you think this is a bug in Legion?
Legion makes sure that the completion events for all copies involving an instance (including file instances) are used as preconditions for deleting the instance. I suspect that the problem is that some of the asynchronous I/O operations are not being correctly reflected in the completion event of the copies.
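For reference, a minimal Realm-level sketch of that invariant; the function name and setup here are illustrative, not the actual Legion code path. The copy's completion event must gate the destroy, which is what ultimately closes the file.

#include <vector>
#include "realm.h"

// Illustrative only: the completion event of a copy that touches a
// file-backed instance must be a precondition of destroying that instance,
// because the destroy path is what eventually closes the file descriptor.
Realm::Event copy_then_destroy(Realm::IndexSpace<1> is,
                               Realm::RegionInstance file_inst,
                               Realm::RegionInstance mem_inst,
                               Realm::FieldID fid, size_t field_size)
{
  std::vector<Realm::CopySrcDstField> srcs(1), dsts(1);
  srcs[0].set_field(file_inst, fid, field_size);
  dsts[0].set_field(mem_inst, fid, field_size);

  Realm::Event copy_done = is.copy(srcs, dsts, Realm::ProfilingRequestSet());

  // If this precondition is lost, the fd can be closed while the aio is in flight.
  file_inst.destroy(copy_done);
  return copy_done;
}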
@elliottslaughter do you think you could create a simple Legion reproducer for the pattern you are using for Regent checkpointing? I tried to create a reproducer that does:
attach_external_resource
issue_copy_operation
detach_external_resource
But I cannot reproduce the out-of-order issue (the detach being issued before the completion of the aio); even if I manually slow down the aio code, I cannot reproduce it. A rough sketch of the pattern is below.
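A minimal sketch of that attach/copy/detach pattern using the Legion C++ API; the file name, field id, and region setup are placeholders, not the actual reproducer.

#include <vector>
#include "legion.h"
using namespace Legion;

// Placeholder sketch: lr is backed by the attached file, lr_copy is an
// ordinary region that the copy writes into.
void attach_copy_detach(Runtime *runtime, Context ctx,
                        LogicalRegion lr, LogicalRegion lr_copy, FieldID fid)
{
  std::vector<FieldID> fields(1, fid);
  AttachLauncher attach(LEGION_EXTERNAL_POSIX_FILE, lr, lr);
  attach.attach_file("checkpoint.dat", fields, LEGION_FILE_READ_WRITE);
  PhysicalRegion file_pr = runtime->attach_external_resource(ctx, attach);

  CopyLauncher copy;
  copy.add_copy_requirements(
      RegionRequirement(lr, READ_ONLY, EXCLUSIVE, lr),
      RegionRequirement(lr_copy, WRITE_DISCARD, EXCLUSIVE, lr_copy));
  copy.add_src_field(0, fid);
  copy.add_dst_field(0, fid);
  runtime->issue_copy_operation(ctx, copy);

  // The detach must not close the file until the copy's effects are complete.
  Future done = runtime->detach_external_resource(ctx, file_pr);
  done.wait();
}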
I tried to run your Regent test sequentially, and the out-of-order issue disappeared. Apparently, the issue only happens with parallel runs. However, I even see the out-of-order logs in output_0.log for some successful runs, so I am not sure whether the "out of order" is just Realm printing the log messages out of order.
However, I even see the out-of-order logs in output_0.log for some successful runs, so I am not sure whether the "out of order" is just Realm printing the log messages out of order.
Realm's logging from the same node will be in the order it was logged in time, so I think you are actually seeing the two different orderings with your print. Can you log the completion events for all those aio copies in Realm and then also run with event=1 to see whether those events trigger before or after the aio operations are done?
@lightsighter here is what I found from the logs; they are divided into rank 0 and rank 1.
Logs of the successful run:
The file copy operation is 0x7f7cb001f5b0 and its completion event is 8000801000d00003; we can see the event is triggered before the external resource is unregistered.
[1 - 7f7d2d65ac40] 0.435905 {2}{dma}: dma request 0x7f7cb001f5b0 created - plan=0x7f7cb000ed10 before=8000801000f00002 after=8000801000d00003
[1 - 7f7d2d65ac40] 0.435919 {2}{dma}: dma request 0x7f7cb001f5b0 ready - plan=0x7f7cb000ed10 before=8000801000f00002 after=8000801000d00003
[1 - 7f7d2d65ac40] 0.435925 {2}{dma}: dma request 0x7f7cb001f5b0 started - plan=0x7f7cb000ed10 before=8000801000f00002 after=8000801000d00003
...
[1 - 7f7d266a9c40] 0.436674 {2}{dma}: dma request 0x7f7cb001f5b0 completed - plan=0x7f7cb000ed10 before=8000801000f00002 after=8000801000d00003
[1 - 7f7d266a9c40] 0.436679 {1}{event}: event triggered: event=8000801000d00003 by node 1 (poisoned=0)
[1 - 7f7d267adc40] 0.436922 {1}{event}: event subscription: node=0 event=8000801000d00003
[1 - 7f7d267adc40] 0.436930 {1}{event}: event subscription immediate trigger: node=0 event=8000801000d00003 (<= 3)
[0 - 7f4cfb7acc40] 0.413373 {3}{disk}: attempt_register_external_resource inst=4000000000800000, fd=22
[0 - 7f4cf88b3c40] 0.436266 {3}{disk}: create FileXferDes, op=0x7f7cb001f5b0, fd=22, xd=0x7f4c7c003640
[0 - 7f4cf89b7c40] 0.436386 {2}{xd}: bytes_write: 10000000a(0) 44+0 -> 0
[0 - 7f4cf89b7c40] 0.436399 {2}{xd}: completion: xd=10000000a total_bytes=44 pending=44
[0 - 7f4cf89b7c40] 0.436525 {3}{aio}: enqueue_read launched: op=0x7f4cac044030, req=0x7f4c7c003b30, xd=0x7f4c7c003640
...
[0 - 7f4cf88b3c40] 0.436593 {3}{aio}: aio op completed: op=0x7f4cac044030, req=0x7f4c7c003b30, xd=0x7f4c7c003640
[0 - 7f4cf88b3c40] 0.436609 {2}{xd}: bytes_read: 10000000a(0) 0+44 -> 44
[0 - 7f4cf88b3c40] 0.436617 {2}{xd}: bytes_write: 10000000a(0) 0+44 -> 44
[0 - 7f4cf88b3c40] 0.436623 {2}{xd}: completion: xd=10000000a remaining=0
[0 - 7f4cf89b7c40] 0.436634 {3}{xd}: xd=0x7f4c7c003640 mark_completed
[0 - 7f4cfb7acc40] 0.436860 {2}{event}: deferring user event trigger: event=8000002001000002 wait_on=8000801000d00003
[0 - 7f4cf89b7c40] 0.436944 {1}{event}: event update: event=8000801000d00003 poisoned=0:{}
[0 - 7f4cfb7acc40] 0.439114 {3}{disk}: unregister_external_resource inst=4000000000800000, fd=22
Logs of the failed run:
The copy 0x7fb44801f0f0 does not complete, and the event 8000801000c00004 is not triggered before the external resource is unregistered.
[1 - 7fb4bbbeec40] 1.073064 {2}{dma}: dma request 0x7fb44801f0f0 created - plan=0x7fb44800e7a0 before=8000801000e00002 after=8000801000c00004
[1 - 7fb4b8c8ec40] 1.073681 {2}{dma}: dma request 0x7fb44801f0f0 ready - plan=0x7fb44800e7a0 before=8000801000e00002 after=8000801000c00004
[1 - 7fb4b8c8ec40] 1.073689 {2}{dma}: dma request 0x7fb44801f0f0 started - plan=0x7fb44800e7a0 before=8000801000e00002 after=8000801000c00004
[1 - 7fb4b8c8ec40] 1.074819 {1}{event}: event subscription: node=0 event=8000801000c00004
[1 - 7fb4b8c8ec40] 1.074827 {1}{event}: event subscription recorded: node=0 event=8000801000c00004 (> 3)
[1 - 7fb4b8c8ec40] 1.074832 {1}{event}: event subscription immediate trigger: node=0 event=8000801000c00004 (<= 3)
[1 - 7fb4bbbeec40] 1.078966 {2}{event}: deferring user event trigger: event=8000801000600008 wait_on=8000801000c00004
[0 - 7f1485c70c40] 0.981099 {3}{disk}: attempt_register_external_resource inst=4000000000800000, fd=22
[0 - 7f147edb6c40] 1.074091 {3}{disk}: create FileXferDes, op=0x7fb44801f0f0, fd=22, xd=0x7f1400019810
[0 - 7f147edb6c40] 1.074235 {2}{xd}: bytes_write: 10000000a(0) 44+0 -> 0
[0 - 7f147edb6c40] 1.074247 {2}{xd}: completion: xd=10000000a total_bytes=44 pending=44
[0 - 7f147ecb2c40] 1.074396 {1}{event}: event update: event=8000801000e00002 poisoned=0:{}
[0 - 7f147edb6c40] 1.074402 {3}{aio}: enqueue_read launched: op=0x7f140001e400, req=0x7f140001de00, xd=0x7f1400019810
[0 - 7f1485c70c40] 1.074671 {2}{event}: deferring user event trigger: event=8000002000800005 wait_on=8000801000c00004
...
[0 - 7f1485c70c40] 1.082443 {3}{disk}: unregister_external_resource inst=4000000000800000, fd=22
I can send you the full logs if you need them.
@elliottslaughter Are you sure you fixed all the bugs where detaches were issued after exiting a parent task? There are still no checks for that in the master branch, so if you're doing that, it would cause a problem like this.
We delete the resilient runtime (which stores all the data structures) before running the postamble, if that's what you're asking:
I suppose if there were a set of loggers that listed the postamble execution and any potential detaches, that would answer the question definitively. Do you know if we have that?
There are no loggers that check that right now. The grandrefactor branch has a check that you aren't using a context after calling legion_task_postamble, but it's hard to backport that to the master branch because it relies on a bunch of other refactoring in that branch.
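For concreteness, a hypothetical sketch of the ordering constraint being discussed; the ResilientRuntime type and task wrapper below are made up, and only the relative order of the teardown and the postamble matters.

#include <vector>
#include "legion.h"

// Hypothetical stand-in for the resilience wrapper: its destructor is where
// any remaining detach_external_resource calls would be issued via ctx.
struct ResilientRuntime {
  Legion::Runtime *runtime;
  Legion::Context ctx;
  ~ResilientRuntime() { /* issue/retire any remaining detaches using ctx */ }
};

void task_wrapper(const Legion::Task *task,
                  const std::vector<Legion::PhysicalRegion> &regions,
                  Legion::Context ctx, Legion::Runtime *runtime)
{
  ResilientRuntime *resilient = new ResilientRuntime{runtime, ctx};
  // ... run the user's task body against the resilient runtime ...
  delete resilient;  // all detaches must be issued while ctx is still valid

  // Only after all context usage is done may the task end:
  Legion::Runtime::legion_task_postamble(runtime, ctx, nullptr, 0);
}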
I'd like to see detailed Legion Spy logging from a failing run along with the name of the completion event for the bad copy.
This is attempting to replicate a failure I've seen in the resilience CI, e.g. here: https://github.com/StanfordLegion/resilience/actions/runs/10647296358/job/29515103267
Failure mode:
Note the failure mode isn't the same as what I saw in CI, but this is the best I can do for now. It's possible that if we resolved this issue we'd see the other failure mode return afterwards.
Note I'm going to share the build with #1752, so watch out when you rebuild it:
To reproduce:
Reproduction ratio seems to be about 10%.
To rebuild: