Closed ggerganov closed 4 days ago
I managed to get a stacktrace for one of the seg faults:
Backend 1/3: Metal
Device description: Apple M2 Ultra
Device memory: 147456 MB (147450 MB free)
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
ggml_gallocr_reserve_n: reallocating Metal buffer from size 0.00 MiB to 0.00 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 0.00 MiB
test_dataset(shuffle=no, ndata_shard=1, ndata_batch=1): OK
test_dataset(shuffle=no, ndata_shard=1, ndata_batch=2): OK
test_dataset(shuffle=no, ndata_shard=1, ndata_batch=3): OK
test_dataset(shuffle=no, ndata_shard=1, ndata_batch=4): OK
test_dataset(shuffle=no, ndata_shard=1, ndata_batch=5): OK
test_dataset(shuffle=no, ndata_shard=1, ndata_batch=6): OK
test_dataset(shuffle=no, ndata_shard=2, ndata_batch=2): OK
test_dataset(shuffle=no, ndata_shard=2, ndata_batch=4): OK
test_dataset(shuffle=no, ndata_shard=2, ndata_batch=6): OK
test_dataset(shuffle=no, ndata_shard=3, ndata_batch=3): OK
test_dataset(shuffle=no, ndata_shard=3, ndata_batch=6): OK
test_dataset(shuffle=no, ndata_shard=4, ndata_batch=4): OK
test_dataset(shuffle=no, ndata_shard=5, ndata_batch=5): OK
test_dataset(shuffle=no, ndata_shard=6, ndata_batch=6): OK
test_dataset(shuffle=yes, ndata_shard=1, ndata_batch=1): OK
test_dataset(shuffle=yes, ndata_shard=1, ndata_batch=2): OK
test_dataset(shuffle=yes, ndata_shard=1, ndata_batch=3): OK
test_dataset(shuffle=yes, ndata_shard=1, ndata_batch=4): OK
test_dataset(shuffle=yes, ndata_shard=1, ndata_batch=5): OK
test_dataset(shuffle=yes, ndata_shard=1, ndata_batch=6): OK
test_dataset(shuffle=yes, ndata_shard=2, ndata_batch=2): OK
test_dataset(shuffle=yes, ndata_shard=2, ndata_batch=4): OK
test_dataset(shuffle=yes, ndata_shard=2, ndata_batch=6): OK
test_dataset(shuffle=yes, ndata_shard=3, ndata_batch=3): OK
test_dataset(shuffle=yes, ndata_shard=3, ndata_batch=6): OK
test_dataset(shuffle=yes, ndata_shard=4, ndata_batch=4): OK
test_dataset(shuffle=yes, ndata_shard=5, ndata_batch=5): OK
test_dataset(shuffle=yes, ndata_shard=6, ndata_batch=6): OK
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
ggml_gallocr_reserve_n: reallocating Metal buffer from size 0.00 MiB to 0.00 MiB
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
test_grad(): OK
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
test_forward_backward(high_level=no, shuffle=no, subtest=results_initial): OK
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
test_forward_backward(high_level=no, shuffle=no, subtest=weights_after_forward): OK
test_forward_backward(high_level=no, shuffle=no, subtest=results_after_forward): OK
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
test_forward_backward(high_level=no, shuffle=no, subtest=weights_after_forward_backward): OK
test_forward_backward(high_level=no, shuffle=no, subtest=result_after_forward_backward): OK
test_forward_backward(high_level=yes, shuffle=no, subtest=results_initial): OK
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
test_forward_backward(high_level=yes, shuffle=no, subtest=weights_after_forward): OK
test_forward_backward(high_level=yes, shuffle=no, subtest=results_after_forward): OK
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
test_forward_backward(high_level=yes, shuffle=no, subtest=weights_after_forward_backward): OK
test_forward_backward(high_level=yes, shuffle=no, subtest=result_after_forward_backward): OK
test_forward_backward(high_level=yes, shuffle=yes, subtest=results_initial): OK
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
test_forward_backward(high_level=yes, shuffle=yes, subtest=weights_after_forward): OK
test_forward_backward(high_level=yes, shuffle=yes, subtest=results_after_forward): OK
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
test_forward_backward(high_level=yes, shuffle=yes, subtest=weights_after_forward_backward): OK
test_forward_backward(high_level=yes, shuffle=yes, subtest=result_after_forward_backward): OK
test_epoch_vs_fit(): OK
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
test_idata_split(high_level=no, epoch=1, subtest=weights): OK
test_idata_split(high_level=no, epoch=1, subtest=results_backward): OK
test_idata_split(high_level=no, epoch=1, subtest=results_forward): OK
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
test_idata_split(high_level=no, epoch=2, subtest=weights): OK
test_idata_split(high_level=no, epoch=2, subtest=results_backward): OK
test_idata_split(high_level=no, epoch=2, subtest=results_forward): OK
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
test_idata_split(high_level=no, epoch=3, subtest=weights): OK
test_idata_split(high_level=no, epoch=3, subtest=results_backward): OK
test_idata_split(high_level=no, epoch=3, subtest=results_forward): OK
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
test_idata_split(high_level=no, epoch=4, subtest=weights): OK
test_idata_split(high_level=no, epoch=4, subtest=results_backward): OK
test_idata_split(high_level=no, epoch=4, subtest=results_forward): OK
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
test_idata_split(high_level=yes, epoch=1, subtest=weights): OK
test_idata_split(high_level=yes, epoch=1, subtest=results_backward): OK
test_idata_split(high_level=yes, epoch=1, subtest=results_forward): OK
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
test_idata_split(high_level=yes, epoch=2, subtest=weights): OK
test_idata_split(high_level=yes, epoch=2, subtest=results_backward): OK
test_idata_split(high_level=yes, epoch=2, subtest=results_forward): OK
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
test_idata_split(high_level=yes, epoch=3, subtest=weights): OK
test_idata_split(high_level=yes, epoch=3, subtest=results_backward): OK
test_idata_split(high_level=yes, epoch=3, subtest=results_forward): OK
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
test_idata_split(high_level=yes, epoch=4, subtest=weights): OK
test_idata_split(high_level=yes, epoch=4, subtest=results_backward): OK
test_idata_split(high_level=yes, epoch=4, subtest=results_forward): OK
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
test_gradient_accumulation(high_level=no, nbatch_physical=2, loss_type=sum, epoch=1, subtest=grads): OK
test_gradient_accumulation(high_level=no, nbatch_physical=2, loss_type=sum, epoch=1, subtest=weights): OK
test_gradient_accumulation(high_level=no, nbatch_physical=2, loss_type=sum, epoch=1, subtest=results): OK
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
test_gradient_accumulation(high_level=no, nbatch_physical=2, loss_type=sum, epoch=2, subtest=grads): OK
test_gradient_accumulation(high_level=no, nbatch_physical=2, loss_type=sum, epoch=2, subtest=weights): OK
test_gradient_accumulation(high_level=no, nbatch_physical=2, loss_type=sum, epoch=2, subtest=results): OK
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
test_gradient_accumulation(high_level=no, nbatch_physical=2, loss_type=sum, epoch=3, subtest=grads): OK
test_gradient_accumulation(high_level=no, nbatch_physical=2, loss_type=sum, epoch=3, subtest=weights): OK
test_gradient_accumulation(high_level=no, nbatch_physical=2, loss_type=sum, epoch=3, subtest=results): OK
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
test_gradient_accumulation(high_level=no, nbatch_physical=2, loss_type=sum, epoch=4, subtest=grads): OK
test_gradient_accumulation(high_level=no, nbatch_physical=2, loss_type=sum, epoch=4, subtest=weights): OK
test_gradient_accumulation(high_level=no, nbatch_physical=2, loss_type=sum, epoch=4, subtest=results): OK
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
ggml_gallocr_reserve_n: reallocating Metal buffer from size 0.00 MiB to 0.00 MiB
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
Process 52079 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xe8)
frame #0: 0x000000010038fc18 libggml-base.dylib`ggml_backend_tensor_get(tensor=0x0000000000000000, data=0x0000600000789244, offset=0, size=4) at ggml-backend.cpp:269:41
266 }
267
268 void ggml_backend_tensor_get(const struct ggml_tensor * tensor, void * data, size_t offset, size_t size) {
-> 269 ggml_backend_buffer_t buf = tensor->view_src ? tensor->view_src->buffer : tensor->buffer;
270
271 if (size == 0) {
272 return;
Target 0: (test-opt) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xe8)
* frame #0: 0x000000010038fc18 libggml-base.dylib`ggml_backend_tensor_get(tensor=0x0000000000000000, data=0x0000600000789244, offset=0, size=4) at ggml-backend.cpp:269:41
frame #1: 0x0000000100008c5c test-opt`test_gradient_accumulation(backend_sched=0x000000011fa5b400, backend=0x00006000029a0a80, nbatch_physical=2, loss_type=GGML_OPT_LOSS_TYPE_MEAN) at test-opt.cpp:598:17
frame #2: 0x0000000100006974 test-opt`test_backend(backend_sched=0x000000011fa5b400, backend=0x00006000029a0a80) at test-opt.cpp:815:43
frame #3: 0x0000000100005ee0 test-opt`main at test-opt.cpp:865:38
frame #4: 0x000000019acdc274 dyld`start + 2840
(lldb) print *tensor
error: Couldn't apply expression side effects : Couldn't dematerialize a result variable: couldn't read its memory
(lldb) print tensor
(const ggml_tensor *) nullptr
(lldb) frame select 1
frame #1: 0x0000000100008c5c test-opt`test_gradient_accumulation(backend_sched=0x000000011fa5b400, backend=0x00006000029a0a80, nbatch_physical=2, loss_type=GGML_OPT_LOSS_TYPE_MEAN) at test-opt.cpp:598:17
595 ggml_opt_forward_backward(cd.opt_ctx, cd.result);
596
597 grad_history[idata + 0] = 0.0f;
-> 598 ggml_backend_tensor_get(ggml_opt_grad_acc(cd.opt_ctx, cd.weights), grad_history.data() + idata + 1, 0, 1*sizeof(float));
599 }
600 } else {
601 GGML_ASSERT(false);
(lldb)
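The trace above shows `ggml_backend_tensor_get` being called with `tensor == nullptr`, and the fault address `0xe8` is consistent with reading a struct member at a small offset from a null pointer. As a rough, self-contained sketch of guarding such a call site, the snippet below uses hypothetical stand-ins: `fake_tensor`, `lookup_grad_acc`, `checked_tensor_get` and `self_check` are illustration names, not ggml APIs, and the padding that places `view_src` at offset `0xe8` is an assumption chosen only to mirror the fault address.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a tensor; this is NOT the real ggml_tensor layout.
struct fake_tensor {
    char                padding[0xe8]; // assumption: puts view_src at offset 0xe8, like the fault address
    const fake_tensor * view_src;
    float               value;
};

// Reading view_src through a null fake_tensor pointer would touch address 0xe8.
static_assert(offsetof(fake_tensor, view_src) == 0xe8, "view_src is expected at offset 0xe8");

// Hypothetical lookup that can fail, standing in for a call such as ggml_opt_grad_acc.
const fake_tensor * lookup_grad_acc(const std::vector<fake_tensor> & tensors, std::size_t index) {
    return index < tensors.size() ? &tensors[index] : nullptr;
}

// Reads the value, throwing instead of dereferencing a null pointer.
float checked_tensor_get(const fake_tensor * tensor) {
    if (tensor == nullptr) {
        throw std::runtime_error("gradient accumulator tensor is null");
    }
    const fake_tensor * src = tensor->view_src ? tensor->view_src : tensor;
    return src->value;
}

// Exercises both the valid path and the null path.
void self_check() {
    std::vector<fake_tensor> tensors(1); // value-initialized, so view_src == nullptr
    tensors[0].value = 1.5f;
    assert(checked_tensor_get(lookup_grad_acc(tensors, 0)) == 1.5f);

    bool threw = false;
    try {
        checked_tensor_get(lookup_grad_acc(tensors, 1)); // out of range, lookup returns nullptr
    } catch (const std::runtime_error &) {
        threw = true;
    }
    assert(threw);
}
```

Calling `self_check()` from a `main` exercises both paths. A check like this would only turn the crash into a readable failure; it would not explain why the lookup returned null in the first place.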
When running test-opt in a loop I also eventually see a test failure, but it manifests differently than on Georgi's machine: I always get a failure in test_regression. However, when I remove all other tests, test_regression no longer fails. The only common reference between the tests is an instance of ggml_backend_sched_t. When I modified the tests to initialize and free a dedicated instance for each test, I got the same failure pattern as Georgi. Also, looking at the tests, I notice that we don't actually have any test code outside of test-opt that uses ggml_backend_sched, so we can't conclusively tell whether ggml_backend_sched or ggml_opt is causing a failure in test-opt. Even without the optimization code, ggml_backend_sched is a fairly complex component and I think it would make sense to add tests (but if we do this we should coordinate). I would in principle be willing to write tests for ggml_backend_sched, but I don't feel very confident in my understanding of the code and will likely require assistance.
cc: @slaren
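To illustrate the difference between sharing one scheduler across tests and giving each test a dedicated instance, here is a minimal sketch. Everything in it is a hypothetical stand-in (`fake_sched`, `fake_sched_new`, `fake_sched_free`, `counting_test`, `run_shared`, `run_isolated`); it does not use or model the real ggml_backend_sched API, and it only shows how state carried by a shared handle can make one test depend on the tests that ran before it.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <vector>

// Hypothetical scheduler handle; the counter models state that persists between uses.
struct fake_sched {
    int n_reserved = 0;
};

fake_sched * fake_sched_new() { return new fake_sched(); }
void fake_sched_free(fake_sched * sched) { delete sched; }

// Deleter so that std::unique_ptr frees the handle through the matching free function.
struct fake_sched_deleter {
    void operator()(fake_sched * sched) const { fake_sched_free(sched); }
};
using fake_sched_ptr = std::unique_ptr<fake_sched, fake_sched_deleter>;

// A test receives the scheduler to use and reports the state it saw on entry.
using test_fn = std::function<int(fake_sched &)>;

// Example test: returns the counter as found, then modifies it.
int counting_test(fake_sched & sched) {
    return sched.n_reserved++;
}

// All tests share one scheduler, so earlier tests influence later ones.
std::vector<int> run_shared(const std::vector<test_fn> & tests) {
    fake_sched_ptr sched(fake_sched_new());
    std::vector<int> observed;
    for (const test_fn & test : tests) {
        observed.push_back(test(*sched));
    }
    return observed;
}

// Each test gets a fresh scheduler that is freed as soon as the test returns.
std::vector<int> run_isolated(const std::vector<test_fn> & tests) {
    std::vector<int> observed;
    for (const test_fn & test : tests) {
        fake_sched_ptr sched(fake_sched_new());
        observed.push_back(test(*sched));
    }
    return observed;
}
```

With three copies of `counting_test`, `run_shared` observes 0, 1, 2 while `run_isolated` observes 0, 0, 0, which is the kind of order dependence that per-test instances rule out.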
It is indirectly tested in any test that runs llama.cpp. I agree it would be good to have tests for it, but it's not an easy component to write unit tests for. At some point I will probably rewrite it in C++ with testing in mind. I don't see how it could cause ggml_opt_grad_acc to return NULL, however.
I think I worded my post poorly. I agree that in this particular instance the bug is overwhelmingly likely in ggml_opt. I was just thinking that tests would be nice to have in general.
@JohannesGaessler The test-opt test seg faults from time to time: https://github.com/ggml-org/ci/tree/results/ggml/17/8ebfcc5f125085d51e0953b2d8230c21358650/ggml-4-x86-cuda-v100#ctest_release
I can also reproduce this on master with my CUDA box by letting the following command run for a while:
The CPU backend would also occasionally fail in the test_gradient_accumulation test: