ggerganov / ggml

Tensor library for machine learning
MIT License
11.22k stars 1.04k forks source link

ggml: fix ggml_graph_cpy undefined behavior #943

Closed JohannesGaessler closed 2 months ago

JohannesGaessler commented 2 months ago

While working on the training code I encountered an issue where sometimes the gradients would not be calculated. After some debugging the problem seems to be undefined behavior in ggml_graph_cpy. The code inserts all keys that are not null. But the keys themselves are never cleared since the intended way to use the code is to check whether the keys are valid via ggml_bitset_get. As a consequence, garbage keys can be copied to the visited_hash_set of the new graph and treated as valid keys since they are treated as in use after insertion. This garbage can then prevent the backwards graph from being properly constructed so the gradients are not calculated.

I'm spinning the fix out into a separate PR since the bug is rather simple but very time-consuming to track down. As far as I can tell ggml_graph_cpy is not being used in llama.cpp and whisper.cpp (outside of test code). There are more checks against keys being null in ggml_recompute_graph_node which are possibly also wrong but I don't know what the code does and the worst thing that could happen there is that a GGML_ASSERT is triggered when it shouldn't be.