While working on the training code I encountered an issue where sometimes the gradients would not be calculated. After some debugging the problem seems to be undefined behavior in ggml_graph_cpy. The code inserts all keys that are not null. But the keys themselves are never cleared since the intended way to use the code is to check whether the keys are valid via ggml_bitset_get. As a consequence, garbage keys can be copied to the visited_hash_set of the new graph and treated as valid keys since they are treated as in use after insertion. This garbage can then prevent the backwards graph from being properly constructed so the gradients are not calculated.
I'm spinning the fix out into a separate PR since the bug is rather simple but very time-consuming to track down. As far as I can tell ggml_graph_cpy is not being used in llama.cpp and whisper.cpp (outside of test code). There are more checks against keys being null in ggml_recompute_graph_node which are possibly also wrong but I don't know what the code does and the worst thing that could happen there is that a GGML_ASSERT is triggered when it shouldn't be.
While working on the training code I encountered an issue where sometimes the gradients would not be calculated. After some debugging the problem seems to be undefined behavior in
ggml_graph_cpy
. The code inserts all keys that are not null. But the keys themselves are never cleared since the intended way to use the code is to check whether the keys are valid viaggml_bitset_get
. As a consequence, garbage keys can be copied to thevisited_hash_set
of the new graph and treated as valid keys since they are treated as in use after insertion. This garbage can then prevent the backwards graph from being properly constructed so the gradients are not calculated.I'm spinning the fix out into a separate PR since the bug is rather simple but very time-consuming to track down. As far as I can tell
ggml_graph_cpy
is not being used inllama.cpp
andwhisper.cpp
(outside of test code). There are more checks against keys being null inggml_recompute_graph_node
which are possibly also wrong but I don't know what the code does and the worst thing that could happen there is that aGGML_ASSERT
is triggered when it shouldn't be.