Closed: JohannesGaessler closed this 2 months ago
I am not very familiar with `test-grad0` (or any of the training code), but I think this could be integrated very well into `test-backend-ops`. The eval mode could also be extended to optionally generate backwards graphs and compare the results with the CPU backend, though I don't know if that would add anything of value over just checking the gradients by computing them numerically.
I pushed an extension to `test-backend-ops` that adds a new mode for numerically checking the gradients calculated by the backwards pass. Some comments:

- I added `GGML_ASSERT`s to unsupported ops that would fail once the backwards pass is invoked. I did this for consistency but I think it's kind of a bad solution. Long-term it would be good to at least have a CPU implementation for the backwards pass of all ops so that the compute graph can be safely constructed and simply checked with `ggml_backend_supports_op` (right now there is additional logic in the test code to prevent crashes from the failed asserts).
- I added backwards pass support for `SUM`, `ADD1`, and `NEG`.
- The failure rate of `SIN` and `COS` was ~12%. With the adjusted parameters the failure rate for a threshold of $10^{-4}$ was ~4%; for the increased threshold of $10^{-3}$ it is ~0.1%.
There are a lot more cases like the last two. If every created tensor needs to be a param, do it automatically in the `ggml_new_tensor` functions in the parent class. If the reason some are disabled is that ggml does not support the backwards pass, then extract that into a function that makes it clear, and disable it by default.
There are some tricky edge cases like `CROSS_ENTROPY_LOSS` where the backwards pass calculates gradients only for the logits but not the labels, even though a variation of the labels leads to a change in the output. Would something like this be okay:

- A method `should_be_param(const struct ggml_tensor * t)` that determines which tensors should be made parameters (and that defaults to false).
- Extending `ggml_new_tensor` with an optional argument for the tensor name that can be used to resolve edge cases in `should_be_param` when it's not possible to do so by e.g. type.

I think it would be enough to add some comments explaining the cases where `ggml_set_param` is omitted, and to remove the commented ones from the operators that do not support a backwards pass.
I added a small bit of documentation to the top of the file to explain the intended usage, as well as a heavily commented example struct for a new op that covers the bare minimum needed for a new test. Going through the code I noticed that it's possible to avoid explicit logic for `ggml_set_param` for at least some of the ops, so I did that where possible.
My goal with this PR is to add tests that check gradients for the backwards pass in a generic way for all backends. My current state is that I can successfully calculate and check gradients for CUDA as long as all ops are FP32 and supported. This PR is nowhere near usable but I'm opening it anyway to discuss the best way to implement the tests. Right now I'm doing the tests via modification of `tests/test-grad0` (and some hacks), but I think long-term it would make more sense to re-use the code in `tests/test-backend-ops`. From my perspective the easiest way to do this would be to extend the existing code constructs in `tests/test-backend-ops` with a new mode that checks gradients numerically. @slaren your feedback would be appreciated.