TimDettmers / bitsandbytes

Accessible large language models via k-bit quantization for PyTorch.
https://huggingface.co/docs/bitsandbytes/main/en/index
MIT License

[RFC] Cross-Platform Refactor: Testing and CI/CD Strategy #1031

Open Titus-von-Koeller opened 5 months ago

Titus-von-Koeller commented 5 months ago

The goals of this issue are to:

Tackle flaky tests and reproducibility issues: address the challenges of flaky tests and non-determinism in test outputs, seeking community input on establishing more robust testing practices.

Enhance the testing framework for multi-backend support: discuss strategies for testing different backends, addressing the core question of how to ensure comprehensive test coverage across all supported platforms without hindering contributions.

Status quo

The current status quo is that we don't even have GPU-accelerated GitHub runners (Linux). So far we test contributions manually or have contributors submit their test results in the PR. Recently, an integration pipeline was added (thanks @younesbelkada) that tests the HF library integrations nightly on the Hugging Face side, but those results cannot be fed back into PRs directly (correct me if I'm wrong) and have so far been posted to the PR manually.

Support needed for GitHub runner infra

Given that we're planning to add official support for Intel, AMD, Mac, and Windows, the current approach is no longer feasible. To find a way forward that works well for all of these platforms, we would like to have an open discussion in this thread and try to align everyone around an approach that preferably works for everyone.

It will be hard to change this architectural decision once it's in place, which is why we want to take this step with caution, including stakeholders from each platform. To summarize: we think a cross-platform CI is needed to make sure we can release without fear of breaking the non-Linux + CUDA backends. We would welcome the community's contributions to make this a reality.

Unfortunately, we cannot use the HF organization's GitHub runner infrastructure: BNB shall remain independent, and using the runners would not comply with HF's security policies as long as BNB is not part of the org. In practice this means that BNB will need support in the form of compute grants to run CI/CD. Hugging Face has already agreed to cover a portion of that. However, Intel and AMD would need to contribute from their side, both with compute and with help building up the CI/CD system to make it run smoothly.

Concrete testing challenges

From what I can see, we would need to solve the following concrete testing challenges:

  1. Reproducibility:
    • Make sure we can run the BNB test suite reproducibly
    • We're still seeing flaky tests due to remaining non-determinism, caused by
      1. the need for comprehensive manual seeding
      2. different GPU models producing slightly different outputs that may fall outside the specified tolerances. These bounds might need to differ slightly across hardware platforms.
  2. Integration testing:
    1. It might make sense to have minimal tests that validate the integration points themselves.
    2. Ensure that our architecture allows us to run the whole test suite unchanged while switching to different hardware backends based on the platform, validating the expected behavior of BNB.
  3. We need to work towards a proper CI system and put in place the necessary infra of code + hardware-accelerated runners to enable quick feedback on new contributions.
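To make point 1 concrete, here is a minimal pytest-style sketch of what comprehensive seeding and per-hardware tolerance bounds could look like. The seed value, the tolerance numbers, and the helper names are illustrative assumptions for discussion, not BNB's actual test configuration.

```python
# Illustrative sketch only: SEED, the tolerance values, and the helper
# names below are assumptions, not bitsandbytes' actual test setup.
import random

import numpy as np
import torch

SEED = 1337  # hypothetical fixed seed


def seed_everything(seed: int = SEED) -> None:
    """Seed all RNG sources the tests rely on (point 1.1)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)


def tolerance_for(device_type: str) -> float:
    """Hypothetical per-backend tolerance (point 1.2): different hardware
    accumulates floating-point error differently, so the acceptable
    deviation from a reference result may vary by platform."""
    return {"cuda": 1e-4, "cpu": 1e-5}.get(device_type, 1e-4)


def test_matmul_close_to_reference():
    seed_everything()
    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.randn(32, 32, device=device)
    b = torch.randn(32, 32, device=device)
    # Compare against a higher-precision reference computation.
    ref = (a.double() @ b.double()).to(a.dtype)
    assert torch.allclose(a @ b, ref, atol=tolerance_for(device))
```

A per-backend tolerance table like this would let the same test run unchanged on every platform (point 2.2), with only the accepted numeric bounds varying by hardware.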
Titus-von-Koeller commented 5 months ago

#1021 is related to this