Automated testing is fundamental to keeping a collaboratively developed project free of bugs that silently break modules which previously worked. For a deep learning library in particular, always running the full training or analysis process end to end consumes substantial time and computational resources, and minor bugs may never be triggered under a fixed training setting. It is therefore necessary to test at multiple levels to ensure proper functioning as much as possible.
I propose adding the following 4 categories of testing:
Unit testing: Testing whether each innermost method works correctly with mock data, e.g. a single forward pass through a minimal SAE, or a single activation-generation step. Unit tests should cover almost all parts of the library, so every individual test must run fast.
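As a sketch of what such a unit test might look like (all names here are hypothetical, not the library's actual API), a minimal SAE forward pass can be exercised with mock data in plain NumPy and checked with bare assertions, which a runner like pytest would collect:

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Single forward pass of a minimal sparse autoencoder (illustrative only)."""
    acts = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU feature activations
    recon = acts @ W_dec + b_dec               # linear reconstruction
    return acts, recon

def test_sae_forward_shapes():
    rng = np.random.default_rng(0)
    d_model, d_sae = 4, 8
    x = rng.normal(size=(2, d_model))
    W_enc = rng.normal(size=(d_model, d_sae))
    W_dec = rng.normal(size=(d_sae, d_model))
    acts, recon = sae_forward(x, W_enc, np.zeros(d_sae), W_dec, np.zeros(d_model))
    assert acts.shape == (2, d_sae)
    assert recon.shape == (2, d_model)
    assert (acts >= 0).all()  # ReLU output is non-negative

test_sae_forward_shapes()
```

Because the test uses tiny random tensors rather than a real model, it runs in milliseconds, which is the property unit tests here need.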
Integration testing: Testing whether low-level modules work together properly, e.g. computing feature activations directly from text input (which requires the transformer and the SAE to cooperate), running a single training pass, or loading pretrained SAEs from HuggingFace. These tests should cover the common usage of the library at a relatively high level. Each test should also complete within an acceptable time (perhaps no more than a few seconds) and should not depend on GPUs if possible.
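A sketch of the text-to-features integration path, with the transformer replaced by a mock so the test stays fast and GPU-free (class and function names are illustrative assumptions, not the library's real interfaces):

```python
import numpy as np

class MockLM:
    """Stand-in for a transformer: maps text to fixed-size activation vectors."""
    def __init__(self, d_model, seed=0):
        self.d_model = d_model
        self.rng = np.random.default_rng(seed)

    def activations(self, text):
        # One random activation vector per whitespace token.
        n_tokens = max(len(text.split()), 1)
        return self.rng.normal(size=(n_tokens, self.d_model))

def feature_activations(text, lm, W_enc, b_enc):
    """Integration path under test: text -> LM activations -> SAE features."""
    acts = lm.activations(text)
    return np.maximum(acts @ W_enc + b_enc, 0.0)

def test_text_to_features():
    lm = MockLM(d_model=4)
    rng = np.random.default_rng(0)
    feats = feature_activations("hello sparse world", lm,
                                rng.normal(size=(4, 8)), np.zeros(8))
    assert feats.shape == (3, 8)  # three tokens, eight SAE features

test_text_to_features()
```

The point of the mock is that the test exercises the *composition* of the two components, not either component's internals, which unit tests already cover.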
Acceptance testing: Testing whether modules meet performance expectations (loss, memory allocated, time cost), e.g. whether a pretrained SAE achieves a reasonable loss. Some of these tests may require GPUs to run. Failure of these tests may be acceptable in some situations.
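Since acceptance failures may be tolerable, one way to sketch such a check is to compare a metric against a threshold and emit a warning rather than a hard failure (the threshold value and function names below are hypothetical):

```python
import warnings
import numpy as np

LOSS_THRESHOLD = 0.25  # hypothetical acceptance bound for reconstruction loss

def reconstruction_loss(x, recon):
    return float(np.mean((x - recon) ** 2))

def check_pretrained_sae(x, recon, threshold=LOSS_THRESHOLD):
    """Acceptance check: warn (rather than hard-fail) when loss exceeds the bound."""
    loss = reconstruction_loss(x, recon)
    if loss > threshold:
        warnings.warn(f"reconstruction loss {loss:.4f} exceeds bound {threshold}")
        return False
    return True

# A perfect reconstruction trivially passes the check.
x = np.ones((2, 4))
assert check_pretrained_sae(x, x.copy())
```

With pytest, the GPU-dependent variants of such checks would typically be guarded by a skip marker (e.g. skipping when no CUDA device is available) so the rest of the suite still runs on CPU-only machines.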
Benchmarks: Measuring the runtime of a complete process and of suspected bottleneck modules.
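A minimal benchmarking harness along these lines (a dedicated tool such as pytest-benchmark would be more robust; this stdlib-only sketch just shows the idea, and the function names are illustrative):

```python
import time

def benchmark(fn, *args, repeats=5):
    """Return the best wall-clock time of fn(*args) over several repeats."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

def bottleneck(n):
    # Placeholder for a suspected hot path, e.g. an SAE training step.
    return sum(i * i for i in range(n))

elapsed = benchmark(bottleneck, 10_000)
assert elapsed >= 0.0
```

Taking the best of several repeats reduces noise from the OS scheduler and one-off warmup costs, which matters when comparing runs across commits.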
Continuous Integration (CI) with GitHub workflows should also be added to run the tests on every push/PR. PRs should not be merged unless all tests pass.
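A workflow for this could look roughly like the following (file path, branch names, Python version, and the `gpu` marker convention are all assumptions to be adapted, not the repository's actual configuration):

```yaml
# .github/workflows/test.yml (hypothetical)
name: tests
on:
  push:
    branches: [main, dev]
  pull_request:
    branches: [main, dev]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -e ".[dev]"
      # Hosted runners have no GPU, so GPU-marked tests are excluded here.
      - run: pytest tests/ -m "not gpu"
```

Branch protection rules on GitHub can then require this check to pass before a PR becomes mergeable.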
Tracking: @Frankstein73 has added CI workflows that automatically run tests on pushes and PRs to the main/dev branches. Feel free to complete the missing tests for all modules!