Establishing the test infrastructure

@CodeWithKyrian wrote in https://github.com/CodeWithKyrian/transformers-php/pull/36#issuecomment-2120853914:

If you notice, there aren't many tests in the library, which I'm not proud of. This is because I want to take my time to decide on the best structure for the tests. Classes like Tensor and Image can be easily tested, and I included tests for the basic tokenizer because the config file sizes are relatively small. However, the overall testing structure of the library is still largely undecided. [...] Since you seem to have a keen interest in testing, I would really appreciate your suggestions on how best to structure tests for the project.

I suggest first settling on the objectives and approaches/styles for the library, and then deriving how to set up the test infrastructure from them.

Based on the documentation and your summary in #34, I assume you have the following objectives:

1. Mirror the functionality of the HuggingFace Python library as closely as possible.
2. Use plain PHP wherever possible, combined with access to C libraries (and others) via the FFI extension.
3. Imitate the code style so the transition for people coming from Python (or other languages) to PHP is as smooth as possible. Here is an example of the code style: https://codewithkyrian.github.io/transformers-php/summarization#running-a-pipeline-session
4. You provide the library as open source on GitHub, so I assume you want to attract (or at least hope for) people who help you develop and maintain it.

Feel free to object to any of these points or complement/update them, so there is no misunderstanding.

A few side remarks:

Even though this library has a good role model (the HuggingFace Python library) that outlines the direction of development, it's still very young, so there is much room for decisions.
It seems as if you want to carry the Python (functional) style of doing things over to the test environment as well. That's probably why you use the Pest test framework. Is this style of coding important to you, or are you willing to change it (at least partly)? There is no need to justify it: you are the author (benevolent dictator for life, BDFL) of the library and are free to decide how things are supposed to be done.

Suggested test environment

1. Decide which framework to use: Pest or PHPUnit

I don't know the Pest framework well, but while writing tests for #36 I found it cumbersome to use, because I personally don't like the functional style of doing things (point 3). Others share this opinion, but on the other hand, Pest provides very nice output (reference).

PHPUnit is widely known and represents a major style of writing tests in PHP. It is used by big PHP projects such as the Symfony framework and Doctrine. If my assumption (point 4) is correct, you should consider a tool that more PHP developers are likely to know. In my experience, people are either not familiar with testing at all, or, if they are, they at least know PHPUnit.

✔️ I would switch to PHPUnit, but explore whether Pest can improve the output.

2. Decide which role a test plays

Tests can play various roles. Not only can they describe certain aspects of the software, they can also show that a certain functionality is provided. I would take a very broad view of tests and use them for everything that suits this library, which means:

Write a test to show that a function behaves as intended (e.g. check return value for a given input)
Write a test to show that a (set of) function(s) stays within certain limits (e.g. memory usage for certain inputs).
Write a test to show that a certain misbehavior (bug) was fixed (e.g. each pull request containing a fix should demonstrate that it works)
Write a test to keep track of current problems. In a project of mine, I used a skipped test to remind me that a certain functionality still didn't work. The idea was that if it was ever fixed, my tests would notify me.

(and more ...)
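To make these roles concrete, here is a minimal PHPUnit sketch, assuming PHPUnit 10 or later. The function truncate() and the class TruncateTest are hypothetical, defined inline only to keep the example self-contained; they are not part of transformers-php. It shows a behavior test, a regression test for a made-up bug, and the "known problem" pattern: the test skips itself while the bug exists and fails as soon as someone fixes it, which is the notification.

```php
<?php

declare(strict_types=1);

use PHPUnit\Framework\TestCase;

// Hypothetical function under test (not library code). It truncates by bytes via
// substr(), so it still splits multibyte characters: that is the "known problem".
function truncate(string $text, int $max): string
{
    // max(0, ...) is the hypothetical bug fix: a negative $max used to make
    // substr() count from the end and return almost the whole string.
    return substr($text, 0, max(0, $max));
}

final class TruncateTest extends TestCase
{
    // Role: the function behaves as intended for a given input.
    public function testReturnsPrefixOfAsciiText(): void
    {
        $this->assertSame('hel', truncate('hello', 3));
    }

    // Role: demonstrates that a (hypothetical) bug stays fixed.
    public function testNegativeLengthReturnsEmptyString(): void
    {
        $this->assertSame('', truncate('hello', -1));
    }

    // Role: keeps track of a known problem. Skipped while the bug exists;
    // once it is fixed, the test fails and asks to be turned into a regular test.
    public function testKnownProblemMultibyteCharacters(): void
    {
        if (truncate("h\u{e9}llo", 2) !== "h\u{e9}") {
            $this->markTestSkipped('Known problem: truncate() splits multibyte characters.');
        }

        $this->fail('truncate() now handles multibyte text; convert this into a regular test.');
    }
}
```

Saved as TruncateTest.php and run with vendor/bin/phpunit TruncateTest.php, the first two tests should pass and the third should be reported as skipped.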

3. Folder structure

The following structure has served us well in various projects I (help) maintain on GitHub (e.g. PDFParser, EasyRdf fork), because it is flexible enough to grow with the project while providing enough structure to avoid files "flying around". The basic structure is:

test
|
`--- files
|       |
|       `--- forIssue33.txt
|       `--- ...
|
`--- TestCase.php        <== Root class for each test

tests
|
`--- IntegrationTests   <== Majority of tests: because everything is entangled and people usually don't care
|     |                     or don't know better, integration tests are a good fit.
|     |                     This folder usually contains unit tests, system tests, ... too.
|     |
|     `--- Class1Test.php
|     `--- ...
|
`---- ModelDependentTests   <== if it makes sense, test the library using certain models to check boundaries etc.
|     `--- ModelXTest.php
|     `--- ...
|
`---- PerformanceTests
...

Each folder in tests should represent a major area of interest for the library. I can imagine that overall performance or memory usage is a high priority. In that case, a separate test area (such as PerformanceTests) might help, so that long-running tests or tests that need a special environment don't pollute, for instance, the integration tests (folder IntegrationTests). Also, not all tests have to run on every new commit; some might only run when a new PR is created. Getting this right might take some time to observe and configure.
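A sketch of what the root class (TestCase.php in the tree) could look like, assuming PHPUnit as the framework. The namespace Tests, the helpers fixturePath() and skipUnlessSlowTestsEnabled(), and the RUN_SLOW_TESTS environment variable are all hypothetical names; the idea is only that shared concerns, such as locating files in the files folder or opting into long-running tests, live in one place.

```php
<?php

declare(strict_types=1);

namespace Tests; // hypothetical namespace for the test/ folder

use PHPUnit\Framework\TestCase as PHPUnitTestCase;

/**
 * Root class for each test (hypothetical sketch).
 */
abstract class TestCase extends PHPUnitTestCase
{
    // Absolute path to a fixture stored next to this class in files/,
    // e.g. fixturePath('forIssue33.txt'). Fails the test if the file is missing.
    protected function fixturePath(string $name): string
    {
        $path = __DIR__ . '/files/' . $name;

        if (!is_file($path)) {
            $this->fail("Fixture not found: {$path}");
        }

        return $path;
    }

    // Skips long-running tests unless explicitly enabled, e.g. only in a
    // CI job that runs for pull requests. RUN_SLOW_TESTS is a hypothetical name.
    protected function skipUnlessSlowTestsEnabled(): void
    {
        if (getenv('RUN_SLOW_TESTS') !== '1') {
            $this->markTestSkipped('Set RUN_SLOW_TESTS=1 to run long-running tests.');
        }
    }
}
```

A test in PerformanceTests would extend this class and call $this->skipUnlessSlowTestsEnabled() first, so a regular commit build skips it while a PR-only or scheduled job sets the variable. PHPUnit's own test groups (the --group and --exclude-group command-line options) are an alternative to an environment variable.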

All tests should run as part of the continuous integration (CI) pipeline here, using GitHub Actions. It is well documented, and people use it for a wide variety of things (e.g. compilation, tests, data aggregation).

4. Include static code analysis

I will keep this one short. Static code analyzers such as PHPStan only use the source code (plus some configuration) and don't need custom test cases; they check the code against a set of rules defined by that configuration. One of their major benefits is finding errors the developer usually doesn't think about. They also help establish type safety in the code base. Feel free to ask for more info.
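As a small illustration (both functions are hypothetical, not library code): strpos() returns int|false, and the first function passes that result straight to substr(). Without strict types, PHP silently turns false into 0, so a missing marker returns the whole text instead of signalling a problem. No test case needs to exist for a static analyzer to flag this call; PHPStan reports it once it runs at a rule level that checks union types (to my knowledge level 7, e.g. vendor/bin/phpstan analyse --level 7 src), though the exact level and message wording may vary between versions.

```php
<?php

// Hypothetical helper a static analyzer would flag: strpos() returns int|false,
// but substr() expects an int offset. Without declare(strict_types=1), false is
// coerced to 0, so a missing marker silently returns the whole text.
function fromMarkerUnchecked(string $text, string $marker): string
{
    return substr($text, strpos($text, $marker));
}

// Version that handles the "not found" case explicitly.
function fromMarker(string $text, string $marker): ?string
{
    $position = strpos($text, $marker);

    if ($position === false) {
        return null; // marker not present
    }

    return substr($text, $position);
}

var_dump(fromMarkerUnchecked('config.json', '#')); // string(11) "config.json"
var_dump(fromMarker('config.json', '#'));          // NULL
var_dump(fromMarker('model#v2', '#'));             // string(3) "#v2"
```

The fix makes the failure case part of the return type (?string), which in turn forces callers to handle it; the analyzer then checks that they do.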

Please read this as a suggestion and feel free to do whatever you want with it, no hard feelings. I might send further PRs in the future, but it really depends on my available time. I just wanted to add this so there is no misunderstanding.

Hi @k00ni,

First off, I want to sincerely thank you for taking the time to present your thoughts in such an articulate and well-thought-out manner. These suggestions are very helpful and valuable.

Framework Choice: Pest or PHPUnit

Your assumptions from my summary of the objectives are spot on. However, I want to clarify my choice of PestPHP. It's not necessarily because it's functional and similar to the Python style; I believe it's because I have a bias towards PestPHP. Until now, I honestly believed many PHP developers were comfortable with it and preferred it, but I can see that's probably because most people in my circle are in the Laravel ecosystem, where Pest is widely supported and used. Setting up PestPHP is very straightforward for me, and I genuinely enjoy its syntax and output format, but I understand that this might not be the case for every PHP developer, especially those outside the Laravel community. You're right; I want to attract as many capable hands as possible to get the project to a more stable state.

I'm a stickler for OOP in PHP, but I surprisingly enjoy the simple functional syntax of PestPHP. I especially love the output; losing it would be a deal-breaker for me. However, I've noted your point and I'm open to exploring PHPUnit, especially since the Pest binary can also run PHPUnit tests, potentially giving us the best of both worlds. I'll consider this and decide whether to switch entirely to PHPUnit or stick with PestPHP. In the end, though, this isn't the main factor holding me back.

Role of Tests and Test Environment

Regarding the tests, I initially thought of structuring the folders to match the src folder structure but eventually decided against it. For components like Pretokenizers, Decoders, PostProcessors, and Normalizers, I can easily write unit tests. For the Tokenizers proper, which use some of these components depending on the config, I plan to write tests to ensure they invoke the right components and return the correct values. I could apply a similar approach to Processors and Feature Extractors.

However, things get complicated when testing the Model classes. Testing whether they do the right thing with the right input is challenging because models come in various shapes and configs. Pulling real models from the HuggingFace hub to test against them would mean downloading a substantial amount of data (over 10GB) and copying many files, which would impact CI pipelines significantly. This is why I initially postponed this aspect of testing.

I considered training my own mini versions of the models, manually creating the vocab files, config files, etc., and testing the behavior for different config settings. However, this approach also entails a lot of manual work, and this isn't my full-time job 😮‍💨.

This week, however, during my research, I discovered that HuggingFace created a new organization about a month ago, hf-internal-testing, with repositories containing very tiny versions of real models (~4MB) for testing model behavior. This fits our needs perfectly and got me excited. Transformers.js already uses them, and Xenova, being part of the HuggingFace team, has already uploaded the ONNX weights for a couple of them, so we might as well adopt them.

Folder Structure

Regarding the folder structure, you're right. I'm very interested in performance and memory usage, but my priority for now is general usage testing. For those tests, do you think it's wise to mirror the src folder? Another problem is that some tests might be long-running while others are not. It's not just the performance or memory usage tests that are long-running, and I agree that not all tests need to run on every commit. For those cases, I want to ensure everything works at the base level.

Static Analysis

As for static analysis, I've given it much thought, and I agree that it's a good starting point. I'll implement this while still researching the best structure for the tests.

Finally, while I'm the maintainer and the BDFL (funny name 😅), I'm building this for every PHP developer interested in the project. I want to attract more hands to help develop and maintain it. I appreciate your straightforwardness and want to assure you that you're not stepping on any toes. Your suggestions are invaluable, and I'm open to more.

Thank you once again for your thoughtful feedback and contributions.
