Allow users to pass max_threads to the avif encoder via Image.save

Copy of #49 from @yit-b, which was accidentally closed when I unintentionally pushed this repository's main branch to the PR's origin (and now that the pull request is closed I am unable to push the correct ref). Original PR body:

It's not always desirable to set max_threads equal to the number of cpus on the host machine - especially in shared or containerized environments. Performance is often significantly worse if you have too many CPUs (see benchmarks below).

This PR gives users control over how many threads will be used for encoding by adding a max_threads argument accepted by Image.save(). Example:

save_test.py

import time
import pillow_avif
from PIL import Image

if __name__ == "__main__":
    img = Image.open("tests/images/flower.jpg")

    n_warmups = 1
    n_iters = 10

    for n_threads in range(0, 17):
        for _ in range(n_warmups):
            img.save("flower.avif", quality=80, max_threads=n_threads)
        start = time.time()
        for _ in range(n_iters):
            img.save("flower.avif", quality=80, max_threads=n_threads)
        print(
            f"N threads: {n_threads}. Avg time: {((time.time() - start) / n_iters * 1000):.2f} ms/img"
        )

Output:

N threads: 0. Avg time: 71.45 ms/img
N threads: 1. Avg time: 128.72 ms/img
N threads: 2. Avg time: 76.32 ms/img
N threads: 3. Avg time: 61.18 ms/img
N threads: 4. Avg time: 62.81 ms/img
N threads: 5. Avg time: 61.81 ms/img
N threads: 6. Avg time: 61.33 ms/img
N threads: 7. Avg time: 65.15 ms/img
N threads: 8. Avg time: 62.29 ms/img
N threads: 9. Avg time: 63.38 ms/img
N threads: 10. Avg time: 62.56 ms/img
N threads: 11. Avg time: 61.89 ms/img
N threads: 12. Avg time: 64.02 ms/img
N threads: 13. Avg time: 64.10 ms/img
N threads: 14. Avg time: 66.18 ms/img
N threads: 15. Avg time: 64.52 ms/img
N threads: 16. Avg time: 64.76 ms/img

The default behavior remains unchanged. It is 0 if not specified which is set to the cpu count.

Maybe this should be changed to a more reasonable default since performance seems the same with max_threads=0 (all CPUs) as it does with just 2 - probably because of contention? Maybe changing the default is outside the scope of this PR but I'd like to be able to tailor the parallelism to my compute environment.

Reproduce my tests:

conda activate foobar # This is some new empty conda environment
conda install -c conda-forge pillow 'libavif>=1.0.2' aom 'python=3.10.*' pillow
cd pillow-avif-plugin
pip install --no-deps .
python save_test.py

fdintino / pillow-avif-plugin

Allow users to pass max_threads to the avif encoder via Image.save #54