Rewrite job pool to support nested parallelism for improved scalability

Before this change, given N images to encode, the caller had two options:

Use basis_compress for each image; internally this API would use multiple threads (when possible) to process large images on multiple threads
Use basis_parallel_compress for the group of images; internally this API would encode each image on one thread

Neither of these would be optimal. For example, given 4 4K images and a 16-core system, basis_parallel_compress is suboptimal because it would only use 4 out of 16 cores for encoding; basis_compress would be better, but it still would be suboptimal because encoding of a large image doesn't scale perfectly and there would be gaps of inactivity where only one core would do the work (notably, decoding the image from PNG inputs or rescaling to compute mips). Also, scaling would decrease in efficiency for smaller mips even for one image.

To fully saturate the system, you could use basis_compress and call it from multiple threads. This solves the problem of having enough concurrent work, but results in overhead due to context switches/thread preemption. It's difficult to size the two pools required in this case (outer and inner) wrt how many threads each has - especially if the goal is to reach a given level of total parallelism that is less than the system provides (e.g. -j16 on a 24-thread system) to avoid system wide performance issues due to oversubscription.

For this to work well we ideally need to have basis_parallel_compress use a single job pool - both for "outer" processing, handling the flow of image decoding, and for "inner" processing (threading the work required for one image). This would get us the best of both worlds - a predictable total system fanout limited by the job pool capacity, and a full saturation of the said pool.

For this to work well, we need to change the mechanics of job dispatch. Notably, when we wait for the jobs to finish, it's important that this only interacts with the jobs launched for the current image - both in terms of which jobs we wait for, and which jobs we help process. Without this there's going to be cross-talk between individual "outer" jobs which can result in poor scalability.

To solve this, the job_pool API gets augmented with optional tokens - these are just outstanding job counters, and the pointer to the counter is saved in the job struct, which allows us to selectively steal only jobs relevant for a given token. The token is kept optional just in case simpler uses of this API are interesting elsewhere in this project, although all image encoding really should use tokens to work with nested parallelism - forgetting it in the nested calls may lead to deadlocks as two running jobs would wait for each other with no forward progress.

Note that all operations on tokens are protected by the mutex so we don't need to use atomics. In fact, the use of atomics in the old code contained a bug that lead to a rare deadlock on shutdown, see https://cohost.org/zeux/post/520125-condition-variables.

This code has been extensively tested as part of gltfpack. Performance numbers on an example glTF model with 12 2K textures using ETC1S encoding on AMD Ryzen 5900X (12-core/24-thread system):

basis_compress one image at a time, no threading: 44s
basis_compress one image at a time: 12.5s
basis_parallel_compress before this change: 9.0s
basis_parallel_compress after this change: 4.2s

(this change was tested on a variety of models and should never degrade performance and almost always result in an improvement, the above is just an example - notably, even though all textures there are the same size, they don't take the same amount of time to compress, so even with 12 serially compressed textures which is what basis_parallel_compress did by default, we have a very significant improvement in overall time with the new code)

BinomialLLC / basis_universal

Rewrite job pool to support nested parallelism for improved scalability #339