imread() parallelization issues

enricotagliavini commented 1 year ago

Hello,

I'm one of the IT experts working for FMI (https://www.fmi.ch/). Some of our researchers have started to overload our compute resources through over-parallelization, and we have tracked the cause down to the imread() function of the tifffile library. Typically, users write their code with tifffile and load and process multiple images in parallel using various frameworks. Relatively recently (last year?) tifffile got an update that implemented parallel decoding in imread(), enabled by default and using half of the CPU cores. Our machines have 100 to 200 CPU cores, and a typical program processes 32 or more TIFF files in parallel. On a 200-core machine, 32 simultaneous reads can therefore create up to 3200 active threads (32 × 100), completely overloading the machine and causing a massive, unintended slowdown.
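As a rough illustration of the arithmetic above, the following sketch computes the worst-case thread count, assuming each read uses half of the CPU cores as described. The function name is hypothetical and is not part of tifffile.

```python
def worst_case_threads(cpu_cores: int, files_in_parallel: int) -> int:
    """Threads alive at once if every parallel read uses half of the CPU cores."""
    threads_per_read = cpu_cores // 2
    return threads_per_read * files_in_parallel


# The two machine sizes mentioned in the report, with 32 files read in parallel.
for cores in (100, 200):
    print(cores, "cores:", worst_case_threads(cores, 32), "threads")
```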

The new imread() seems to behave differently than before, forcing pretty much every tifffile user either to change their code and pass the maxworkers option to imread(), or to stay on an old version of tifffile that predates the imread() parallelization.
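For reference, the code change in question could look roughly like the sketch below. It assumes the outer parallelism comes from a thread pool and that imread() accepts the maxworkers option mentioned above. The helper names and file names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def read_all(paths, reader, jobs=32):
    """Read every path with `reader`, running at most `jobs` reads at once."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(reader, paths))


def single_threaded_imread():
    """Return tifffile.imread with its internal decoding limited to one thread."""
    import tifffile  # imported lazily so read_all() stays usable without it

    return partial(tifffile.imread, maxworkers=1)


# Usage (requires tifffile and real files; the file names are placeholders):
# images = read_all(["a.tif", "b.tif"], single_threaded_imread(), jobs=32)
```

With this arrangement the total thread count is bounded by `jobs`, because each read decodes on a single thread.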

I'm writing to kindly ask you to revert the default behavior to the original one, where nothing is parallelized by default, to avoid the pain of changing every single line of code that uses imread(). It's good to have the parallelization option, as it especially helps with compressed images, but it should be opt-in rather than opt-out. The efficiency of the parallelization depends on the image type, and the number of CPU cores to use needs tuning on a case-by-case basis.

Most people will have a very hard time tracking this issue down. This problem is not trivial to spot, and it can take a significant effort to mitigate.

Thank you. Kind regards.

cgohlke commented 1 year ago

Thank you for reporting the issue. I am aware of it, but have not decided whether to change the default behavior.

I would expect that users writing code to process dozens of files in parallel are aware of potential threading issues, the libraries they use, file storage and access patterns, etc. Especially users of machines with hundreds of CPU cores.

For now, how about limiting the default maximum number of threads used by tifffile to the values of environment variables, e.g., TIFFFILE_MAXWORKERS and TIFFFILE_MAXIOWORKERS? That should allow you to fix the issue on your systems.

enricotagliavini commented 1 year ago

Hi, thank you for the quick answer.

> I would expect that users writing code to process dozens of files in parallel are aware of potential threading issues, the libraries they use, file storage and access patterns, etc. Especially users of machines with hundreds of CPU cores.

Unfortunately, that's not the case. Let me explain our scenario: we are a biomedical research institute, and the people using the computational resources and writing the code here are scientists, as in natural science, not computer science or anything related to IT. They have very limited experience with IT technologies and they don't know about or expect this kind of issue. They expect their code to be single-threaded as long as they don't call joblib's Parallel, concurrent.futures, multiprocessing, dask, etc. Yesterday, before filing the bug, the user who was with me while I was investigating her code literally asked me, in full surprise, "why does tifffile do this by default?" While not ideal, it's more realistic to assume that users are not always 100% aware of the features of your software or didn't read the entire documentation (which is quite long and time-consuming). The more experienced people will find the maxworkers setting anyway and start making use of it. You don't really lose users by making it single-threaded by default.

Moreover, this is a change in behavior compared to the previous version of the software, so, aware of the issue or not, this is still causing a rewrite of the code for every project using tifffile. Think, for example, of scikit-image, but there are many more.

> For now, how about limiting the default maximum number of threads used by tifffile to the values of environment variables, e.g., TIFFFILE_MAXWORKERS and TIFFFILE_MAXIOWORKERS? That should allow you to fix the issue on your systems.

That's what many linear algebra libraries do, and it's equally a nightmare when they parallelize by default, as users, once again, are not aware of this behavior. The linear algebra libraries caused even more pain because of this. In fact, I had to configure every single system to export the appropriate environment variables and disable the automatic parallelization for all of them.

Overall, if you implement a setting based on environment variables, I'll do my best to populate every single machine we have with the appropriate export, but I'm not the only admin here and I know some of my colleagues are against such a practice. It can also easily be circumvented if the user accidentally resets the environment before starting the application, or when starting system services, which start with an empty environment and do not load the system defaults. Ultimately, I consider such a solution a workaround rather than the ideal fix.
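The fragility described above can be reproduced with a small sketch. It uses TIFFFILE_MAXWORKERS, the name proposed earlier in this thread, purely as a placeholder variable; nothing here depends on tifffile actually reading it. The child process only reports what it finds in its environment.

```python
import os
import subprocess
import sys

# Child program: print the variable's value, or "unset" if it is missing.
CHILD = "import os; print(os.environ.get('TIFFFILE_MAXWORKERS', 'unset'))"


def child_sees(env):
    """Run a child Python process with `env` and return what it reports."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Environment as a login shell with the admin's export would provide it.
configured = dict(os.environ, TIFFFILE_MAXWORKERS="1")

# Environment after a reset: the export is gone, everything else is kept.
stripped = {k: v for k, v in os.environ.items() if k != "TIFFFILE_MAXWORKERS"}

print(child_sees(configured))
print(child_sees(stripped))
```

The second child never sees the limit, which is what happens to a service or a job launched from a cleaned environment.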

In any case, thank you for considering this and for any work you'll do to improve the situation.

Kind regards.

cgohlke commented 1 year ago

> Moreover this is a change in behavior compared to the previous version of the software

This is a feature introduced and tuned over several years. The default was and is that tifffile uses up to half the CPU cores if it seems fit. Probably the change you are seeing is the result of tuning. Parallel writing was added more recently.

enricotagliavini commented 1 year ago

Understood, thank you for explaining. I didn't investigate the details, but we use imread() from tifffile on a very wide scale and never noticed this behavior until relatively recently, and the affected cases appear to be those with a more recent version of tifffile. That's why I'm calling it a change in behavior. Maybe we never hit the condition that triggers it until now, but we have been doing this kind of large-scale image analysis for years with images of this size.

We didn't notice any issue with the writes so far.

I should also add one more thought on the environment variables topic. If you decide to implement it, maybe keep the variable names similar to what other projects already use, rather than going for your own naming convention. The only case I know of where environment variables can be used to set the number of threads is the algebra libraries, see for example

All of the main algebra library implementations use variables of the form XXXX_NUM_THREADS.

If it would help with tuning the decision whether to parallelize or not, I can ask some of our users to share details about the use cases where parallelization created problems. Would this be useful? If so, what kind of information would you want?

Thank you. Kind regards.

cgohlke commented 1 year ago

> The only case I know of where environment variables can be used to set the number of threads is the algebra libraries

Also OpenMP, Numba, Numexpr, BLOSC, ITK, Zstandard, libvips. Joblib sets some of those environment variables.

I'll add support for TIFFFILE_NUM_THREADS to the next version of tifffile.
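A minimal sketch of how such variables are typically set from within a program follows. The OpenMP, OpenBLAS, MKL and NumExpr variable names are the commonly documented ones, and TIFFFILE_NUM_THREADS is the name announced above. Exactly when tifffile reads the variable is an assumption here, so the sketch sets it before any library import. The helper name is hypothetical.

```python
import os

# Thread-count variables following the XXXX_NUM_THREADS convention.
THREAD_VARS = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "NUMEXPR_NUM_THREADS",
    "TIFFFILE_NUM_THREADS",
)


def limit_library_threads(n=1, environ=os.environ):
    """Set each variable to n unless a value has already been chosen."""
    for name in THREAD_VARS:
        environ.setdefault(name, str(n))
    return {name: environ[name] for name in THREAD_VARS}


limit_library_threads(1)
# Import numpy, tifffile, etc. only after this point, since libraries
# may read these variables once at import or first use (assumed).
```

Using `setdefault` keeps any value the user or administrator exported explicitly.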

enricotagliavini commented 1 year ago

Thank you. Still, it would be nice if the automatic parallelization did not cause issues or require user intervention. If we can be of any help with this, feel free to get in touch.

cgohlke commented 1 year ago

I might reconsider disabling threading by default later, but for different reasons. I don't think tifffile is responsible for preventing thread oversubscription issues when run in a joblib or another parallel processing context.

enricotagliavini commented 1 year ago

If every library ever written parallelized automatically, it would simply be impossible to get any work done. Software works because parallelization is always under the control of the final application developer or the end user, so that oversubscription is avoided.

There is unfortunately no API to coordinate this at the system level, so that applications could avoid oversubscription and limit their own parallelization in coordination with other applications.
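In the absence of such an API, the application developer has to split the cores by hand. A minimal sketch, assuming the outer task count is known in advance (the function name is hypothetical):

```python
import os


def threads_per_task(parallel_tasks, total_cores=None):
    """Split the available cores evenly across tasks, never below one thread."""
    if total_cores is None:
        total_cores = os.cpu_count() or 1
    return max(1, total_cores // parallel_tasks)


print(threads_per_task(32, total_cores=128))  # 4
print(threads_per_task(32, total_cores=16))   # 1
```

The result could then be passed as the per-call thread limit, for example as the maxworkers option discussed earlier in this thread.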

This is the same as with RAM. Your code (ideally) doesn't use more than it needs, because otherwise you might bring down the system. It's the same idea.

The lack of awareness of what the rest of the system is doing is why automatic parallelization is, unfortunately, a bad idea. It would be amazing to have automatic parallelization and less work for the users. Alas...

Anyway, thank you.

cgohlke/tifffile, issue #215: imread() parallelization issues