0xCCF4 / PhotoSort

A tool to rename/move/copy/hardlink/symlink and sort photos and videos by their EXIF date.
GNU General Public License v3.0

[Feature request] Improving throughput with concurrency #47

Open · roykrikke opened 5 days ago

roykrikke commented 5 days ago

Below is a copy of a feature description from the Phockup project. On a multicore system, this feature can significantly speed up the processing of large numbers of photos or videos. A similar concept could potentially be implemented in PhotoSort.

I’m not suggesting it be implemented in exactly the same way, but rather sharing the idea to illustrate the potential benefits.


Improving throughput with concurrency

If you want to allocate additional CPUs/cores to the image processing operations, you can specify additional resources via the --max-concurrency flag. Specifying --max-concurrency=n, where n represents the maximum number of operations to attempt concurrently, will leverage the additional CPU resources to start additional file operations while waiting for file I/O. This can lead to significant increases in file processing throughput.

Due to how concurrency is implemented in Phockup (specifically ThreadPoolExecutor), this option has the greatest impact on directories with a large number of files in them, versus many directories with small numbers of files in each. As a general rule, the concurrency should not be set higher than the core count of the system processing the images.

--max-concurrency=1 gives the default behavior of no concurrency while processing the files in the directories. Beginning with 50% of the available cores is a good start. Larger values can have diminishing returns as the number of concurrent operations saturates the file I/O of the system.

Concurrently processing files does have an impact on the order in which messages are written to the console/log and on the ability to quickly terminate the program, as the execution waits for all in-flight operations to complete before shutting down.
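For illustration, here is a minimal Python sketch of the pattern described above, assuming a per-directory scope and a hypothetical process_file helper; it is not Phockup's or PhotoSort's actual code, just the general ThreadPoolExecutor idea:

```python
import os
from concurrent.futures import ThreadPoolExecutor


def process_file(path):
    # Placeholder for the real per-file work:
    # read the EXIF date, compute the target name, then rename/move/copy.
    pass


def process_directory(directory, max_concurrency=None):
    # Default to roughly half of the available cores, as suggested above.
    if max_concurrency is None:
        max_concurrency = max(1, (os.cpu_count() or 2) // 2)

    files = [entry.path for entry in os.scandir(directory) if entry.is_file()]

    if max_concurrency == 1:
        # --max-concurrency=1: plain sequential processing, the default.
        for path in files:
            process_file(path)
        return

    # Threads overlap the waits on file I/O; tasks finish in whatever order
    # their I/O completes, which is why log output can appear out of order.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # Consuming the iterator surfaces any exceptions raised in the workers.
        list(pool.map(process_file, files))
```

Because the pool in this sketch is scoped to a single directory, directories containing many files benefit the most, which matches the behaviour described above.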

0xCCF4 commented 4 days ago

I thought of implementing multithreading but decided against it so far.

My reason was that the main actions photo_sort does are IO operations:

  • IO: list a directory
  • IO: read the filenames of all files within
  • IO: read image files and parse their EXIF information
  • compute a target filename
  • IO: rename/move/copy a file

So I believe that the bottleneck for execution speed will likely be the disk speed; hence multithreading would not provide significant speed improvements.

The only situation I could imagine that would benefit from multithreading would be:

  • a fast SSD with the source directory
  • multiple slower disks that are mounted within the target directory

In that situation multiple copy operations might be scheduled at the same time, since the bottleneck would be the IO speed of the slower drives.

What do you think about this?
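Regarding the second scenario (fast source, multiple slower target disks), here is a rough Python sketch of how the copies could be scheduled, assuming a hypothetical copy_grouped_by_device helper that is not part of photo_sort: it groups pending copies by the device their target lives on, so each slower disk gets its own worker while the fast source feeds all of them.

```python
import os
import shutil
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def copy_grouped_by_device(jobs):
    # jobs: list of (source, target) path pairs; the target directories
    # are assumed to exist already. Hypothetical helper, not photo_sort API.
    per_device = defaultdict(list)
    for src, dst in jobs:
        # Group by the device the target directory lives on.
        device = os.stat(os.path.dirname(dst) or ".").st_dev
        per_device[device].append((src, dst))

    def copy_all(pairs):
        # Sequential within one device: that disk's IO speed is the bottleneck anyway.
        for src, dst in pairs:
            shutil.copy2(src, dst)

    # One worker per target device; a fast source SSD can feed them concurrently.
    with ThreadPoolExecutor(max_workers=max(1, len(per_device))) as pool:
        list(pool.map(copy_all, per_device.values()))
```

With a single target disk this collapses to one worker, i.e. the current sequential behaviour, so the simple case would not get slower.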

roykrikke commented 4 days ago

> I thought of implementing multithreading but decided against it so far.
>
> My reason was that the main actions photo_sort does are IO operations:
>
> • IO: list a directory
> • IO: read the filenames of all files within
> • IO: read image files and parse their EXIF information
> • compute a target filename
> • IO: rename/move/copy a file
>
> So I believe that the bottleneck for execution speed will likely be the disk speed; hence multithreading would not provide significant speed improvements.
>
> The only situation I could imagine that would benefit from multithreading would be:
>
> • a fast SSD with the source directory
> • multiple slower disks that are mounted within the target directory
>
> In that situation multiple copy operations might be scheduled at the same time, since the bottleneck would be the IO speed of the slower drives.
>
> What do you think about this?

Of course I understand your feedback. I have a nice (yes, heavy overkill) 10 Gig network with a ZFS storage server at home, where I can easily read and write 600 MB/s because I have enough memory and SSD caching inside my ZFS server.

I think it could be a win, but I can also imagine that this request has a lower priority, or no priority, compared to the other feature requests I've made.