Multithreaded / Multiprocessing support

RhetTbull commented 1 year ago

@RhetTbull, I have reviewed the branch and, if I understand correctly, the core ideas of the changes are the following:

A thread pool is used to process each photo separately. The list of photos is prepared upfront, as in the standard version.
ExifTool is run in a pool of subprocesses, and the pool acts sort of like a global object. I'm not sure what the threading lock in ExifTool is for: its instances are used as context managers, so they are not shared between threads, and the lock inside is acquired only once.

In general, a thread pool could be expected to improve performance. In practice it can do the opposite, for two reasons: 1) coordinating the execution of many small tasks can eat up the gains, and 2) Python has the GIL, so its threads do not run Python code in parallel the way one might expect. So, if the tasks are not purely IO-bound, results can be worse than with a single thread. It's hard to estimate whether threads can help here, as the code does many things beyond just copying.
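
To make the GIL point concrete, here is a minimal sketch that is not taken from the osxphotos branch. `cpu_bound`, `io_bound`, and `timed` are hypothetical names made up for illustration, and the workloads are stand-ins for real export work. On a standard CPython build with the GIL, the CPU-bound case would typically not get faster with more threads, while the sleep-based IO-bound case would; actual timings depend on the machine.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def cpu_bound(n: int) -> int:
    """Pure-Python arithmetic; the thread holds the GIL while it computes."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def io_bound(delay: float) -> float:
    """Stand-in for a disk or network wait; time.sleep releases the GIL."""
    time.sleep(delay)
    return delay


def timed(func, args, workers: int):
    """Run func over args in a thread pool; return (results, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(func, args))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    for name, func, args in [
        ("cpu-bound", cpu_bound, [200_000] * 8),
        ("io-bound", io_bound, [0.05] * 8),
    ]:
        _, one = timed(func, args, workers=1)
        _, four = timed(func, args, workers=4)
        print(f"{name}: 1 thread {one:.3f}s, 4 threads {four:.3f}s")
```

An export mixes both kinds of work (file copies, template rendering, metadata handling), which is why a measurement on the real code is more telling than a toy like this.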

As for sharing the PhotosDB between processes, if I understand correctly how export works, there may be no need for that. The data in that object is a real treasure, and the slow startup it causes is not that bad; still, there is always room for improvement. From the code, it seems that the object converts the library's database into a list of PhotoInfo objects. I would consider serializing those objects into several independent files and then running export jobs in several processes, feeding one file to each process as its task queue and avoiding communication between processes as much as possible.
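
The sharding idea could look roughly like the sketch below. It assumes the photo records have already been reduced to plain dictionaries; the field names, `write_shards`, `export_shard`, `run_parallel`, and `demo` are all hypothetical and not part of osxphotos. Each worker reads only its own file, so no state is shared between processes.

```python
import json
import tempfile
from multiprocessing import Pool
from pathlib import Path


def write_shards(records: list[dict], shard_count: int, out_dir: Path) -> list[Path]:
    """Split records round-robin into shard_count JSON files; return their paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index in range(shard_count):
        path = out_dir / f"shard_{index}.json"
        path.write_text(json.dumps(records[index::shard_count]))
        paths.append(path)
    return paths


def export_shard(shard_path: str) -> int:
    """Worker entry point: load one shard and process its records."""
    records = json.loads(Path(shard_path).read_text())
    for record in records:
        pass  # the real copy / metadata work for one photo would go here
    return len(records)


def run_parallel(paths: list[Path], processes: int) -> list[int]:
    """Hand one shard path to each worker process; only paths cross the boundary."""
    with Pool(processes) as pool:
        return pool.map(export_shard, [str(p) for p in paths])


def demo(record_count: int = 10, shard_count: int = 3) -> list[int]:
    """Single-process walk-through: write the shards, then process each in turn."""
    records = [
        {"uuid": f"photo-{i}", "filename": f"IMG_{i:04d}.jpg"}
        for i in range(record_count)
    ]
    with tempfile.TemporaryDirectory() as tmp:
        paths = write_shards(records, shard_count, Path(tmp))
        return [export_shard(str(p)) for p in paths]
```

`run_parallel` would be called from a script's `if __name__ == "__main__":` block, which `multiprocessing` needs on platforms that start workers with spawn (the default on macOS). Round-robin splitting is only one choice; splitting by file size might balance the load better.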

Also, as export is a disk-heavy operation, parallelization might bring no gains on certain types of devices, like HDDs, which handle parallel access poorly.

It would be nice to hear your thoughts, if possible. And maybe it would make sense to move this conversation to a separate ticket, as it has gotten a bit off-topic.

Originally posted by @oblalex in https://github.com/RhetTbull/osxphotos/issues/625#issuecomment-1542696021

RhetTbull commented 1 year ago

@oblalex You summarized the changes well. My initial performance tests showed that multithreaded export did not offer much advantage (and in some cases was slower) -- likely for the reasons you listed. I think you're right about the thread lock in the exiftool runner -- it's probably not needed. The exiftool code still needs work. The way exiftool works today in osxphotos is that the ExifTool class creates a singleton exiftool process that stays alive for as long as osxphotos is running. This avoids the startup cost associated with launching a subprocess (and one that's written in perl and thus needs the perl executable to launch). I think in any multithreaded/multiprocess implementation we'd need a queue of these runners to use so each thread isn't launching exiftool itself. exiftool is a big part of the workflow for many users and it's also one of the places where gains could be made by parallelizing operations.
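
A queue of runners could be sketched as below. This is an illustration only: `FakeRunner` is a hypothetical in-memory stand-in for a long-lived exiftool process and does not reflect the actual ExifTool class in osxphotos, and `RunnerPool` and `export_all` are made-up names. The point is that a fixed number of runners is created once and borrowed by threads, so no thread launches exiftool itself.

```python
import queue
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


class FakeRunner:
    """Hypothetical stand-in for one long-lived exiftool process."""

    def __init__(self, runner_id: int):
        self.runner_id = runner_id
        self.calls = 0  # only touched by the thread currently holding the runner

    def write_metadata(self, path: str) -> str:
        self.calls += 1
        return f"runner {self.runner_id} handled {path}"


class RunnerPool:
    """Fixed set of runners handed out through a thread-safe queue."""

    def __init__(self, size: int):
        self._idle = queue.Queue()
        for runner_id in range(size):
            self._idle.put(FakeRunner(runner_id))

    @contextmanager
    def runner(self):
        runner = self._idle.get()  # blocks until a runner is free
        try:
            yield runner
        finally:
            self._idle.put(runner)  # always hand it back, even on error


def export_all(paths, pool: RunnerPool, workers: int):
    """Process every path in a thread pool, borrowing a runner for each job."""

    def job(path):
        with pool.runner() as runner:
            return runner.write_metadata(path)

    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(job, paths))
```

Because a runner is held by exactly one thread at a time, the runner itself would not need an internal lock in this arrangement. In a multiprocess design, each worker process would more likely own a single runner of its own.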

After disappointing results with multithreading, I was thinking of next trying multiprocessing. This will take more work for a couple of reasons:

1. sqlite connections cannot be shared between processes. This primarily affects the export database (sqlite), which osxphotos uses to maintain the state of the export and to know which photos to update. Each process would need its own connection to the export database (ExportDB class) or another means of accessing state.
2. The PhotoInfo objects all hold a reference to the parent PhotosDB object, which cannot be pickled to be shared among processes. One reason is that the PhotosDB maintains a connection to the Photos database, because a handful of properties (that might be needed during export) are accessed dynamically via SQL queries. In the multiprocessing branch, I get around this by creating a "proxy" object for the PhotoInfo class: PhotoInfo is serialized to JSON and deserialized into a new data class that has all the needed properties but no external references. Using something like orjson (see #1060) might speed this up. These proxy objects can be passed around easily between processes.
3. The ExportDB makes the code easy to use/understand in the single-threaded version because all state is preserved in one place and you can start and stop an export without losing state. osxphotos also provides a number of tools (see osxphotos exportdb) for working directly with the export database to generate reports after the export, etc. However, for multiprocessing this will be a bottleneck (SQLite does support access from multiple processes, but the database would need to be locked for each transaction). One possible solution might be to do a "first pass" through the export and generate all the work to be done up front: whether or not a file should be exported, what the export name should be, etc. The code could use the export database for this but then save the results as a "task list" to be handed out to multiple processes. The processes thus wouldn't need access to either the Photos database or the export database, as the data would be pregenerated. When a process finished, its state could be written by the main thread to the export database. This would result in slower startup time but possibly faster overall execution. A possible negative is that any changes made externally to the export destination would not be captured, so it would be very important to warn users not to touch the export destination until the export was complete. Today, if a user added a file or changed a name in the export destination during export, this would be noticed and the output names for exported files adjusted to avoid name conflicts. This is admittedly an edge case. The "first pass" would require restructuring the export code (which is fairly complex) because some of the "decisions" on what to do during export happen inside the export code. The main CLI script says "export this photo", then the PhotoExporter looks at the photo to determine things like: does it have an associated Live video? Has the metadata changed, so that exiftool needs to rewrite it? This would need to be completely rewritten to separate the code that actually copies the photo and runs exiftool or user-supplied functions from the code that determines what to export.
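
The "first pass" and proxy ideas could be combined roughly as in the sketch below. Everything here is hypothetical: `ExportTask`, `plan_tasks`, `run_task`, `record_results`, the JSON field names, the `export_state` table, and the "(1)" collision suffix are made up for illustration and do not mirror the real ExportDB schema or PhotoExporter logic. The planner decides names and actions up front, workers receive plain data with no database references, and a single writer records the results.

```python
import json
import os
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class ExportTask:
    """Self-contained unit of work: plain fields only, no database references."""
    uuid: str
    source: str
    dest_name: str
    write_metadata: bool


def plan_tasks(photos_json: str, already_exported: set) -> list:
    """First pass: decide what to export and under which name, before any copying."""
    tasks, used_names = [], set()
    for photo in json.loads(photos_json):
        if photo["uuid"] in already_exported:
            continue  # the export database says this one is up to date
        name = photo["filename"]
        stem, ext = os.path.splitext(name)
        counter = 1
        while name in used_names:  # resolve name collisions during planning
            name = f"{stem} ({counter}){ext}"
            counter += 1
        used_names.add(name)
        tasks.append(
            ExportTask(photo["uuid"], photo["path"], name, photo["has_keywords"])
        )
    return tasks


def run_task(task: ExportTask) -> dict:
    """Worker side: would copy the file and run exiftool; here it only reports."""
    return {"uuid": task.uuid, "dest_name": task.dest_name, "status": "exported"}


def record_results(conn: sqlite3.Connection, results: list) -> None:
    """Single writer: only the coordinating side touches the state database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS export_state "
        "(uuid TEXT PRIMARY KEY, dest_name TEXT, status TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO export_state VALUES (:uuid, :dest_name, :status)",
        results,
    )
    conn.commit()
```

Because `ExportTask` holds only strings and a boolean and is defined at module level, it can be pickled and sent to worker processes. The trade-off described above still applies: names are fixed at planning time, so files added to the destination during the export would not be noticed.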

I'm happy to explore this as time allows because it's an interesting engineering problem, but it's low priority for me because it's not a use case that I need. I run osxphotos in the background to create backups and it doesn't really matter to me how long these take. I have spent a lot of time ensuring the export code is correct, and thus I dread a little opening it up to do all the refactoring needed for item 3 above.

Multithreaded / Multiprocessing support #1069