oPromessa opened 2 years ago
I've considered this in the past but the additional complexity has kept me from looking more deeply into this. osxphotos has been optimized for export to the local mac where you're likely to be I/O bound waiting on the disk. It does use copy-on-write to speed up copying, etc. I'm not familiar with Synology devices but assuming they cache writes so they can handle multiple simultaneous writes and your network is fast enough, the approach you propose might work.
The export code would need to be updated and this is no small feat (I'm currently in the middle of a big refactoring of the export code that will make it easier to change). Currently, the export code processes all actions serially, photo by photo:
The cleanup phase happens at the end after all exports are done.
It takes about 1.7 hours to process the --update phase
This is surprising given you only exported 70 updated photos. Doing these 70 copies in parallel is unlikely to make much of a dent in the update time. My guess is the slow part is checking the file stat of the 30000 images across the LAN & NAS. If this is the slow part, then throwing more processes at it may not help much if the NAS can't keep up.
Before tackling a multithreading project I'd suggest doing some profiling to see where the slowdowns are actually occurring.
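For reference, the kind of profiling meant here takes only a few lines with Python's built-in cProfile; a minimal sketch (export_photos is just a stand-in for the real export call):

```python
import cProfile
import io
import pstats

def export_photos():
    # stand-in for the real export loop being profiled
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
export_photos()
profiler.disable()

# sort by cumulative time to find the hot call paths
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
print(stream.getvalue())
```

Sorting by "cumulative" shows which call paths dominate; sorting by "tottime" shows which individual functions do.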
I'll take a look at your project to see what you did.
Do you have space on your local disk to do an export or a partial export? It would be interesting to compare that to the same export on the NAS.
Sorry!!!! Rookie mistake: I believe the Mac went to sleep halfway through the process. I'll gather more stats with the caffeinate option and come back to you!
A few more thoughts:
The main obstacle to effective multiprocessing in the current design is that the target filename is not fully known until the actual export happens. The CLI code renders the --directory and --filename templates to compute the intended target, but then the export code needs to check whether an edited version will be exported, whether sidecars will be generated, and whether the target already exists; if so, is it in the export database and does it match the source photo, or is it a name collision? With multiple processes exporting, you'd run the risk of name collisions.
If the code to determine name collisions, etc. happened before calling export, you could deterministically know the target file, and then you could spawn multiple export processes. One downside with this approach is that you might have some "false" name collisions -- e.g. you thought there would be a name collision but one file was missing and didn't get exported, so instead of getting IMG_1234.jpg and IMG_1234 (1).jpg, you got IMG_1234 (1).jpg as the first was missing and didn't get exported. Some of this would be difficult to work around when using --download-missing but I think having some "false" collisions is better than real collisions.
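The "determine collisions before export" idea could be sketched roughly like this (plan_export_names is a hypothetical helper, not osxphotos code): resolve every target name up front against a shared set of claimed names, so parallel workers later receive fully determined paths.

```python
import pathlib

def plan_export_names(filenames, existing):
    """Assign each photo a collision-free target name before export.

    `existing` is the set of names already present in the destination;
    names claimed here are added so later photos see them too.
    """
    claimed = set(existing)
    plan = []
    for name in filenames:
        p = pathlib.PurePath(name)
        candidate, count = name, 0
        while candidate in claimed:
            count += 1
            candidate = f"{p.stem} ({count}){p.suffix}"
        claimed.add(candidate)
        plan.append(candidate)
    return plan

print(plan_export_names(["IMG_1234.jpg", "IMG_1234.jpg"], {"IMG_1234.jpg"}))
# -> ['IMG_1234 (1).jpg', 'IMG_1234 (2).jpg']
```

As noted above, this plan can go stale if an export fails (a "false" collision), but it gives each worker a deterministic target.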
One other thought -- the export database is implemented in sqlite which handles multiprocessing fine so that shouldn't be an impediment.
Sorry for the delay. Here's a test with 88 pics comparing local SSD vs NAS (100Mbps) vs NAS (1Gbps), using the --update option.
Run | Pics | SSD | NAS (100) | NAS (1G) | Sec/Pic SSD | Sec/Pic NAS (100) | Sec/Pic NAS (1G) |
---|---|---|---|---|---|---|---|
1st | 88 | 37.158 | 183.095 | 99.971 | 0.422 | 2.080 | 1.136 |
2nd | 88 | 17.630 | 19.160 | 19.797 | 0.200 | 0.218 | 0.224 |
--1st SSD
Processed: 88 photos, exported: 88, updated: 0, skipped: 0, updated EXIF data: 88, missing: 0, error: 0
Elapsed time: 37.158 seconds
--update SSD
Processed: 88 photos, exported: 0, updated: 0, skipped: 88, updated EXIF data: 0, missing: 0, error: 0
Elapsed time: 17.630 seconds
--1st NAS (100Mbps Network)
Processed: 88 photos, exported: 88, updated: 0, skipped: 0, updated EXIF data: 88, missing: 0, error: 0
Elapsed time: 183.095 seconds
--update NAS (100Mbps Network)
Processed: 88 photos, exported: 0, updated: 0, skipped: 88, updated EXIF data: 0, missing: 0, error: 0
Elapsed time: 19.160 seconds
--1st NAS (1Gbps Network)
Processed: 88 photos, exported: 88, updated: 0, skipped: 0, updated EXIF data: 88, missing: 0, error: 0
Elapsed time: 99.971 seconds
--update NAS (1Gbps Network)
Processed: 88 photos, exported: 0, updated: 0, skipped: 88, updated EXIF data: 0, missing: 0, error: 0
Elapsed time: 19.797 seconds
Thanks! This is useful. I've done some profiling on the export code and found it takes a lot of its time in the code that validates file names to avoid name collisions, so that might be worth a look before tackling something like multithreading. I can imagine that when going across a network all the remote file lookups would slow things down. I was also surprised to see the export db is getting opened/closed with each photo; I'd expected it to stay open, so I need to take a look there. Full profile results attached.
osxphotos export ~/Desktop/export --from-date 2021-01-01
Using last opened Photos library: /Users/rhet/Pictures/Photos Library.photoslibrary
Exporting 3853 photos to /Users/rhet/Desktop/export...
[####################################] 100%
Processed: 3853 photos, exported: 372, missing: 3504, error: 0
Elapsed time: 65.114 seconds
17361339 function calls (16993129 primitive calls) in 83.575 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
4/1 0.000 0.000 83.799 83.799 {built-in method builtins.exec}
1 0.000 0.000 83.798 83.798 <string>:1(<module>)
1 0.148 0.148 83.798 83.798 cli.py:1305(_export)
3853 0.289 0.000 64.202 0.017 cli.py:2634(export_photo)
3866 0.192 0.000 47.759 0.012 cli.py:3043(export_photo_to_directory)
3866 0.230 0.000 45.088 0.012 photoexporter.py:483(export2)
3866 0.043 0.000 40.263 0.010 photoexporter.py:748(_validate_dest_path)
3866 0.142 0.000 39.708 0.010 utils.py:384(increment_filename_with_count)
3866 0.104 0.000 38.827 0.010 utils.py:291(findfiles)
765118 19.986 0.000 32.955 0.000 utils.py:284(normalize_fs_path)
3866 0.671 0.000 32.654 0.008 utils.py:302(<listcomp>)
1 0.001 0.001 17.654 17.654 photosdb.py:92(__init__)
1 8.426 8.426 17.635 17.635 photosdb.py:1618(_process_database5)
7719 0.075 0.000 13.821 0.002 cli.py:3007(_render_suffix_template)
3866 0.041 0.000 13.723 0.004 photoinfo.py:1413(render_template)
3866 0.097 0.000 9.643 0.002 phototemplate.py:354(__init__)
3866 0.034 0.000 9.438 0.002 export_db.py:763(__init__)
765131 8.924 0.000 8.924 0.000 __init__.py:77(__enter__)
79968 8.517 0.000 8.517 0.000 {method 'execute' of 'sqlite3.Cursor' objects}
3866 0.075 0.000 8.321 0.002 export_db.py:773(_open_export_db)
3867 0.168 0.000 7.549 0.002 export_db.py:639(_create_db_tables)
And ordered by total time (time spent in the function itself)
Processed: 3853 photos, exported: 372, missing: 3504, error: 0
Elapsed time: 102.641 seconds
25614407 function calls (25246177 primitive calls) in 119.617 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
2172957 44.963 0.000 73.764 0.000 utils.py:284(normalize_fs_path)
2172970 19.521 0.000 19.521 0.000 __init__.py:77(__enter__)
1 8.110 8.110 16.243 16.243 photosdb.py:1618(_process_database5)
2172970 7.747 0.000 7.747 0.000 __init__.py:80(__exit__)
79949 6.504 0.000 6.504 0.000 {method 'execute' of 'sqlite3.Cursor' objects}
3870 4.281 0.001 4.281 0.001 {built-in method posix.listdir}
1 2.996 2.996 5.074 5.074 _photosdb_process_searchinfo.py:37(_process_searchinfo)
3866 1.587 0.000 74.578 0.019 utils.py:302(<listcomp>)
2181137 1.132 0.000 1.132 0.000 {method 'match' of 're.Pattern' objects}
2181625 0.951 0.000 0.952 0.000 {method 'decode' of 'bytes' objects}
3867 0.837 0.000 0.837 0.000 {built-in method posix.mkdir}
1058578 0.773 0.000 1.338 0.000 utils.py:371(normalize_unicode)
1 0.714 0.714 0.722 0.722 _photosdb_process_scoreinfo.py:40(_process_scoreinfo_5)
8105 0.678 0.000 0.678 0.000 {method 'commit' of 'sqlite3.Connection' objects}
3866 0.643 0.000 1.740 0.000 utils.py:303(<listcomp>)
17247 0.628 0.000 0.657 0.000 {built-in method posix.stat}
2172970 0.587 0.000 0.587 0.000 __init__.py:72(__init__)
3877 0.579 0.000 0.579 0.000 {method 'close' of 'sqlite3.Connection' objects}
1 0.525 0.525 0.562 0.562 _photosdb_process_faceinfo.py:173(_process_faceinfo_5)
2616628 0.517 0.000 0.517 0.000 {built-in method builtins.isinstance}
7496 0.472 0.000 0.565 0.000 photoinfo.py:1560(__repr__)
3879 0.437 0.000 0.437 0.000 {built-in method _sqlite3.connect}
2 0.399 0.199 0.399 0.199 {method 'fetchall' of 'sqlite3.Cursor' objects}
3866 0.396 0.000 0.396 0.000 {built-in method posix.rmdir}
11675 0.364 0.000 0.364 0.000 sre_parse.py:111(__init__)
40164 0.335 0.000 0.595 0.000 pathlib.py:64(parse_parts)
Interesting that for the export of 372 files, there were 765118 calls to normalize_fs_path! Definitely need to take a look at what's going on there.
I've added a hidden --profile option to the export command that enables profiling. It also has a hidden --profile-sort option to specify one or more sort keys for the profiled stats output. I'll include it in the next release.
I made some changes to the filename normalization to minimize calls to normalize_fs_path (which can't be optimized further as it's really only a single-line call to an Objective-C function). This has reduced the calls to normalize_fs_path for the test case above by more than an order of magnitude, from 765118 calls to 11610 calls!
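For anyone curious, the general pattern for minimizing repeated calls like this is memoization, since the same paths get normalized over and over. A sketch with functools.lru_cache (the real normalize_fs_path wraps an Objective-C call; here a unicodedata stand-in just illustrates the caching):

```python
import functools
import unicodedata

@functools.lru_cache(maxsize=None)
def normalize_fs_path(path: str) -> str:
    # macOS file systems store names in decomposed Unicode (NFD);
    # the real implementation calls an Objective-C API, this stand-in
    # uses unicodedata only to demonstrate the caching pattern
    return unicodedata.normalize("NFD", path)

for _ in range(1000):
    normalize_fs_path("Café.jpg")

info = normalize_fs_path.cache_info()
print(info.hits, info.misses)  # 999 1
```

With a cache in front, a thousand lookups of the same path cost one real normalization.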
The new --profile flag should make it much easier to start doing some optimization.
Profiling...
Exporting 3855 photos to /Users/rhet/Desktop/export...
Processed: 3855 photos, exported: 372, missing: 3506, error: 0
Elapsed time: 123.597 seconds
Profiling completed
14850806 function calls (14482537 primitive calls) in 149.191 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
80008 25.722 0.000 25.722 0.000 {method 'execute' of 'sqlite3.Cursor' objects}
3868 24.039 0.006 37.314 0.010 utils.py:306(list_directory)
1 9.479 9.479 23.764 23.764 photosdb.py:1618(_process_database5)
733207 8.455 0.000 9.180 0.000 _convenience.py:131(<lambda>)
1 4.946 4.946 7.770 7.770 _photosdb_process_searchinfo.py:37(_process_searchinfo)
3869 3.123 0.001 3.123 0.001 {built-in method posix.mkdir}
3868 2.783 0.001 11.963 0.003 utils.py:314(<listcomp>)
3879 2.328 0.001 2.328 0.001 {method 'close' of 'sqlite3.Connection' objects}
11610 2.283 0.000 4.076 0.000 utils.py:284(normalize_fs_path)
17255 2.193 0.000 2.351 0.000 {built-in method posix.stat}
3881 2.065 0.001 2.065 0.001 {built-in method _sqlite3.connect}
3869 1.944 0.001 1.944 0.001 {built-in method posix.rmdir}
I'll use it once available. I still have a few thousand pics to transfer so I'll have plenty of opportunity to test it.
@oPromessa I've added the --profile option to export in v0.44.11. There are a number of hidden options and commands I use for debugging. These can now be exposed by setting the environment variable OSXPHOTOS_SHOW_HIDDEN=1; for example, on the command line:
OSXPHOTOS_SHOW_HIDDEN=1 osxphotos help export
to see hidden options for export
or
OSXPHOTOS_SHOW_HIDDEN=1 osxphotos help
to see hidden commands
I found the cause of the thousands of unnecessary sqlite calls to the ExportDB code (see #603). I'll work on that later this week which should cause a bit of a speed up as thousands of sqlite database instances won't need to be needlessly recreated.
@oPromessa I fixed the unneeded sqlite calls. Try version v0.44.13 next time you export photos with the --profile flag. In my quick testing, most of the time is spent in the list_directory function used to list files for avoiding name conflicts. I can imagine that over a NAS this will be even slower. One possible fix would be to create a virtual filesystem cache (read the entire export directory once at startup) and compare names against the cache. This would only work if no other process was writing to the export directory at the same time but might give a big speed boost.
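The virtual filesystem cache described here might look something like this (a hypothetical DirectoryCache class, with the stated caveat that it's only valid if nothing else writes to the export tree during the export):

```python
import os

class DirectoryCache:
    """Cache directory listings so collision checks avoid repeated
    os.listdir() round trips (expensive over a NAS). Only safe when
    no other process writes to the export tree during the export."""

    def __init__(self):
        self._cache = {}

    def listdir(self, path):
        if path not in self._cache:
            self._cache[path] = set(os.listdir(path))
        return self._cache[path]

    def exists(self, path):
        parent, name = os.path.split(path)
        return name in self.listdir(parent or ".")

    def record_export(self, path):
        # keep the cache coherent with files we create ourselves
        parent, name = os.path.split(path)
        self.listdir(parent or ".").add(name)
```

Each directory is read from the NAS exactly once; every later existence check is an in-memory set lookup, with record_export keeping the cache in sync with the exporter's own writes.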
Profiling...
Profile sort_stats order: ('tottime',)
Exporting 3856 photos to /Users/rhet/Desktop/export...
Processed: 3856 photos, exported: 372, missing: 3507, error: 0
Elapsed time: 90.059 seconds
Profiling completed
15173615 function calls (14805239 primitive calls) in 111.012 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
3869 26.692 0.007 40.776 0.011 utils.py:307(list_directory)
1 10.551 10.551 20.583 20.583 photosdb.py:1618(_process_database5)
886150 8.688 0.000 9.427 0.000 _convenience.py:131(<lambda>)
1 3.972 3.972 6.486 6.486 _photosdb_process_searchinfo.py:37(_process_searchinfo)
3869 3.387 0.001 12.814 0.003 utils.py:315(<listcomp>)
3870 2.412 0.001 2.412 0.001 {built-in method posix.mkdir}
11616 2.263 0.000 3.665 0.000 utils.py:285(normalize_fs_path)
2629 1.741 0.001 1.741 0.001 {method 'execute' of 'sqlite3.Cursor' objects}
3869 1.728 0.000 52.752 0.014 utils.py:292(findfiles)
373 1.637 0.004 1.637 0.004 {method 'commit' of 'sqlite3.Connection' objects}
13390 1.587 0.000 1.659 0.000 {built-in method posix.stat}
15836 1.567 0.000 1.567 0.000 photoexporter.py:217(__init__)
15498 1.441 0.000 1.441 0.000 __init__.py:77(__enter__)
7771/3886 1.267 0.000 2.791 0.001 sre_parse.py:493(_parse)
3870 1.231 0.000 1.231 0.000 {built-in method posix.rmdir}
2621683 1.222 0.000 1.222 0.000 {built-in method builtins.isinstance}
899003 1.155 0.000 1.155 0.000 {method 'match' of 're.Pattern' objects}
Edit: I'm overthinking this. In most cases osxphotos doesn't need to know what's in a directory, only whether a file exists, so I think this can be sped up by just using os.path.exists()
I will run it later tonight with the main 65K photos library. But meanwhile I did a test with 88 pics:
I ran a full initial export with --profile and tottime for sorting.
I attach two files of two runs with the same config but:
Edit: I'm overthinking this. In most cases osxphotos doesn't need to know what's in a directory, only whether a file exists, so I think this can be sped up by just using os.path.exists()
I tried this and immediately realized why I implemented the collision code the way I originally did. What osxphotos currently does is increment the filename if any file stem matches. E.g. if the image being exported is named "IMG_1234.JPG" and there's already an "IMG_1234.PNG" in the destination folder, osxphotos will export the image as "IMG_1234 (1).JPG". This is done to avoid collisions with sidecar files, which in this case might both be named "IMG_1234.xml", for example.
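The stem-matching behavior described here (increment when any existing file shares the stem, regardless of extension) can be sketched as:

```python
import pathlib

def increment_if_stem_taken(filename, existing_names):
    """Return filename, incremented with ' (n)' if any existing file
    shares its stem -- matching the behavior described above where
    IMG_1234.PNG counts as a collision for IMG_1234.JPG so sidecars
    like IMG_1234.xml never clash."""
    stems = {pathlib.PurePath(n).stem for n in existing_names}
    p = pathlib.PurePath(filename)
    candidate, count = filename, 0
    while pathlib.PurePath(candidate).stem in stems:
        count += 1
        candidate = f"{p.stem} ({count}){p.suffix}"
    return candidate

print(increment_if_stem_taken("IMG_1234.JPG", ["IMG_1234.PNG"]))
# -> IMG_1234 (1).JPG
```

This is why a bare os.path.exists() check isn't sufficient: the exact filename may be free while its stem is already taken.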
Looking at your logs, the majority of the time is spent in {method 'readline' of '_io.BufferedReader' objects}, which is called by the exiftool code, as this is how osxphotos communicates with the exiftool subprocess via a pipe. The NAS is 4X slower (not surprising). One potential speedup is to delay copying the file to the NAS until after exiftool processes it, but that makes the export code more complicated.
ncalls tottime percall cumtime percall filename:lineno(function)
ssd: 43548 34.903 0.001 34.903 0.001 {method 'readline' of '_io.BufferedReader' objects}
nas: 43548 127.876 0.003 127.876 0.003 {method 'readline' of '_io.BufferedReader' objects}
I believe it confirms your comments. Attached is profile.69k.log, a run of 69K pics with 1.8K exported pics.
Thanks for the profile data. I think I can cut this time (hopefully significantly) by processing the exiftool updates on the local machine then copying the file to the NAS. I've just finished a series of refactoring that will allow me to insert this into the code fairly easily and have the basic design for this change done.
I've also done some thinking on a basic design for multiprocessing but will need to do some research and testing about sharing data between processes. Currently, the entire Photos database gets processed and stored in a PhotosDB instance. In its current design, this object is not compatible with pickle, which means it can't be shared between processes. It's also a large object, so I'm not sure you'd want to pass it around between processes. But the export code all depends on access to PhotoInfo objects, which in turn access the PhotosDB object.
I was wondering about this flow:
...if the exiftool processing could be done in-memory (of course it may be an issue for large multi-gig videos -- but hey, it would be up to the OS to do some swapping to disk itself) in Step 2.
I read a bit about exiftool programming options but I confess I'm no expert ;)
Your first flow is basically what happens today.
I'm almost done with a new flow that
This will avoid sending any file to the export location until it's "done" and avoid any unnecessary reads or writes from/to the export location.
if the exiftool processing could be done in-memory
It can't. The approach I implemented creates a temporary copy of the file, but assuming you're on APFS file system (Catalina+) and the Photos library is on the main disk (the most likely use case), then osxphotos takes advantage of APFS copy-on-write so the temporary file actually takes no time and the writing to disk doesn't actually happen until the conversion or exiftool update. I thought about using a RAM disk (old school style) but it's almost always better to let the OS manage memory and with APFS copy-on-write, it's just as fast.
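The staged-copy flow described above might look like this in outline (copy_cow and export_with_metadata are hypothetical names; the cp -c clone only works on APFS, and the sketch falls back to a plain copy elsewhere):

```python
import os
import shutil
import subprocess
import tempfile

def copy_cow(src, dst):
    """Copy with APFS copy-on-write where available (macOS `cp -c`),
    falling back to a regular copy on other systems/filesystems."""
    try:
        subprocess.run(["cp", "-c", src, dst], check=True, capture_output=True)
    except (subprocess.CalledProcessError, FileNotFoundError):
        shutil.copyfile(src, dst)

def export_with_metadata(src, dest, update_metadata):
    """Clone src to a temp file, apply the metadata update there
    (stand-in for the exiftool step), then copy the finished file to
    the (possibly remote) destination in a single transfer."""
    with tempfile.TemporaryDirectory() as tmpdir:
        staged = os.path.join(tmpdir, os.path.basename(src))
        copy_cow(src, staged)
        update_metadata(staged)       # exiftool would run here
        shutil.copyfile(staged, dest)
```

On APFS the clone costs essentially nothing, so the NAS only ever sees one write of the fully finished file.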
Does exiftool allow one to make this operation in one go
It does, and this would offer a bit of a speedup as the updates wouldn't have to be written to local disk then copied to the NAS, but then it would be hard to do things like --retry or catch errors. It would also mean that in some cases the file is copied by osxphotos and in others copied by exiftool, which makes the code more complex.
Once I finish this next round of refactoring on the exiftool bits, I have an idea for parallel processing. I think true multiprocessing will be hard due to the complex nature of the data structures that would need to be passed around the code to make that work. However, I think that multiple copies of osxphotos could run in parallel, all using the same export database. I've got an idea for an "export runner" mode that would collect the UUIDs of all the photos that need to be exported, parcel those out to multiple osxphotos export processes, and spawn those as separate processes. There would be some memory overhead from reading the database into multiple processes, but you'd get a speedup from running in parallel. There are some complexities with --report, etc., but those are probably workable.
Initial results from new version 0.45.0. This version doesn't copy the photo to the destination until after exiftool (and jpeg conversion) is complete. The following results are for exporting 100 photos to a network drive. With --exiftool the improvement is about 66% (exactly what would be expected from cutting out 2/3 of the transfer over the network). With both --exiftool and --convert-to-jpeg, the speedup is about 50% (I'm sure one limiting factor is my old MacBook).
osxphotos version 0.44.13
exporting 100 photos with --exiftool
time osxphotos export /Volumes/rhet/Public/export --uuid-from-file ~/Dropbox/Code/osxphotos/not_missing_uuid.txt --exiftool --exportdb /private/tmp/osxphotos.db
Using last opened Photos library: /Users/rhet/Pictures/Photos Library.photoslibrary
Warning: export database '/private/tmp/osxphotos.db' is in a different directory than export destination '/Volumes/rhet/Public/export'
Exporting 100 photos to /Volumes/rhet/Public/export...
[####################################] 100%
Processed: 100 photos, exported: 115, missing: 0, error: 1
Elapsed time: 372.266 seconds
export 100 photos with --exiftool and --convert-to-jpeg
time osxphotos export /Volumes/rhet/Public/export --uuid-from-file ~/Dropbox/Code/osxphotos/not_missing_uuid.txt --exiftool --exportdb /private/tmp/osxphotos.db --convert-to-jpeg
Using last opened Photos library: /Users/rhet/Pictures/Photos Library.photoslibrary
Warning: export database '/private/tmp/osxphotos.db' is in a different directory than export destination '/Volumes/rhet/Public/export'
Exporting 100 photos to /Volumes/rhet/Public/export...
[####################################] 100%
Processed: 100 photos, exported: 115, missing: 0, error: 0
Elapsed time: 527.868 seconds
osxphotos version 0.45.0
export 100 photos with --exiftool
time python -m osxphotos export /Volumes/rhet/Public/export --uuid-from-file ~/Dropbox/Code/osxphotos/not_missing_uuid.txt --exiftool --exportdb /private/tmp/osxphotos.db
Using last opened Photos library: /Users/rhet/Pictures/Photos Library.photoslibrary
Warning: export database '/private/tmp/osxphotos.db' is in a different directory than export destination '/Volumes/rhet/Public/export'
Exporting 100 photos to /Volumes/rhet/Public/export...
[####################################] 100%
Processed: 100 photos, exported: 115, missing: 0, error: 1
Elapsed time: 137.268 seconds
export 100 photos with --exiftool and --convert-to-jpeg
time python -m osxphotos export /Volumes/rhet/Public/export --uuid-from-file ~/Dropbox/Code/osxphotos/not_missing_uuid.txt --exiftool --exportdb /private/tmp/osxphotos.db --convert-to-jpeg
Using last opened Photos library: /Users/rhet/Pictures/Photos Library.photoslibrary
Warning: export database '/private/tmp/osxphotos.db' is in a different directory than export destination '/Volumes/rhet/Public/export'
Exporting 100 photos to /Volumes/rhet/Public/export...
[####################################] 100%
Processed: 100 photos, exported: 115, missing: 0, error: 0
Elapsed time: 247.882 seconds
I tried running two osxphotos exports at the same time, splitting the UUIDs between them. For the --exiftool scenario above, this reduced the run time from 137 seconds to 102 seconds, so this may be a viable solution to get some improvement. osxphotos could have a mode that opens a pipe to each process to monitor overall progress (and perhaps a special option for export that outputs progress in JSON format for easy consumption by the monitoring process).
Exporting 50 photos to /Volumes/rhet/Public/export...
Processed: 50 photos, exported: 57, missing: 0, error: 1
Elapsed time: 102.416 seconds
Exporting 50 photos to /Volumes/rhet/Public/export...
Processed: 50 photos, exported: 58, missing: 0, error: 0
Elapsed time: 101.908 seconds
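The monitoring idea (workers emitting progress as JSON lines over a pipe) might be sketched like this; the worker here is a trivial stand-in for an osxphotos export process, and a real monitor would read the pipes concurrently (e.g. with selectors) rather than one after the other:

```python
import json
import subprocess
import sys

# Hypothetical worker: prints one JSON line per "exported" photo.
WORKER = r"""
import json, sys
for i in range(int(sys.argv[1])):
    print(json.dumps({"exported": i + 1}), flush=True)
"""

# spawn two workers, each handling 3 photos
procs = [
    subprocess.Popen(
        [sys.executable, "-c", WORKER, "3"],
        stdout=subprocess.PIPE,
        text=True,
    )
    for _ in range(2)
]

# aggregate progress events from all workers (sequentially, for brevity)
total_exported = 0
for proc in procs:
    for line in proc.stdout:
        json.loads(line)  # each line is one progress event
        total_exported += 1
    proc.wait()

print(f"exported {total_exported} photos across {len(procs)} workers")
```

One JSON object per line keeps the protocol trivially parseable on the monitoring side.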
This is amazing!!! Will try the new version and provide feedback.
Super results in my 88 pics testing.
The .osxphotos database file is in its default location.
osxphotos version | Run | Pics | SSD | NAS (100) | NAS (1G) | Sec/Pic SSD | Sec/Pic NAS (100) | Sec/Pic NAS (1G) |
---|---|---|---|---|---|---|---|---|
0.44.13 | 1st | 88 | 37.158 | 183.095 | 99.971 | 0.422 | 2.080 | 1.136 |
0.44.13 | 2nd | 88 | 17.630 | 19.160 | 19.797 | 0.200 | 0.218 | 0.224 |
0.45.0 | 1st | 88 | - | 94.290 | 53.304 | - | 1.071 | 0.605 |
0.45.0 | 2nd | 88 | 0.576 | 6.922 | 3.774 | 0.006 | 0.078 | 0.042 |
One thing though: I timed the execution of the 2nd run on NAS (100) with time, and it reports a different value from Elapsed. The first measures the total time; the second possibly measures the export phase only.
$ time bash runit_nas.bash
real 0m42.515s
user 0m21.059s
sys 0m4.283s
(venv)
$ egrep "Processed|Elapsed" so.log
Processed: 88 photos, exported: 0, updated: 0, skipped: 88, updated EXIF data: 0, missing: 0, error: 0, touched date: 0
Elapsed time: 6.922 seconds
A. NAS (100) 1st run profile
ncalls tottime percall cumtime percall filename:lineno(function)
179 45.571 0.255 45.584 0.255 fileutil.py:93(copy)
43548 35.576 0.001 35.576 0.001 {method 'readline' of '_io.BufferedReader' objects}
1 15.064 15.064 33.609 33.609 photosdb.py:1618(_process_database5)
1 8.832 8.832 13.845 13.845 _photosdb_process_searchinfo.py:37(_process_searchinfo)
B. NAS (1G) 1st run profile
ncalls tottime percall cumtime percall filename:lineno(function)
43548 35.254 0.001 35.254 0.001 {method 'readline' of '_io.BufferedReader' objects}
1 14.393 14.393 32.979 32.979 photosdb.py:1618(_process_database5)
1 8.863 8.863 13.914 13.914 _photosdb_process_searchinfo.py:37(_process_searchinfo)
179 7.023 0.039 7.035 0.039 fileutil.py:93(copy)
5. Will now run the 80K file upload/update onto the NAS (100).
Great!
One thing though, I timed the execution of the 2nd run on NAS(100) with time and it does reference a different value from Elapsed.
Yes, the elapsed time only shows the time from when osxphotos begins exporting photos, not any of the setup time. Perhaps I should change that to avoid confusion?
@allcontributors add @oPromessa for ideas, test
@RhetTbull
@oPromessa already contributed before to ideas, test
@RhetTbull
I've put up a pull request to add @oPromessa! :tada:
So after a lot of hacking, I think I have multiprocessing sort of working. There were way too many things in the PhotosDB() object that couldn't be pickled (a requirement for multiprocessing) and I didn't want to refactor it completely, so I made a compromise. I settled for a two-pass approach: the export still has to load the library serially to create the PhotoInfo objects, but then it can export in parallel. On my 2013 MacBook Pro (2.4 GHz Dual-Core Intel Core i5), this does appear to provide a modest improvement in export speed:
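In outline, the two-pass compromise amounts to doing the unpicklable library load once, then fanning small picklable work items (UUIDs) out to a pool; a sketch with hypothetical names, not the actual osxphotos code:

```python
import multiprocessing

def export_one(uuid):
    # stand-in for the real per-photo export; receives only a small,
    # pickleable work item rather than the whole PhotosDB object
    return f"exported {uuid}"

def run_export(uuids, processes=2):
    # pass 1 (serial, done by the caller): load library, collect UUIDs
    # pass 2: fan the UUIDs out to a pool of worker processes
    with multiprocessing.Pool(processes=processes) as pool:
        return pool.map(export_one, uuids)

if __name__ == "__main__":
    print(run_export(["uuid-1", "uuid-2", "uuid-3", "uuid-4"]))
```

pool.map preserves input order, so results line up with the UUID list regardless of which worker handled each photo.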
Single Process
time python -m osxphotos export ~/Desktop/export --uuid-from-file uuid.txt --exiftool --convert-to-jpeg --report single_proc.csv
Using last opened Photos library: /Users/rhet/Pictures/Photos Library.photoslibrary
Exporting 595 photos to /Users/rhet/Desktop/export...
[####################################] 595/595
Processed: 595 photos, exported: 601, missing: 32, error: 0
Elapsed time: 0:03:12
python -m osxphotos export ~/Desktop/export --uuid-from-file uuid.txt 96.87s user 30.02s system 58% cpu 3:38.15 total
Two processes
time python -m osxphotos export ~/Desktop/export --uuid-from-file uuid.txt --exiftool --convert-to-jpeg --report multi_proc.csv --multiprocess 2
Using last opened Photos library: /Users/rhet/Pictures/Photos Library.photoslibrary
Exporting 595 photos to /Users/rhet/Desktop/export...
Starting process number #0
Starting process number #1
Process 0 is done
Process 1 is done
All processes finished
Processed: 595 photos, exported: 583, missing: 1, error: 0
Elapsed time: 0:02:28
python -m osxphotos export ~/Desktop/export --uuid-from-file uuid.txt 2 126.20s user 39.10s system 97% cpu 2:49.35 total
You'll notice the total exported is different between the two runs, so I need to figure out the difference by examining the export reports, but I think I'm getting close to getting this working correctly.
Update: I fixed the discrepancy between number of photos.
Success!
Let me know how I can help in the testing of this. As I believe the branch you're using is not accessible to me!
I pushed the branch to Github: https://github.com/RhetTbull/osxphotos/tree/multiprocessing
Happy for any feedback or code contributions if you're feeling inclined. It's a rough alpha at the moment -- got it working but need to clean up the code quite a bit (most of the changes are in cli.py).
The biggest issue I'm aware of at the moment is that there's a race condition on the filename if multiple processes are trying to export a file with the same name to the same location at once. I have a solution in mind but not yet implemented: "reserve" the filename with an empty file whose name is preceded by a "." -- e.g. if the file being exported is IMG_1234.jpg, the export code would create an empty file named .IMG_1234.jpg to reserve the filename, then remove it upon successful export.
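The reservation can be made atomic with exclusive-create (open mode "x"), so two processes racing for the same name can't both succeed; a sketch with hypothetical helpers:

```python
import os

def reserve_filename(dest_dir, filename):
    """Try to claim filename (or an incremented ' (n)' variant) by
    atomically creating a hidden .name lock file; open(..., "x")
    guarantees only one process can create a given lock."""
    stem, ext = os.path.splitext(filename)
    candidate, count = filename, 0
    while True:
        lock = os.path.join(dest_dir, "." + candidate)
        target = os.path.join(dest_dir, candidate)
        if not os.path.exists(target):
            try:
                open(lock, "x").close()  # atomic: fails if lock exists
                return candidate, lock
            except FileExistsError:
                pass  # another process holds this name
        count += 1
        candidate = f"{stem} ({count}){ext}"

def release(lock):
    # remove the lock once the export of the real file has finished
    os.remove(lock)
```

The losing process simply moves on to the next incremented name instead of colliding.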
Because each of the processes needs to load the entire Photos database, this is likely to only improve performance for exports with a large number of files. It's likely that performance with --convert-to-jpeg and/or --exiftool may improve, as these are CPU-bound and not I/O-bound.
Results with multiprocessing are under the rows "0.45.3 mp=x".
On the run with mp=8 it was rather strange, as the progress bars for all processes were complete but it took quite some time wrapping up before presenting "All processes finished". Will keep testing.
osxphotos version | Run | Pics | SSD | NAS (100) | NAS (1G) | Sec/Pic SSD | Sec/Pic NAS (100) | Sec/Pic NAS (1G) |
---|---|---|---|---|---|---|---|---|
0.44.13 | 1st | 88 | 37.158 | 183.095 | 99.971 | 0.422 | 2.080 | 1.136 |
0.44.13 | 2nd | 88 | 17.630 | 19.160 | 19.797 | 0.200 | 0.218 | 0.224 |
0.45.0 | 1st | 88 | - | 94.290 | 53.304 | - | 1.071 | 0.605 |
0.45.0 | 2nd | 88 | 0.576 | 6.922 | 3.774 | 0.006 | 0.078 | 0.042 |
0.45.3 | 1st | 88 | - | - | 52 | - | - | 0.590 |
0.45.3 mp=2 | 1st | 88 | - | - | 52 | - | - | 0.590 |
0.45.3 mp=4 | 1st | 88 | - | - | 53 | - | - | 0.602 |
0.45.3 mp=8 | 1st | 88 | - | - | 104 | - | - | 1.181 |
0.45.3 mp=4 | 1st | 203 | 65 | - | 91 | 0.320 | - | 0.448 |
Thanks for the data. I'm not surprised that 0.45.x is faster as it's avoiding unnecessary round trips to the NAS so this is in family with what I'd expect. The multiprocessing data is interesting as it indicates that either the overhead of loading the library is dominant or the export is I/O bound. It's probably worth instrumenting the multiprocessing code such that it reports time metrics on how long it took to load the library and how long it spent in export. Unless multiprocessing shows a huge improvement for a large export, I'm not sure it's worth spending too much time on the multiprocessing feature.
on the run with mp=8 it was rather strange as the progress bars for all processes were complete but it took quite some time wrapping up to present All processes finished
Interesting -- I'll take a look and see if I can figure out what's going on. This is the very first python multiprocessing code I've written so quite possible I've done something wrong or inefficient. The progress bar (uses rich.progress) should exit when all tasks have reported done or when no other processes are active (there are two nested loops):
Ran a test with a larger set of pics: 1304. Multiprocessing now shows its potential.
But you were right that some conflict could exist on duplicate filenames.
This test dataset is also tricky as it is the one with multiple pics with the same filename on the same album.
Masked some names with (...)
2: ('Writing metadata with exiftool for AAD019 (5).jpg',)
(...)
Error exporting photo (51E8424E-627C-49DD-ADA4-08FFAAC8DD82: AAD019.jpg) as AAD019.jpg: Error copying file /var/folders/28/x8vbdy4j4vn680b5pt33hs0w0000gn/T/osxphotos_photo_exporter_51E8424E-627C-49DD-ADA4-08FFAAC8DD82_2kcq2cxg/AAD019.jpg to /Volumes/photo/Test.Album/(...)/AAD019 (4).jpg: Error Domain=NSCocoaErrorDomain Code=516 "“AAD019.jpg” couldn’t be copied to “(...)” because an item with the same name already exists." UserInfo={NSSourceFilePathErrorKey=/var/folders/28/x8vbdy4j4vn680b5pt33hs0w0000gn/T/osxphotos_photo_exporter_51E8424E-627C-49DD-ADA4-08FFAAC8DD82_2kcq2cxg/AAD019.jpg, NSUserStringVariant=(
Copy
), NSDestinationFilePath=/Volumes/photo/Test.Album/(...)/AAD019 (4).jpg, NSFilePath=/var/folders/28/x8vbdy4j4vn680b5pt33hs0w0000gn/T/osxphotos_photo_exporter_51E8424E-627C-49DD-ADA4-08FFAAC8DD82_2kcq2cxg/AAD019.jpg, NSUnderlyingError=0x600001e3c510 {Error Domain=NSPOSIXErrorDomain Code=17 "File exists"}} (photoexporter.py: 1127)
Retrying export for photo (51E8424E-627C-49DD-ADA4-08FFAAC8DD82: AAD019.jpg)
osxphotos version | Run | Pics | SSD | NAS (100) | NAS (1G) | Sec/Pic SSD | Sec/Pic NAS (100) | Sec/Pic NAS (1G) |
---|---|---|---|---|---|---|---|---|
0.45.3 | 1st | 1304 | - | - | 672 | - | - | 0.515 |
0.45.3 mp=4 | 1st | 1304 | - | - | 290 | - | - | 0.222 |
But you were right that some conflict could exist on duplicate filenames.
I've made some changes to the multiprocessing branch to use lock files to effectively "reserve" a filename which should minimize if not eliminate file name conflicts. When you get a chance, I'd appreciate you re-running the larger export set to see if you still get name conflicts.
Some quick feedback: I got some WARNING messages about the database being locked.
osxphotos version | Run | Pics | SSD | NAS (100) | NAS (1G) | Sec/Pic SSD | Sec/Pic NAS (100) | Sec/Pic NAS (1G) |
---|---|---|---|---|---|---|---|---|
0.45.3 | 1st | 1304 | - | - | 672 | - | - | 0.515 |
0.45.3 mp=4 | 1st | 1304 | - | - | 290 | - | - | 0.222 |
0.45.8 mp=4 | 1st | 1304 | - | - | 413 | - | - | 0.316 |
Ugh... that's a lot slower (~42%) with the lock files. The lock file code is ugly and could result in stray lock files if the app crashes, etc. I may rip all that out. I'm toying with another solution: start a local process that listens on a socket (e.g. just a mini web server with an API) and communicate over the socket to "claim" a filename. This would likely be much faster than creating and checking for existence of lock files over the NAS, with no possibility of residual lock files. It adds some complexity, though, as I wouldn't want to mess with the server in single-process mode (where there's no risk of collision).
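For illustration, such a "claim a filename" service could be a tiny local TCP server. Everything below (class names, the one-line protocol) is hypothetical, not osxphotos code; it's just a sketch of the idea that the claimed-names set lives in one local process instead of as lock files on the NAS:

```python
import socket
import socketserver
import threading

class NameClaimHandler(socketserver.StreamRequestHandler):
    """One request = one filename; reply CLAIMED or TAKEN."""
    def handle(self):
        name = self.rfile.readline().decode().strip()
        with self.server.lock:
            if name in self.server.claimed:
                self.wfile.write(b"TAKEN\n")
            else:
                self.server.claimed.add(name)
                self.wfile.write(b"CLAIMED\n")

class NameClaimServer(socketserver.ThreadingTCPServer):
    """Tiny local 'claim a filename' service."""
    allow_reuse_address = True
    def __init__(self, addr=("127.0.0.1", 0)):
        super().__init__(addr, NameClaimHandler)
        self.claimed = set()
        self.lock = threading.Lock()

def claim(addr, name):
    """Ask the local server to reserve `name`; True if we got it first."""
    with socket.create_connection(addr) as s:
        s.sendall(name.encode() + b"\n")
        return s.makefile().readline().strip() == "CLAIMED"
```

Each exporting process would call `claim()` before writing; a `TAKEN` reply means pick the next `(n)`-suffixed name and try again. Since the server holds everything in memory, a crash leaves no residue behind.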
Maybe you're more comfortable with a mini web server + API, but wouldn't a control process using Queues and Locks from `multiprocessing` do the trick?
Update: I now get your point about reserving/locking the filename on the filesystem. Other than that, yes, it would need a placeholder to reserve the filenames in memory but then ensure that upon writing them there is no conflict. Hmmm, considerable task! ;)
> Maybe you're more comfortable with a mini web server + API, but wouldn't a control process using Queues and Locks from `multiprocessing` do the trick?
Yes, I'm just not familiar with multiprocess programming and find a server/client model easy to understand. I'll read some more about the `multiprocessing` package -- maybe a lock that could be acquired by the file-naming method would work.
However, one consideration is that I want to minimize the places in the code that have to be "aware" of multiprocessing, to not further complicate the code. Currently, only a couple of methods in `cli.py` need to deal with multiprocessing. The actual code that does the export (in the `PhotoExporter` class) doesn't need to know whether it's running in a single process or several. Adding things like locks or interprocess queues means I have to pass that state around, and then Python starts to get really whiny about which things can be pickled and which can't.
One possible solution is the dependency-injection method I use for `--dry-run`. The `PhotoExporter` class gets an instance of an `ExportDB` class which handles the export database and a `FileUtil` class which handles things like `copy()`, `move()`, `unlink()`, etc. In dry-run mode, `PhotoExporter` gets no-op versions of those classes, so all the export logic runs just as it would in the actual export but nothing actually gets exported. I suppose the `FileUtil` class could be extended with a `FileUtilMultiProcess` version which took care of the file renaming by acquiring a lock. I'll have to think about this and read up on using locks.
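The dependency-injection pattern described above might look roughly like this sketch. The class and method names mirror the ones mentioned in the thread, but the real osxphotos classes are more involved; this only shows the shape of the idea:

```python
import os
import shutil

class FileUtil:
    """Real file operations used by the exporter."""
    @staticmethod
    def copy(src, dest):
        shutil.copy2(src, dest)

    @staticmethod
    def unlink(path):
        os.unlink(path)

class FileUtilNoOp(FileUtil):
    """Dry-run stand-in: identical interface, touches nothing."""
    @staticmethod
    def copy(src, dest):
        pass

    @staticmethod
    def unlink(path):
        pass

class Exporter:
    """Export logic sees only the injected file-util class, so the same
    code path runs unchanged for real exports, dry runs, or (hypothetically)
    a lock-acquiring FileUtilMultiProcess variant."""
    def __init__(self, fileutil=FileUtil):
        self.fileutil = fileutil

    def export(self, src, dest):
        self.fileutil.copy(src, dest)
        return dest
```

A `FileUtilMultiProcess` subclass could then override only the naming/copy path to acquire a shared lock, leaving `Exporter` itself oblivious to multiprocessing.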
> Other than that yes, it would need a placeholder to reserve the filenames in memory
I think this would suffice if all file names were obtained through a common method that kept track of which names had been given out. If this was done with a lock, it could be guaranteed that only a single instance of a name was ever handed out.
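A minimal sketch of such a common name-handout method, using the " (n)" suffix convention seen in the error above (shown with threading primitives for simplicity; with worker *processes* the set and lock would have to come from a `multiprocessing.Manager` rather than module-local objects):

```python
import os
import threading

def make_name_allocator():
    """Return a function that hands out each target filename at most once.

    On collision it appends ' (n)' before the extension, the way the
    duplicate-name copies in the error log above are named. Sketch only:
    not osxphotos code.
    """
    claimed = set()
    lock = threading.Lock()

    def next_unique(base):
        with lock:
            name, n = base, 1
            stem, ext = os.path.splitext(base)
            while name in claimed:
                name = f"{stem} ({n}){ext}"
                n += 1
            claimed.add(name)
            return name

    return next_unique
```

Because every name passes through the one locked method, two workers can never be handed the same target, which removes the race that produced the "File exists" error.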
`0.47.1`:

* with `.osxphotos_export.db` on the remote NAS server: we know it's slow! A bit slower than 0.45.3.
* with the `--ramdb` option we have a whopping performance improvement.

osxphotos version | Run | Pics | SSD | NAS (100) | NAS (1G) | Sec/Pic SSD | Sec/Pic NAS (100) | Sec/Pic NAS (1G) |
---|---|---|---|---|---|---|---|---|
0.45.3 | 1st | 1304 | - | - | 672 | - | - | 0.515 |
0.45.8 mp=4 (locking) | 1st | 1304 | - | - | 413 | - | - | 0.316 |
0.47.1 | 1st | 1304 | - | - | 700 | - | - | 0.536 |
0.47.1 --ramdb | 1st | 1304 | - | - | 390 | - | - | 0.299 |
Will see if I can merge `multiprocessing` with `0.47.1` `master` branches to run some more tests!

`0.47.1` profile results:

```
Profiling completed
89654371 function calls (78067164 primitive calls) in 771.205 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
227172 211.915 0.001 211.915 0.001 {method 'readline' of '_io.BufferedReader' objects}
5219 195.555 0.037 195.555 0.037 {method 'commit' of 'sqlite3.Connection' objects}
13504 123.162 0.009 123.162 0.009 {method 'execute' of 'sqlite3.Cursor' objects}
2611 104.748 0.040 105.405 0.040 fileutil.py:93(copy)
3147 35.444 0.011 35.444 0.011 {built-in method posix.listdir}
1 14.478 14.478 31.781 31.781 photosdb.py:1576(_process_database5)
22867 10.413 0.000 10.447 0.000 {built-in method posix.stat}
4614293/92738 8.221 0.000 26.330 0.000 __init__.py:234(parse)
1 7.820 7.820 12.533 12.533 _photosdb_process_searchinfo.py:37(_process_searchinfo)
2608 6.324 0.002 6.324 0.002 {built-in method posix.utime}
2674454/92738 3.647 0.000 25.533 0.000 __init__.py:353(_parse)
851611/92737 2.817 0.000 6.027 0.000 model.py:477(process_node)
4348 2.662 0.001 2.662 0.001 {method 'fetchone' of 'sqlite3.Cursor' objects}
3910 2.142 0.001 214.343 0.055 exiftool.py:305(run_commands)
4386876 2.062 0.000 3.342 0.000 utils.py:418(normalize_unicode)
2440290/1500458 1.960 0.000 2.734 0.000 __init__.py:108(flatten)
1760147/1759614 1.771 0.000 8.185 0.000 __init__.py:759(parse)
```
`0.47.1 --ramdb` profile results:

```
Profiling completed
89609322 function calls (77959587 primitive calls) in 467.076 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
220397 251.014 0.001 251.014 0.001 {method 'readline' of '_io.BufferedReader' objects}
2527 74.567 0.030 75.771 0.030 fileutil.py:93(copy)
2781 25.982 0.009 25.982 0.009 {built-in method posix.listdir}
1 14.858 14.858 32.100 32.100 photosdb.py:1576(_process_database5)
1766 14.523 0.008 14.523 0.008 {built-in method posix.unlink}
23482 10.171 0.000 10.204 0.000 {built-in method posix.stat}
4639493/93242 8.086 0.000 26.028 0.000 __init__.py:234(parse)
1 7.781 7.781 12.435 12.435 _photosdb_process_searchinfo.py:37(_process_searchinfo)
2524 4.165 0.002 4.165 0.002 {built-in method posix.utime}
2 4.134 2.067 4.134 2.067 {method 'executescript' of 'sqlite3.Cursor' objects}
2689070/93242 3.563 0.000 25.232 0.000 __init__.py:353(_parse)
856147/93241 2.822 0.000 5.956 0.000 model.py:477(process_node)
3742 2.097 0.001 253.388 0.068 exiftool.py:305(run_commands)
4386876 2.068 0.000 3.332 0.000 utils.py:418(normalize_unicode)
2453394/1508522 1.938 0.000 2.684 0.000 __init__.py:108(flatten)
1769723/1769190 1.790 0.000 8.178 0.000 __init__.py:759(parse)
```
Great! The export code refactor in 0.46 requires more reads/writes to the database, so I'm not surprised it's a bit slower, but it will be much more robust.
> Will see if I can merge `multiprocessing` with `0.47.1` `master` branches to run some more tests!
That may be difficult as I've refactored big sections of the affected code. It might be easier to add the multiprocessing code back in manually.
It might be worth adding a checkpoint option that periodically writes the export database from RAM to disk.
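Sketching what such a checkpoint could look like with sqlite3's backup API (the function name and call pattern are illustrative, not the actual osxphotos export-DB code):

```python
import sqlite3

def checkpoint(mem_conn, path):
    """Persist an in-memory SQLite database to `path`.

    Uses Connection.backup (Python 3.7+), a page-level copy of the whole
    database, so it is cheap enough to call every N exported photos and
    bounds how much export history a crash could lose.
    """
    disk = sqlite3.connect(path)
    try:
        mem_conn.backup(disk)
    finally:
        disk.close()
```

Calling `checkpoint(conn, export_db_path)` every, say, 100 photos would keep most of `--ramdb`'s speed while limiting the data at risk.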
By the way, I don't think `--ramdb` will be compatible with multiprocessing, as each process would get its own copy of the database.
Though I have way too many projects at the moment, I had an idea last night, and I think I have a new design that could resurrect multiprocessing without too much rework and without the cost of the previous design, which had to re-read the database inside each process.
The primary limitation for multiprocessing at the moment is the fact that the PhotosDB and PhotoInfo objects can't be serialized and sent between processes. It would be possible to create a new PhotoInfo subclass that, instead of reading the database, reads a JSON serialization of a PhotoInfo object. The ability to export to JSON is already built into `PhotoInfo.json()`; I'd need to add the ability to instantiate a PhotoInfo object from the JSON. Then the JSON could be passed between processes (or just written to a temp SQLite database that all processes could read) and all the rest of the code would work as expected.
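The round-trip idea can be sketched with a dataclass standing in for PhotoInfo. Note that `PhotoInfo.json()` exists in osxphotos already; the `from_json` constructor is the hypothetical part, and the real class carries far more fields than this toy:

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class PhotoRecord:
    """Toy stand-in for the serializable subset of PhotoInfo."""
    uuid: str
    original_filename: str
    keywords: list = field(default_factory=list)

    def to_json(self):
        """Serialize to a JSON string (PhotoInfo.json() plays this role)."""
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, data):
        """Rebuild an equivalent object without touching the Photos DB."""
        return cls(**json.loads(data))
```

Because the JSON is a plain string it pickles trivially, so it can go through a `multiprocessing.Queue` to workers, or be staged in a temp SQLite table that every process reads, without each worker having to re-open the Photos library.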
**Is your feature request related to a problem? Please describe.**
Currently processing ~33000 pics of a Photos DB. The full Photos size is ~90000.

* It takes about 1.7 hours to process the `--update` phase
* plus the `delete` phase

So, if we extrapolate, it could take up to 15 hours to process 90000 pics.

**Describe the solution you'd like**
Maybe launch multiple copy tasks in parallel? Add a flag `--multiprocessing <n>` with the number of processes to launch. To achieve this in other scenarios I've used, with relative success, the `multiprocessing` library on https://github.com/oPromessa/flickr-uploader/blob/master/lib/mprocessing.py. Of course the code must be aware of multiprocessing and use locks to access the database, and it will depend on the performance of the underlying HW and LAN.

**Describe alternatives you've considered**
Hmmm. Launch multiple osxphotos at the same time, each with its own selection/filter process. Problem is that, if each one uses its own `.osxphotos_db` to avoid concurrency, you'll end up deleting each other's pics.

**Additional context**
I'm copying over a 100Mbps LAN (will upgrade to 1Gbps LAN shortly) from an iMac (i7, 16GB, SSD) into a Synology DSM 918+.