RhetTbull / osxphotos

Python app to work with pictures and associated metadata from Apple Photos on macOS. Also includes a package to provide programmatic access to the Photos library, pictures, and metadata.
MIT License

download missing path is called even when not necessary #1086

Open RhetTbull opened 1 year ago

RhetTbull commented 1 year ago

That's somewhat strange because --download-missing should do nothing: all images were downloaded when it was run (it says 55 missing but the library is entirely stored locally). And --exiftool should do almost nothing because only a handful of files out of 110K had any metadata changes. So, in an ideal scenario, --download-missing should have had no effect because nothing is missing, and --exiftool should have negligible impact because it's relevant to just 17 files.

I will try dropping --download-missing - it's there only "just in case" in my setup.

Originally posted by @MaxLyt in https://github.com/RhetTbull/osxphotos/discussions/1083#discussioncomment-6178416

RhetTbull commented 1 year ago

I confirmed that when --download-missing is used, the download missing path (_stage_photos_for_export_with_applescript, etc.) is called even when not necessary.

RhetTbull commented 1 year ago

I think there are actually two bugs:

  1. _stage_photo_for_export is called before the check for _should_update_photo (a sketch of the intended ordering follows this list)
  2. The download missing code is staging the file even when it's not necessary
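
A minimal sketch of the fix for the first bug, ignoring for now the case where the should-update check itself needs the staged source file. The method names follow the discussion above, but the bodies and signatures are illustrative, not the actual photoexporter.py code:

```python
# Hypothetical sketch -- not the actual osxphotos implementation.
def export(self, dest, options):
    if not self._should_update_photo(dest, options):
        # Nothing to do: skip staging entirely so --download-missing never
        # triggers an AppleScript download for a file that is already up to date.
        return []
    # Only now stage the photo, which may download it from iCloud if missing.
    staged = self._stage_photos_for_export(options)
    return self._export_photo(staged, dest, options)
```
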
RhetTbull commented 1 year ago

Confirmed the code is spending a lot of time in the applescript path when it should not be:

[export]
cleanup = true
db = "/Users/user/Pictures/Photos Library.photoslibrary"
directory = "{created.year}/{created.mm}/{created.dd}"
download_missing = true
filename_template = "{original_name}_{created.year}{created.mm}{created.dd}"
keep = [ "/Volumes/OWC_Backup/Photos/user/export.toml",]
ramdb = true
report = "/Volumes/OWC_Backup/Photos/user/export_{today}.db"
sidecar = [ "xmp", "exiftool",]
update = true
verbose = 1
append = true
Exporting 37108 photos ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
Processed: 37108 photos, exported: 406, updated: 0, skipped: 41083, updated EXIF data: 0, missing: 77, error: 0
Elapsed time: 2:18:45
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000 9115.246 9115.246 core.py:1393(invoke)
        1    0.000    0.000 9115.246 9115.246 core.py:709(invoke)
        1    0.000    0.000 9115.245 9115.245 decorators.py:37(new_func)
        1    0.000    0.000 9115.245 9115.245 decorators.py:25(new_func)
        1    0.000    0.000 9115.245 9115.245 crash_reporter.py:33(wrapped)
        1    0.763    0.763 9115.245 9115.245 export.py:102(export)
    37108    1.296    0.000 7719.862    0.208 export.py:1694(export_photo)
    38436    0.569    0.000 7385.979    0.192 export.py:2119(export_photo_to_directory)
    38436    1.058    0.000 7338.927    0.191 photoexporter.py:401(export)
    38436    0.370    0.000 4509.920    0.117 photoexporter.py:795(_stage_photos_for_export)
    38436    0.785    0.000 4507.535    0.117 photoexporter.py:956(_stage_photo_for_export_with_applescript)
    38436    0.970    0.000 4498.235    0.117 photoexporter.py:1251(_export_photo_uuid_applescript)
   529706    0.886    0.000 4400.075    0.008 script_loader.py:17(run_script)
   529706 4376.470    0.008 4399.189    0.008 __init__.py:92(call)
   105612    1.163    0.000 3178.140    0.030 __init__.py:1138(export)
   105612    1.455    0.000 3033.812    0.029 __init__.py:519(_export_photo)
   854310 2822.722    0.003 2824.567    0.003 {built-in method posix.stat}
   514358    0.253    0.000 2645.195    0.005 pathlib.py:1008(stat)
   155770    0.216    0.000 2098.695    0.013 pathlib.py:1230(exists)
    38436    0.509    0.000 1341.460    0.035 photoexporter.py:649(_get_dest_path)
    38436    1.232    0.000 1311.668    0.034 photoexporter.py:1354(_write_sidecar_files)
   107258    0.275    0.000  876.047    0.008 __init__.py:989(__init__)
        1    0.406    0.406  656.198  656.198 export.py:2480(cleanup_files)
    37109    0.404    0.000  574.600    0.015 report_writer.py:191(write)
   126819    0.091    0.000  542.345    0.004 pathlib.py:1261(is_file)
   195247  525.199    0.003  525.199    0.003 {method 'commit' of 'sqlite3.Connection' objects}
    37108    0.468    0.000  491.607    0.013 photoexporter.py:1951(_xmp_sidecar)
    37108    0.318    0.000  479.127    0.013 template.py:225(__init__)
    37108    0.247    0.000  478.455    0.013 template.py:369(_compile_from_file)
    37108    0.384    0.000  475.820    0.013 template.py:675(_compile_text)
   105612    0.072    0.000  435.432    0.004 __init__.py:1127(filename)
    37108    0.265    0.000  428.448    0.012 template.py:650(_compile)
    37108    9.760    0.000  289.528    0.008 lexer.py:218(parse)
   154652    0.951    0.000  177.815    0.001 verbose.py:231(rich_verbose_)
   154661    1.290    0.000  176.757    0.001 console.py:1618(print)
    38436    0.156    0.000  174.290    0.005 export.py:2350(get_dirnames_from_template)
    41489    0.913    0.000  163.018    0.004 photoexporter.py:1076(_export_photo)
  6976304   17.760    0.000  159.692    0.000 lexer.py:124(append_node)
  3895682  155.790    0.000  157.899    0.000 {built-in method builtins.next}
27032/5490    0.147    0.000  155.708    0.028 <frozen os>:345(_walk)
7711476/5746711    2.408    0.000  145.839    0.000 {method 'extend' of 'list' objects}
    37108    0.220    0.000  138.360    0.004 codegen.py:31(compile)
519512/37108    2.053    0.000  137.354    0.004 codegen.py:106(__init__)
   120114    0.109    0.000  132.717    0.001 <frozen genericpath>:39(isdir)
12203694/2964400    6.592    0.000  130.777    0.000 console.py:1281(render)
   105612    0.278    0.000  128.679    0.001 __init__.py:60(__init__)
  7458708    3.965    0.000  119.168    0.000 lexer.py:403(match_expression)
3247214/773198    4.083    0.000  108.743    0.000 segment.py:265(split_and_crop_lines)
  1845946    0.586    0.000  108.002    0.000 live_render.py:82(__rich_console__)
1391635/154627    2.380    0.000  105.695    0.001 console.py:1332(render_lines)
RhetTbull commented 1 year ago

Making progress. Cut export time almost in half when using --download-missing.

Before:

Processed: 37108 photos, exported: 406, updated: 0, skipped: 41083, updated EXIF data: 0, missing: 77, error: 0
Elapsed time: 2:18:45

After:

Processed: 37110 photos, exported: 5, updated: 0, skipped: 41488, updated EXIF data: 0, missing: 77, error: 0
Elapsed time: 1:18:59

There's still a flaw in the export logic for the --download-missing case that will, in some cases, cause files to be downloaded when not necessary. The case: if a file is missing from disk and the user has requested that particular version (e.g. edited) with --update and --download-missing, the file will be downloaded before the check happens to see whether the file actually needs to be updated. Fixing this will require major changes to the "should a file be updated" algorithm, and I'll have to think about how much time I want to put into fixing it. In general, you should not be using osxphotos for an export if you have "Optimize Mac Storage" set (which is what would most likely lead to this condition) because you cannot control which files Photos will keep local and which it keeps in the cloud. For a one-time export this is fine, as the file would need to be downloaded anyway, but if you're using --update you really should be using "Keep originals on Mac", not "Optimize Mac Storage".

RhetTbull commented 1 year ago

New opportunities for optimization now that the code is no longer spending lots of time in the download missing path:

368430    0.172    0.000 3550.905    0.010 pathlib.py:1008(stat)
   155766    0.187    0.000 2992.869    0.019 pathlib.py:1230(exists)
    38440    0.292    0.000 2039.234    0.053 photoexporter.py:656(_get_dest_path)
    38440    1.099    0.000 1425.873    0.037 photoexporter.py:1397(_write_sidecar_files)

If I could optimize exists() and stat() this would be a huge speed up, especially for networked drives.

_write_sidecar_files() will be hard to optimize, as the code must build the sidecar file in memory and compute the hex digest to determine whether the sidecar file has changed and should be written.
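
For context, a rough sketch of that check with a hypothetical helper (not the actual photoexporter.py code): the sidecar is rendered in memory, hashed, and only written when the digest differs from the one recorded for the previous export, so the render and hash cost is paid even when nothing is written.

```python
import hashlib
import pathlib
from typing import Optional

def write_sidecar_if_changed(
    sidecar_path: pathlib.Path, content: str, previous_digest: Optional[str]
) -> str:
    """Write the sidecar only if its content changed; return the new digest.

    Hypothetical helper illustrating the digest comparison described above.
    """
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if previous_digest == digest and sidecar_path.exists():
        # Unchanged: skip the write, but the render + hash work was still done.
        return digest
    sidecar_path.write_text(content, encoding="utf-8")
    return digest
```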

RhetTbull commented 1 year ago

If I could optimize exists() and stat() this would be a huge speed up, especially for networked drives.

I did some benchmarking comparing os.path.exists() / os.stat() with their pathlib.Path() counterparts. The os.* versions are about 33% faster (consistent with what I'd previously heard about pathlib); however, I don't think I can optimize these, as the code already uses pathlib features in quite a few places and converting a pathlib.Path to str to use with the os functions eliminates any speed gains.

Profile results

```
Creating 1000 files in /Volumes/OWC_Backup/Profile/
Profiling 1000 over 1000 files
Profiling os.path.exists() for /Volumes/OWC_Backup/Profile/
Time taken: 2.3415390215814114
Profiling exists() for /Volumes/OWC_Backup/Profile/
Time taken: 2.266264562495053
Profiling pathlib.Path().exists() for /Volumes/OWC_Backup/Profile/
Time taken: 3.1254842951893806
Profiling os.path.exists(pathlib.Path()) for /Volumes/OWC_Backup/Profile/
Time taken: 3.1027533151209354
Profiling os.path.isfile() for /Volumes/OWC_Backup/Profile/
Time taken: 2.265186338685453
Profiling isfile() for /Volumes/OWC_Backup/Profile/
Time taken: 2.25291240029037
Profiling pathlib.Path().is_file() for /Volumes/OWC_Backup/Profile/
Time taken: 3.1593630015850067
Profiling os.stat() for /Volumes/OWC_Backup/Profile/
Time taken: 2.220045213587582
Profiling stat() for /Volumes/OWC_Backup/Profile/
Time taken: 2.2054089214652777
Profiling pathlib.Path().stat() for /Volumes/OWC_Backup/Profile/
Time taken: 3.125966629013419
```
Profile code

```python
"""Profile os.path to pathlib.Path performance"""

import itertools
import os
import pathlib
import time
from os import stat
from os.path import exists, isfile

# Specify the file path for testing
file_path = "/Volumes/OWC_Backup/Profile/"
FILES = 1000
ITERATIONS = 1000

print(f"Creating {FILES} files in {file_path}")
for i in range(FILES):
    fpath = pathlib.Path(file_path) / f"profile_{i}.txt"
    fpath.touch()

print(f"Profiling {ITERATIONS} over {FILES} files")

print(f"Profiling os.path.exists() for {file_path}")
total_t = 0
for _, i in itertools.product(range(ITERATIONS), range(FILES)):
    fpath = f"{file_path}/profile_{i}.txt"
    start_t = time.perf_counter()
    os.path.exists(fpath)
    stop_t = time.perf_counter()
    total_t += stop_t - start_t
print(f"Time taken: {total_t}")

print(f"Profiling exists() for {file_path}")
total_t = 0
for _, i in itertools.product(range(ITERATIONS), range(FILES)):
    fpath = f"{file_path}/profile_{i}.txt"
    start_t = time.perf_counter()
    exists(fpath)
    stop_t = time.perf_counter()
    total_t += stop_t - start_t
print(f"Time taken: {total_t}")

print(f"Profiling pathlib.Path().exists() for {file_path}")
total_t = 0
for _, i in itertools.product(range(ITERATIONS), range(FILES)):
    fpath = pathlib.Path(f"{file_path}/profile_{i}.txt")
    start_t = time.perf_counter()
    fpath.exists()
    stop_t = time.perf_counter()
    total_t += stop_t - start_t
print(f"Time taken: {total_t}")

print(f"Profiling os.path.exists(pathlib.Path()) for {file_path}")
total_t = 0
for _, i in itertools.product(range(ITERATIONS), range(FILES)):
    fpath = pathlib.Path(f"{file_path}/profile_{i}.txt")
    start_t = time.perf_counter()
    os.path.exists(fpath)
    stop_t = time.perf_counter()
    total_t += stop_t - start_t
print(f"Time taken: {total_t}")

print(f"Profiling os.path.isfile() for {file_path}")
total_t = 0
for _, i in itertools.product(range(ITERATIONS), range(FILES)):
    fpath = f"{file_path}/profile_{i}.txt"
    start_t = time.perf_counter()
    os.path.isfile(fpath)
    stop_t = time.perf_counter()
    total_t += stop_t - start_t
print(f"Time taken: {total_t}")

print(f"Profiling isfile() for {file_path}")
total_t = 0
for _, i in itertools.product(range(ITERATIONS), range(FILES)):
    fpath = f"{file_path}/profile_{i}.txt"
    start_t = time.perf_counter()
    isfile(fpath)
    stop_t = time.perf_counter()
    total_t += stop_t - start_t
print(f"Time taken: {total_t}")

print(f"Profiling pathlib.Path().is_file() for {file_path}")
total_t = 0
for _, i in itertools.product(range(ITERATIONS), range(FILES)):
    fpath = pathlib.Path(f"{file_path}/profile_{i}.txt")
    start_t = time.perf_counter()
    fpath.is_file()
    stop_t = time.perf_counter()
    total_t += stop_t - start_t
print(f"Time taken: {total_t}")

print(f"Profiling os.stat() for {file_path}")
total_t = 0
for _, i in itertools.product(range(ITERATIONS), range(FILES)):
    fpath = f"{file_path}/profile_{i}.txt"
    start_t = time.perf_counter()
    stat(fpath)
    stop_t = time.perf_counter()
    total_t += stop_t - start_t
print(f"Time taken: {total_t}")

print(f"Profiling stat() for {file_path}")
total_t = 0
for _, i in itertools.product(range(ITERATIONS), range(FILES)):
    fpath = f"{file_path}/profile_{i}.txt"
    start_t = time.perf_counter()
    stat(fpath)
    stop_t = time.perf_counter()
    total_t += stop_t - start_t
print(f"Time taken: {total_t}")

print(f"Profiling pathlib.Path().stat() for {file_path}")
total_t = 0
for _, i in itertools.product(range(1000), range(1000)):
    fpath = pathlib.Path(f"{file_path}/profile_{i}.txt")
    start_t = time.perf_counter()
    fpath.stat()
    stop_t = time.perf_counter()
    total_t += stop_t - start_t
print(f"Time taken: {total_t}")
```
RhetTbull commented 1 year ago

More improvements when using XMP sidecars. The Mako template engine was taking a long time to compile the XMP template, and this was happening for every photo. The template actually only needs to be compiled once and can then be re-used for each photo (to render the data specific to that photo). Memoizing the compiled template shaved more time off; we're down to 35% of the original export time.

Processed: 37113 photos, exported: 2, updated: 0, skipped: 41493, updated EXIF data: 0, missing: 78, error: 0
Elapsed time: 0:47:33
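
The caching itself can be as simple as memoizing the compiled mako.template.Template keyed by the template file; this is a sketch of the idea, not the exact osxphotos change:

```python
from functools import lru_cache

from mako.template import Template  # Mako is the engine used for the XMP sidecar template

@lru_cache(maxsize=None)
def _compiled_template(template_path: str) -> Template:
    """Compile the sidecar template once; subsequent calls reuse the compiled object."""
    return Template(filename=template_path)

def render_xmp_sidecar(template_path: str, **photo_fields) -> str:
    # Rendering still happens per photo; compilation now happens only on the first call.
    return _compiled_template(template_path).render(**photo_fields)
```
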
RhetTbull commented 1 year ago

@all-contributors please add @MaxLyt for bugs

allcontributors[bot] commented 1 year ago

@RhetTbull

I've put up a pull request to add @MaxLyt! :tada:

RhetTbull commented 1 year ago

I'm going to leave this open for now as there is still one case where files can be unnecessarily downloaded when using --download-missing (described above), but for normal use this should not have a big impact. The primary case identified by @MaxLyt in #1083, where --download-missing was causing update speed to slow down, has been fixed in v0.60.2.

torarnv commented 1 month ago

In general, you should not be using osxphotos for an export if you have "Optimize Mac Storage" set (which is what would most likely lead to this condition) because you cannot control which files Photos will keep local and which it keeps in the cloud. For a one-time export this is fine, as the file would need to be downloaded anyway, but if you're using --update you really should be using "Keep originals on Mac", not "Optimize Mac Storage".

I'm a bit surprised by this, to be honest. My use case is that I use iCloud Photos as my primary photo storage, and have "Optimize Mac Storage" enabled on my iPhone, iPad, and Macs, because I don't need to maintain a local copy of every single photo on my devices, nor do I have the space to do so (500+ GB, soon 1+ TB).

But I do want to make a backup of the photo library to my NAS, in case iCloud Photos ever has an unrecoverable error where I lose access to the photos.

Doing such a backup by enabling "Keep originals on Mac" and pointing the system library to a NAS shared folder is generally advised against. And keeping the originals on my Mac loses the advantage of the RAID setup in my NAS.

Exporting the photos to my NAS nightly via osxphotos seemed like the perfect fit. I would have expected that I could enable "optimize Mac Storage" and osxphotos would determine which files would need to be downloaded based on the metadata that the local Photos database stores about the photos (changed, edited, new, etc).

I'm wondering, which parts of _should_update_photo rely on the downloaded photo itself? What are the steps needed to move the _should_update_photo check earlier in the flow, so that we can skip the download? Thanks! :)

RhetTbull commented 1 month ago

@torarnv

There are other tools, like iCloud Photos Downloader, that are optimized for downloading from iCloud. The OSXPhotos design is based on direct access to the Photos library (because Apple's Photos API is far too limited). OSXPhotos only reaches for the Apple API when absolutely necessary (like downloading from iCloud).

I just looked at the code for PhotoExporter._should_update_photo() and see two places that will need to be updated. One is easy (and actually a bug) and the other will take more work:

  1. In the case where the user had used --export-as-hardlink and then stops doing so, the code must verify that the destination and the source are not the same file. There's a bug in the logic that causes _should_update_photo to always return True if the source is missing; all that's needed is to check that the source exists and, if so, that it's not the same file as the destination, and if the source does not exist, to verify that the destination is valid:
    if not options.export_as_hardlink and (not src or dest.samefile(src)):
        # same file but not exporting as hardlink, should update
        return ShouldUpdate.NOT_HARDLINK_SAME_FILE
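
For illustration, the corrected condition could be factored into a small helper like the hypothetical one below (names are mine, not the actual osxphotos code): it only reports "same file" when the source actually exists, leaving the missing-source case to the other checks.

```python
import pathlib
from typing import Optional

def is_stale_hardlink(
    dest: pathlib.Path, src: Optional[pathlib.Path], export_as_hardlink: bool
) -> bool:
    """Return True only when dest is the same file as an existing src.

    Hypothetical helper: if src is missing we return False so the caller falls
    through to its other checks (e.g. verifying the destination is valid)
    instead of unconditionally forcing an update.
    """
    if export_as_hardlink or src is None:
        return False
    return src.exists() and dest.exists() and dest.samefile(src)
```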

Alternatively we could just check that the destination is not a hardlink:

    import os

    def is_hardlink(filepath):
        if os.path.exists(filepath):
            stat = os.stat(filepath)
            return stat.st_nlink > 1
        return False

This might fail, though, if the user created a hardlink to the exported file after export. Need to think about this.

  2. If the edited version of the photo is being exported, it is always downloaded. This is because OSXPhotos currently checks the signature of the edited file in the Photos library vs the signature of the file in the export. This data is available in the database for the original image but not for the edited image. It would be better to store and check the photo modification date in the Photos database and update if the date is different. This is not perfect as the modification date gets updated if any metadata such as a keyword, favorite status, etc. changes -- there is no way AFAIK to tell explicitly if the photo was edited.
        if options.edited and (
            not src or not fileutil.cmp_file_sig(src, file_record.src_sig)
        ):
            # edited file in Photos doesn't match what was last exported
            return ShouldUpdate.EDITED_SIG_DIFFERENT

This change isn't too difficult but requires a schema change to the export database which requires extensive testing. I'll add this to the queue but it may be a while. My time for working on OSXPhotos is very limited at the moment and I'm trying to get a big update shipped that improves the osxphotos import feature.

torarnv commented 3 weeks ago

Thanks for the detailed explanation @RhetTbull, and sorry for the delay in replying!

There are other tools like iCloud Photos Downloader that are optimized for downloading from iCloud.

I'm aware of iCloud Photos Downloader; I wrote its original iCloud Photo Library service back in 2016 😃 Unfortunately, the 2FA quirks and the reliance on the ever-changing iCloud web APIs felt too fragile to put into production, so I've been looking for something more stable lately, and was hoping osxphotos could be just that.

The OSXPhotos design is based on direct access to the photos library (because the Apple's Photos API is far too limited). OSXPhotos only reaches for the Apple API when absolutely necessary (like downloading from iCloud).

I guess this means that osxphotos has similar problems with changes to the internal database schema? Or is that a pretty rare occurrence?

  2. If the edited version of the photo is being exported, it is always downloaded. This is because OSXPhotos currently checks the signature of the edited file in the Photos library vs the signature of the file in the export. This data is available in the database for the original image but not for the edited image. It would be better to store and check the photo modification date in the Photos database and update if the date is different. This is not perfect as the modification date gets updated if any metadata such as a keyword, favorite status, etc. changes -- there is no way AFAIK to tell explicitly if the photo was edited.
        if options.edited and (
            not src or not fileutil.cmp_file_sig(src, file_record.src_sig)
        ):
            # edited file in Photos doesn't match what was last exported
            return ShouldUpdate.EDITED_SIG_DIFFERENT

This change isn't too difficult but requires a schema change to the export database which requires extensive testing. I'll add this to the queue but it may be a while. My time for working on OSXPhotos is very limited at the moment and I'm trying to get a big update shipped that improves the osxphotos import feature.

I'm happy to help with this if you haven't gotten to it yet. Is there already an issue for this? Can you point me to a similar change that updates the database schema, that I can base it on? Thanks!

torarnv commented 3 weeks ago

If the edited version of the photo is being exported, it is always downloaded. This is because OSXPhotos currently checks the signature of the edited file in the Photos library vs the signature of the file in the export. This data is available in the database for the original image but not for the edited image. It would be better to store and check the photo modification date in the Photos database and update if the date is different. This is not perfect as the modification date gets updated if any metadata such as a keyword, favorite status, etc. changes -- there is no way AFAIK to tell explicitly if the photo was edited.

What if we store the timestamp of the AdjustmentsInfo in the export db? We could then check if the photo has been modified, and if the AdjustmentsInfo is up to date, we know that we don't need to download the edited photo, and can just update the other properties/exif.
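
A hedged sketch of that idea, assuming the adjustment timestamp is exposed as photo.adjustments.timestamp and that the export database can store one extra value per photo (attribute and column names are assumptions, not the actual schema):

```python
def edited_needs_export(photo, last_exported_adjustment_timestamp):
    """Decide whether the edited version must be re-downloaded/re-exported.

    Hypothetical sketch: `photo` is an osxphotos PhotoInfo and
    `last_exported_adjustment_timestamp` is whatever was recorded in the
    export db at the previous export (names assumed for illustration).
    """
    adjustments = photo.adjustments  # AdjustmentsInfo, or None if never edited
    if adjustments is None:
        return False  # no edited version exists
    if last_exported_adjustment_timestamp is None:
        return True  # edited version never exported before
    # Only re-export when the adjustments actually changed since the last export.
    return adjustments.timestamp != last_exported_adjustment_timestamp
```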

RhetTbull commented 3 weeks ago

I guess this means that osxphotos has similar problems with changes to the internal database schema? Or is that a pretty rare occurrence?

Yes, osxphotos must be updated every time a new version of Photos is released (this means annually). Since Catalina, the changes have been pretty minor. Photos version 5.0 (Catalina) was a major rewrite and required significant re-engineering to make osxphotos work.

I have a separate project where I'm working on an interface that uses the Apple API, including undocumented private APIs where needed, to access the library. This should be more robust than the direct database access (but also subject to the whims of Apple as I'm using private APIs).

What if we store the timestamp of the AdjustmentsInfo in the export db? We could then check if the photo has been modified, and if the AdjustmentsInfo is up to date, we know that we don't need to download the edited photo, and can just update the other properties/exif.

That's a possibility -- will look into this.

I'm happy to help with this if you haven't gotten to it yet. Is there already an issue for this? Can you point me to a similar change that updates the database schema, that I can base it on? Thanks!

Thanks. I don't currently have a separate issue -- using this one to track this. I think we might be able to use the modified date (which is currently stored in the database in the photoinfo table as part of the JSON object stored for each photo) to do this. If that doesn't work, we could add an extra column to export_data or create a new table to store this data. When I have time (will be a few days) I'll look through some old PRs to find one that updates the schema.
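
For reference, a sketch of reading that stored value back out of the export database (the photoinfo table is from the comment above; the column names and the exact JSON key are assumptions on my part):

```python
import json
import sqlite3
from typing import Optional

def last_exported_date_modified(export_db_path: str, uuid: str) -> Optional[str]:
    """Return the photo's date_modified recorded at the last export, or None.

    Assumes a `photoinfo` table keyed by `uuid` whose `photoinfo` column holds
    the JSON dump of PhotoInfo; these names are illustrative, not the schema.
    """
    conn = sqlite3.connect(export_db_path)
    try:
        row = conn.execute(
            "SELECT photoinfo FROM photoinfo WHERE uuid = ?", (uuid,)
        ).fetchone()
    finally:
        conn.close()
    return json.loads(row[0]).get("date_modified") if row else None
```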

torarnv commented 3 weeks ago

Thanks @RhetTbull!