Mitigate invalid cross-device link errors

jcreixell commented 11 months ago

When downloading photos, a rename operation is used to move tmp files to their final destination. This operation won't work when soft links are used between devices, resulting in a Error: 18 Invalid cross-device link error.
Cross device links can be useful in different scenarios. One example is storing RAW files in a different device, using soft links and extension-based subfolder generation.
A possible mitigation is to try to use rename (fast) and fall back to a full file copy when such error is found. This provides a good tradeoff between performance and flexibility.
Solution based on https://bugs.launchpad.net/rapid/+bug/1815546/comments/5
Fixes https://github.com/damonlynch/rapid-photo-downloader/issues/100

damonlynch commented 11 months ago

Thanks for the pull request @jcreixell ! Very helpful considering the typing problem I currently have with my hands. (To be clear, the problem is not so much writing the code, because I can use voice recognition under Windows. The problem is testing the code, where I cannot use that solution).

Back to the pull request: it will be much better performance wise to write the temporary files into the same file system as the destination file. Short of doing a test write before commencing downloading, is there a way to programmatically determine where the ultimate destination is, and then use that to determine the optimal location of the temporary directory into which the temporary files will be downloaded? If my question is not clear, just let me know.

Please let me know if you need a high level overview of why the temporary files are used. In short, temporary files are necessary because the program can download from multiple devices simultaneously, and it is necessary to handle the possibility of conflicting filenames, i.e. more than one device has a file with the same filename, and they are both being downloaded from at the same time.

jcreixell commented 11 months ago

hi @damonlynch, thanks for the comment and sorry to hear about your hands.

I see what you mean, let me give it some thought. It could get a bit complicated because there could be multiple file systems mounted in the directory tree (even nested), so the tmp dir would need to be determined on a file to file basis (and I imagine, cleaned up, etc).

damonlynch commented 11 months ago

Specifically, it would need to be determined on a file extension basis, if and only if the file extension is used to generate subfolder names. Otherwise, then it should be a simpler: all photos will go to the same file system, regardless of file extension. And the same for videos, of course.

Edit: if the user is doing something unexpected like having some subfolders on one file system, in some subfolders on another, then that's another complexity to take into account. With edge cases that are a consequence of weird setups like this, It might be best to fallback to the code you have suggested in your initial pull request.

jcreixell commented 11 months ago

Yes, I think you are right. For context, my concrete example is as follows:

I use ZFS and have the following storage pools:

NAME              USED  AVAIL     REFER  MOUNTPOINT
zpool             735G  1.04T      511M  /zpool
zpool/Fotos       239G  1.04T      237G  /zpool/Fotos
zpool/Raw         356G  1.04T      356G  /zpool/Raw

I like to keep my RAW files separated from my JPG, since they take much more space. That way, I can manage backups and storage capacity independently for each format (with ZFS, it is possible to add additional devices to individual storage pools to increase their capacity).
I use a pattern similar to this one for importing my photos: <Extension (lowercase)>/<Image date (YYYY)>/<Image date (YYYY-MM-DD)>
I use soft links to be able to make the import work with RPD:

➜ Fotos ls -l
...
drwxr-xr-x ... jpg
drwxr-xr-x ... raf -> ../Raw/raf

Note that somebody else may want to do something similar, but using separate mount points by camera model, year, etc.

I did a bit of research and it should be possible to use something like os.path.ismount() to determine the mount point of a given download path. The path could be walked up recursively until a mount point is found. This would need to be done for every download path for photos being imported (could be cached for efficiency).

Once a mount point is found, we could try to use a temporary folder for that filesystem, but it becomes a bit tricky. For example, if a single filesystem is used, the mount point would be the root directory /, which is probably not writable (although /tmp might be). For other devices, the root directory might or might not be writable, and it is hard to determine where to write (unless manually configured).

All in all, I think even if we make it work, it would be quite complicated and potentially error prone. For this reason, I would be inclined to use the simpler solution for such advanced use cases, while keeping it super simple for most users.

It is very likely I am missing something, my understanding of how RPD works is quite superficial.

jcreixell commented 11 months ago

Btw I just realized that shutil.move already attempts to use rename and falls back to full copy for cross-device moves, so we just need to replace os.rename with shutil.move and that should be it. I can tweak the code for this PR tomorrow, in case we decide to move ahead with it

damonlynch commented 11 months ago

An high level overview of how the program works under the hood is found on the program website, here. I strongly recommend reading it.

A few additional points:

Think of a download from a device like a camera or memory card as like a fire hose with a large volume of water. Because the program can download from multiple devices simultaneously, think of more than one firehose operating at the same time. To link this back to the high-level overview, a copy files process is like a firehose.
Maybe you have seen the film Ghostbusters? If so, you will know that the streams should never cross. It's the same here. The firehose streams should never cross. They should never cross because each copy files process generates random destination filenames that are guaranteed to never clash with other destination filenames under any circumstances.
Once the firehoses streams have reached their mark, that is, the files have been downloaded into temporary folders owned by each stream, the file renaming process (of which there is only one) renames the files. It is assumed that the file is already located on the destination partition/volume/file system. This assumption is why you are seeing the bug.
This design is robust.

The problem with shutil.move is that it is doing a file copy from the same file system that the firehose is writing files to, as the files are being written. On hard drives, this can kill performance. Even on an SSD, it will prove notably detrimental. It may not make a lot of difference if you're only downloading 20 files or so. But photo lapse users can download tens of thousands or even hundreds of thousands of files at a time.

The solution is to identify which partition/volume/file system each file will end up in before the files are copied.

However, if we think about it for a minute, the design of the program dictates that the copy files process needs to know where it will copy the files to without knowing anything about file renaming and subfolder generation, because file renaming and subfolder generation is the step that occurs after the files are copied. Initially, this might seem like a daunting problem.

The good news is that Rapid Photo Downloader already has logic to generate subfolders before the files are downloaded. It does this to preview the download folders in the home/main tab of the program window. I don't know if you have noticed those black folders in the file system view. Those are the preview folders. Importantly, they are actually created on the file system, as actual subfolders. (This is due to a limitation in the Qt toolkit's file system browser that does not allow for virtual folders — it only displays real subfolders.)

The code that generates the preview folders is found here, here, and here. To make sense of this code, you need to refer back to the high-level overview.

The basic idea here is to use the folder generation preview code to identify either the st_dev, or perhaps a more abstract approach, and then use that to determine on which partition/volume/filesystem the temporary folder that the copy files process should use for that file (i.e., where the firehose should be directed).

This code should not actually be that complicated. The main tasks are to identify the location of the temporary folders for each download file (which is easy), and then track their use and eventual removal once the download is finished (which the program already does, of course, the code just needs to be added to).

damonlynch commented 9 months ago

Hello @jcreixell what do you think of the changes I have proposed? Are you interested in working on them?

jcreixell commented 9 months ago

Hi @damonlynch, I understand your proposal but unfortunately I don't have the bandwidth to implement it right now.

I also worry about increasing the overall complexity of the software to accommodate a niche use case. I am personally content with having cross-device symlinks work even if it involves a double copy. However I do not handle thousands of files per import and understand it could be a deal breaker for some users.

If you think a suboptimal solution is better than no solution for the time being (maybe with some warnings) feel free to merge it, otherwise feel free to close this PR.

And again thank you for your clarifications and context, hopefully this inspires somebody else to give it a shot.

damonlynch / rapid-photo-downloader

Mitigate invalid cross-device link errors #116