duplicati / duplicati

Store securely encrypted backups in the cloud!

Make the OSX file listing more intuitive and similar to how OSX works #2267

Open fjnorb opened 7 years ago

fjnorb commented 7 years ago

(Apologies if this is not the right place to post this.)

I am desperately trying to liberate myself from CrashPlan but all other services are missing features I need. It was suggested that I try Duplicati, which I am doing today, backing up to Amazon Drive. I am running the version offered on your website, 2.0.1.35_experimental_2016-12-13.

Installed on a Mac running MacOS X El Capitan 10.11.6, as a first test I tried to back up the entire boot drive ("Macintosh HD"), which is 250GB. On this same machine I have a 4TB secondary drive, a 5TB Time Machine drive, and was connected to a drive on my NAS.

After trying to understand how my file selection in the backup config would affect what was backed up, I selected only "/Volumes/Macintosh HD," because it seemed to catch the whole drive without including the actual "/Volumes" folder where anything else on the system would be mounted. I saved it and let it run.

By the time I had stopped it from "counting" files, it had found approximately 24TB of them, which seems impossible given my file selection. I checked what it was doing by running fs_usage in a terminal, and it appeared to be deep into my Time Machine drive, going through all the backups (which of course are mostly hard links, not actual files taking up actual space).

Duplicati seems like a good solution for me otherwise - cross platform, encryption support, local and WAN backups, leveraging cheap cloud storage - but if I can't understand what I am selecting to back up, then I won't be able to trust it. I hope this is just a mistake on my part and Duplicati could be smart about not going down rabbit holes or getting caught in symlink/hard link loops.

I would greatly appreciate any assistance anyone could provide for this issue. If this is indeed something that needs to be changed or fixed I can provide logs or any other information you'd like.

kenkendk commented 7 years ago

Symlinks should be dealt with by just storing the link target, but you can choose how to handle them with --symlink-policy.

Hardlinks are harder to deal with as there is no real indication of the fact that they are hardlinks. There is a check for keeping track of multiple hardlinks pointing to the same place, but it sounds like this was fooled by your setup.
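For illustration, one common way to notice hard links on a POSIX system is to compare the (device, inode) pair of each file. The following is a minimal Python sketch of that general technique, not Duplicati's actual check; the function name `find_hardlink_groups` is hypothetical.

```python
import os
import stat
from collections import defaultdict


def find_hardlink_groups(root):
    """Group regular files under root that share the same (device, inode)."""
    seen = defaultdict(list)
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # unreadable entry; skip it
            # st_nlink > 1 means at least one other name refers to this inode
            if stat.S_ISREG(st.st_mode) and st.st_nlink > 1:
                seen[(st.st_dev, st.st_ino)].append(path)
    # Keep only inodes that were reached through more than one path inside root
    return [sorted(paths) for paths in seen.values() if len(paths) > 1]
```

This only covers hard-linked files. Time Machine on HFS+ also hard-links whole directories, which a file-level check like this would not catch on its own.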

I am not sure where in /Volumes/Macintosh HD it finds TimeMachine.

Generally I recommend that you do not attempt to make a backup of the whole disk, for the simple reason that you cannot restore the whole disk to a meaningful state. Backing up an operating system requires more low-level disk access than what Duplicati supplies. Backing up the whole disk gives you a lot of data that is not useful and better restored by reinstalling the operating system.

Instead, I recommend only backing up /Users/username, as that is where files created by the user should reside. You may also want to exclude ~/Library as that contains many unwanted files.

fjnorb commented 7 years ago

Thanks for the reply.

Unfortunately, on your average power user's Mac, both /Library and ~/Library often contain occasional, hard-to-pinpoint files that can be unique to that computer and that user, and that contain unique, critical data, that can save a lot of work when rebuilding a system after a failure. It's largely inconsequential whether or not the Mac should save it there, in some circumstances it does. Similarly, if you use a CLI shell with any regularity, sometimes there are programs that can be restored simply by copying one binary back into place. But between MacPorts, Brew, Perl, Python, they could be in /bin, /usr/bin, /opt, or 40 other places.

Additionally, especially on Macs, when there's an update to an app (in /Applications; /Users/username/Applications is used so rarely it doesn't warrant discussion), and you update it, your current version of the app goes away, completely; and in some cases so does your ability to re-download that version of the app from the App Store. If the new version doesn't work correctly for whatever reason, "you should only back up /Users/username" is not a great answer.

These are examples of files you only need after they're gone, and you aren't necessarily notified of the time and location of their creation, deletion, etc. Pro audio and video software will store effect presets in ~/Library sometimes, and plugins tend to live there too. Unless I'm missing something super-obvious, I can't think of a safer option than to just back all of that stuff up and not be surprised about what I need later.

Please understand: I mean absolutely no disrespect. I do not interrupt comprehensive FOSS software development and demand it cater to my needs for fun. I am simply trying to evaluate Duplicati in an area of software that seems to be cost-heavy and feature-poor. Some FOSS software opens suggestions with open arms, some don't, and I don't know anything about this project. That being said, it surprises me that the first thing any user concerned about data security would do - just back up everything - would cause such unhandled problems, and that the solution is so manual.

It might be logical and rational in some people's minds to start with one folder, or trust that your computers are only putting files in obvious locations - "My Documents" for instance - but as I said above, oftentimes what affects you is the data you don't think to back up.

MacOS tends to use a Unix-style "one filesystem" paradigm at a low level, and then treat each disk/partition separately in the user interface. If you don't have many Mac users I can see how that might be overlooked. At the interface level, each Volume is treated separately - file operations on MacOS don't recurse into other filesystems. /Volumes is hidden from Finder.
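The "stay on one volume" behaviour described here can be sketched as a directory walk that refuses to cross onto another device, similar in spirit to `find -xdev`. This is a hedged Python illustration of the idea, not how Duplicati traverses files; `walk_one_filesystem` is a hypothetical name.

```python
import os


def walk_one_filesystem(root):
    """Yield directories under root, skipping any that live on another device."""
    # os.stat follows a symlinked root, so the device of the target is used
    root_dev = os.stat(root).st_dev
    for dirpath, dirnames, filenames in os.walk(root, topdown=True, followlinks=False):
        kept = []
        for name in dirnames:
            full = os.path.join(dirpath, name)
            try:
                # A mount point reports the device of the mounted filesystem
                if os.lstat(full).st_dev == root_dev:
                    kept.append(name)
            except OSError:
                pass  # vanished or unreadable; do not descend
        dirnames[:] = kept  # pruning in place stops os.walk from descending
        yield dirpath
```

With a walk like this, starting at the boot drive would list the mount points under /Volumes but never descend into a Time Machine or NAS volume mounted there.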

That being said, I thought perhaps there was an issue in how Duplicati was operating on the filesystem because I specifically selected an item in the list which, when expanded, didn't include /Volumes. Whether or not it's good practice to do so, Duplicati appears to be descending into a directory its user interface insists is not there. If a Time Machine volume is symlinked somewhere else in the filesystem, I would expect software that operates on the filesystem to be aware of that. Even CrashPlan, which I'm desperately trying to escape from, doesn't expose /Volumes, even if "show hidden files" is checked; though I'm sure that's because part of their business model depends on selling a license for each PC and exposing /Volumes would allow a user to back up network shares.

I sincerely appreciate any further discussion on this issue. Data backup is one of those things where I don't even want to be given the chance to be wrong about what I'm doing.

kenkendk commented 7 years ago

Thanks for sharing your thoughts. I am always interested in making Duplicati a better out-of-the-box experience, so user feedback like yours, where you see it "for the first time" is much appreciated.

I am a macOS user myself, so I am aware of many of the pitfalls you mention. You can actually get /Volumes in Finder, just press CMD+Up until you arrive at the root. It does hide TimeMachine, but everything else appears to be there.

I was not implying that it should be impossible to back up /Volumes/Macintosh HD, just that it would be a large overhead with many files that would have zero value (i.e. all the macOS system stuff). I tend to use only FOSS applications and stay clear of the AppStore on macOS, so I would not personally back up /Applications, but you can add both /Users and /Applications if you like.

All the stuff from the package managers should be easy to re-install, so I don't see why you would back that up. They also have versions, so you can easily pick an old version if you like.

As for ~/Library ... there is a lot of junk in there, caches, autogen files, etc. I checked my setup and found zero files that I would want in there, which is why I suggested to exclude it. If you use an application that stores important user content in there, do include the folder, but also: complain to the developer :).

For the original problem; I do not know what would cause a file traversal to end up in TimeMachine, but I would like to fix it; it could be something stupid in Duplicati.

If you run the following command on your system:

/Applications/Duplicati.app/Contents/MacOS/duplicati-cli test-filters "/Volumes/Macintosh HD"

You can see all the paths Duplicati encounters and it explains what it does when it encounters each path.

You can add filters to the command to see what you can exclude to make it work like you want: --exclude="/Volumes/Macintosh HD/junk/"

If you can figure out what makes it jump into TimeMachine, maybe I can fix it.

If you want to know more about filters: http://www.duplicati.com/articles/Filters/

fjnorb commented 7 years ago

Thanks for your good reply. I never know, commenting for the first time on a FOSS project, whether my input is desired. I'm grateful that it is.

You're definitely not the average test case on MacOS if you only use FOSS software. For Logic Pro X, for instance (I'm a musician), /Library/Application Support/Logic/ is over 8GB, and I don't have all the additional content installed. In circumstances other than total disk failure, if something there goes weird, it's a lot easier to restore from a local backup than to download it all from Apple again. For audio and video software, a lot of times temporary files you're working on get stored in /Library. A screen recorder I use, for instance, has a habit of crashing and deleting a 15GB capture file on the way out. Sometimes I can go back and get 90% of it back from the last backup.

Here's another reason I'm a big fan of the "back up everything" paradigm: The higher up I can specify what to back up, the less I have to remember to change my filters when a file or folder gets created in an unusual location. Not that I do this as a rule, but for example, if I select all the folders at the top level of /Volumes/Macintosh HD today, and then in a few weeks some misbehaving program puts a file or folder in the root of /Volumes/Macintosh HD (again, not saying that's normal practice for me, but a convenient illustration), Duplicati won't back that file or folder up, because it's not recursing into /Volumes/Macintosh HD, it's recursing into specified directories in /Volumes/Macintosh HD.

Anyway.

I think any software that uses disks on MacOS needs to work like all the other software on MacOS. Meaning, list out the volumes in /Volumes, and make them selectable, with checkboxes similar to the files and folders underneath, and then the filesystem has to be traversed in a way that mirrors how MacOS and most apps do it. This doesn't always align with the underlying filesystem. The assumption in the UI when selecting Macintosh HD is that /Volumes is never recursed if whole-disk operations are desired.

I'm not comfortable with doing a blanket exclude of symlinks either, because I don't know where every symlink exists in my system. If I lose a symlink that I discover I need, I probably don't have the information to know that it's a symlink, and probably think it's a file. Whether Duplicati stores a symlink there, or a copy of the file, I will be looking there for the file. I don't know anything about how to check what is or isn't a hard link (I understand there's a command you can run that shows how many files link to that filename; but it's probably not enough just to check if that value is < 2), but I have a feeling hard links are reserved in MacOS mostly for Time Machine.

Also, I tried running /Applications/Duplicati.app/Contents/MacOS/duplicati-cli test-filters "/Volumes/Macintosh HD" as you suggested. Unfortunately, this is what it did:

No certificates found, you can install some with one of these commands:
cert-sync /etc/ssl/certs/ca-certificates.crt #for Debian based systems
cert-sync /etc/pki/tls/certs/ca-bundle.crt #for RedHat derivatives
Read more: http://www.mono-project.com/docs/about-mono/releases/3.12.0/#cert-sync
Error reported while accessing file /Volumes/Macintosh
Failed to process path: /Volumes/Macintosh => Path doesn't exist!
Error reported while accessing file HD
Failed to process path: HD => Path doesn't exist!
Matched 0 files (0 bytes)

(Sorry, I don't know how to do a code block in Markdown.)

So there's something going on with how duplicati-cli is parsing paths with spaces. I tried the same command but with the path escaped (/Volumes/Macintosh\ HD and no quotes) and the output is the same.
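The error output is consistent with the path being split on whitespace somewhere between the shell and Duplicati. As a rough illustration only (this is not a diagnosis of the actual launcher), the Python sketch below shows how a correctly quoted command keeps the path as one argument, and how re-joining and re-splitting the arguments reproduces the two bogus paths.

```python
import shlex

command = 'duplicati-cli test-filters "/Volumes/Macintosh HD"'

# POSIX shell rules keep the quoted path as a single argument
quoted_args = shlex.split(command)

# Re-joining the arguments and splitting on whitespace loses the quoting,
# which yields the two separate "paths" seen in the error output
resplit_args = " ".join(quoted_args).split()

print(quoted_args)   # ['duplicati-cli', 'test-filters', '/Volumes/Macintosh HD']
print(resplit_args)  # ['duplicati-cli', 'test-filters', '/Volumes/Macintosh', 'HD']
```

If something like this is happening, neither quoting nor backslash-escaping on the command line would help, because the damage occurs after the shell has already parsed the arguments.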

As for the filters, I am using the WebUI, so I will figure out where to add that filter. At this point I'm sure I can exclude /Volumes and then selectively re-add what I need (my 4TB data drive, for instance). But yeah, for the first time user, I think we need to look at:

That seems to really be the only difference in setting up Duplicati and setting up one of the lackluster commercial solutions like CrashPlan.

I'll be away from home this weekend but I'll check in here for further suggestions. Thanks!

fjnorb commented 7 years ago

I have a little more information on this.

In the file select screen for MacOS X, the path /Volumes/Macintosh HD/Volumes does exist. I believe this was causing it to count Time Machine backups even when I didn't select the Time Machine volume in /Volumes. There's the recursive loop.

Now, this is my fault for missing it, but I'm still a big believer in making backup interfaces idiot-proof because an idiot like me is more likely to lose his data in the first place. A Volume selection interface like on Windows, where I can put a check next to C: but not D:, and an understanding that /Volumes is then ignored in favor of this interface paradigm, would be awesome.

I missed it, it turns out, because MacOS sorts alphabetically but does not take into account uppercase/lowercase. So in the list I was looking at, at /Volumes/Macintosh HD, /Volumes was the sixth or seventh directory listed out of about 20, with things like /bin, /dev, /etc, /net, and /opt below it. I was expecting it to list the folders truly alphabetically, regardless of case.

This is different from how Linux lists files, for instance. Linux takes case into account when sorting; MacOS doesn't.
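The two orderings in play can be compared with a short Python sketch. The directory names are a representative subset chosen for illustration, not the actual listing from my machine.

```python
names = ["etc", "Volumes", "bin", "Users", "opt",
         "Applications", "dev", "System", "net", "Library"]

# Plain code-point order: every uppercase name sorts before any lowercase one
code_point_order = sorted(names)

# Case-insensitive order: what a reader scanning for "V" near the end expects
case_insensitive_order = sorted(names, key=str.casefold)

print(code_point_order)
# ['Applications', 'Library', 'System', 'Users', 'Volumes', 'bin', 'dev', 'etc', 'net', 'opt']
print(case_insensitive_order)
# ['Applications', 'bin', 'dev', 'etc', 'Library', 'net', 'opt', 'System', 'Users', 'Volumes']
```

In the first ordering Volumes lands in the middle of the list, above bin, dev, etc, net, and opt, which matches where I found it in the file selection screen.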

Since Macintosh HD will always be mounted in /Volumes, the risk of not excluding it from the interface means that even average Mac users who only have one drive will get into a situation where Duplicati goes to /Volumes, finds the Macintosh HD again, and while parsing it, finds Volumes again, and Macintosh HD again, etc and never ends.

In the case of the Mac I originally posted about, selecting /Volumes/Macintosh HD entirely for backup and then unchecking /Volumes/Macintosh HD/Volumes seems to behave as expected. But I still argue that I wouldn't be in there if Duplicati had a provision to treat volumes like the rest of the GUI does. I'm okay now that I know this, but perhaps it trips up all new users on MacOS, so I'll advocate for it to be added to Duplicati.

Thanks for your patience and interest in my observations.

kheischer commented 7 years ago

As a suggestion, you should start with "/" because "/Volumes/Macintosh HD" is a symbolic link to "/".

Regarding the sorting difference between Linux and MacOS: in a standard MacOS installation, the root filesystem is case-insensitive. You can create an HFS+ filesystem that is case-sensitive. With such a filesystem, MacOS and Linux should behave similarly. But many programs in the AppStore do not work with, or have not been tested on, a case-sensitive HFS+ filesystem.

@kenkendk: for ignoring/detecting TimeMachine backups by default: maybe include a filter --exclude "*Backups.backupdb/" in the default settings, or produce a warning if one is detected.
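To show which folders a glob like this would catch, here is a small Python sketch that uses the standard `fnmatch` module as a stand-in. Duplicati's filter engine has its own matching rules, so treat this as an assumption-laden illustration; `is_excluded` and the sample paths are hypothetical.

```python
from fnmatch import fnmatchcase

PATTERN = "*Backups.backupdb/"


def is_excluded(folder_path, pattern=PATTERN):
    """Return True if a folder path (with trailing slash) matches the glob."""
    # In fnmatch, "*" matches any run of characters, including "/"
    return fnmatchcase(folder_path, pattern)


print(is_excluded("/Volumes/Time Machine/Backups.backupdb/"))  # True
print(is_excluded("/Users/me/Documents/"))                      # False
```

Matching the top-level Backups.backupdb folder is enough in a traversal that prunes excluded folders, since nothing beneath it would then be visited.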

fjnorb commented 7 years ago

Not what I'm talking about. We're discussing whether it would be better for new Duplicati users, to clarify what is being backed up, if Duplicati behaved according to the UI language for MacOS, which removes the concept of a /Volumes mountpoint from the GUI user and treats all mounts in /Volumes like Windows treats drives and partitions.

In the Windows version of Duplicati, for instance, there is a checkbox for C:. But in MacOS there is no checkbox for "Macintosh HD" or other drives. (I understand that Windows does not use a single-tree filesystem, but day-to-day users of MacOS don't use it as a single-tree filesystem.)

I am starting at /Volumes/Macintosh HD, not because I don't know it's a symlink to /, but because I want to back up the entire drive, and only that drive - whether or not that's your preferred backup strategy - for reasons I've outlined above. Since there is no checkbox for "/" like there is a checkbox on Windows for C:, I select /Volumes/Macintosh HD in order to have a similar UI experience.

Whether I check /Volumes/Macintosh HD or check all pre-existing files and folders in "/", I still have to exclude /Volumes/Time Machine or /Volumes/Macintosh HD/Time Machine.

Again, as an issue for new Duplicati users or non power-users, there is never a use case where it is sane to do anything contrary to the OS interface language, certainly not as extreme as expecting MacOS Duplicati users to reformat to a differently-configured HFS+ filesystem. That's analogous to working at public works and getting a call stating there is a tree branch in the road in front of someone, and you tell that person to simply sell their car and buy a helicopter instead of moving the tree branch out of the road. What we were talking about was only that I initially missed the existence of /Volumes in the file/folder dialog because it was not at the bottom of the list where I expected something starting with a "V" to be. On the GUI side, Finder does not sort like this. It sorts case-insensitively. Your "standard installation" of MacOS uses the Finder for the majority of things, and certainly anytime you see a dialog in MacOS, files are sorted case-insensitively. Unless you go to a Terminal and type "ls" at a bash prompt. I don't know why bash is different here; it does not do this on Linux so that makes me think it's using an OS call that is just weird.

Incidentally, MacOS's "standard installation" and default HFS+ is a case-INSENSITIVE file system. It just happens that some system call for listing files and folders returns them sorted in a case-sensitive manner. If you erase a disk in Disk Utility, the default is "OS X Extended (Journaled)," and the next option in the list is "OS X Extended (Case-sensitive, Journaled)." And from what I can tell, less than nobody uses it.

Now, I've obviously figured all of this out, so this isn't a case of the popular stereotype about Mac users being stupid. But I'm also a big advocate of making backup software idiot-proof. Our human proclivity to make mistakes is why we keep backups, so I feel strongly about trying to catch as many preventable mistakes as possible within backup software. That's what the discussion is about.

People start FOSS projects like Duplicati for a lot of reasons, and they're typically not capitalistic ones. I'm coming from a place where I need all the features that CrashPlan provides, but their product is such a piece of garbage I want to leave. So I have been looking for a solution that provides all the things I use CrashPlan for, but is not a piece of garbage like CrashPlan is. And I haven't found one besides Duplicati. Backblaze, for instance, does not have Linux support, nor local backup support. (Since my local backups are to a Linux NAS, Duplicati has enough "local backup" support because it will send backups to an SCP/FTP target.) That's where I'm coming from. But I know of many, many others who desperately want to get away from CrashPlan, and I want to tell them about Duplicati. But many of them are not as tech-savvy as me, and don't respond well to complicated and tech-heavy instructions. When you install CrashPlan, for instance, it automatically selects your user directory for backing up and starts backing up! (Granted, to their cloud, so there's no exact analog in Duplicati.) But there is a bigger market than you'd think for a CrashPlan replacement. If FOSS software is to compete with it, it's my feeling that a good, easy to understand user interface should be a priority.

What I haven't had the opportunity to say yet is that Duplicati's interface is good and easy to understand. It's just this one thing related to the differences in the backup selection dialog that I would love to see fixed. That's all.

kenkendk commented 7 years ago

I have changed the title based on how the discussion evolved.

EricTheRed1 commented 7 years ago

I too have encountered this exact same issue. The "loop" in backing up a volume on the Mac caused me to not use Duplicati years ago. Now that I have seen this happen in other backup programs, I have learned how to "fix" this issue and not have it endlessly scan.

More novice users will be confused by this; especially in selecting a non-system volume to back up.

fjnorb commented 7 years ago

I feel like I have to say this to everyone who replies, which is sad. Language barrier perhaps? I always knew HOW to fix it; I shouldn't have HAD to.

After clearing this up, Duplicati ran for a few days and then started throwing generic errors with no further explanation. Fearing I'd be made to feel stupid about that too, I discontinued using it and moved on to something else.
