macOS: borg mount problem with unicode normalisation

aljungberg commented 5 years ago

Have you checked borgbackup docs, FAQ, and open Github issues?

Yes

Is this a BUG / ISSUE report or a QUESTION?

BUG

System information. For client/server mode post info for both machines.

Your borg version (borg -V).

borg 1.1.10

Operating system (distribution) and version.

macos 10.14.6

Hardware / network configuration, and filesystems used.

MBP, Mac OS Extended

How much data is handled by borg?

1.5 TB

Full borg commandline that lead to the problem (leave away excludes and passwords)

borg init -e none test.borg
mkdir test
filename=`printf 'la\xcc\x88.txt'`
echo "hello" >test/$filename
borg create test.borg::a test
mkdir test2
borg mount test.borg::a test2
ls test
# lä.txt
ls test2/test
# lä.txt
cat test/lä.txt
# hello
cat test2/test/lä.txt
# cat: test2/test/lä.txt: No such file or directory

So there is a file we can list but not open. I have the same problem with a real archive.

Thoughts

So the problem here is likely related to https://code.google.com/archive/p/macfuse/issues/139#c2: a mixup between precomposed and decomposed UTF encodings.

In fuse.py, in lookup, we have this:

inode = self.contents[parent_inode].get(name)

Here we get name encoded as b'l\xc3\xa4.txt' (precomposed). But in the archive, in self.contents[parent_inode] we have {b'la\xcc\x88.txt': 1000041} (decomposed). Both forms are technically equivalent:

>>> os.fsdecode(b'l\xc3\xa4.txt')
'lä.txt'
>>> os.fsdecode(b'la\xcc\x88.txt')
'lä.txt'
>>> unicodedata.normalize("NFD", os.fsdecode(b'l\xc3\xa4.txt')) == unicodedata.normalize("NFD", os.fsdecode(b'la\xcc\x88.txt'))
True

I would have liked to submit a patch but I honestly don't know the correct way to deal with this. If I add name = os.fsencode(unicodedata.normalize("NFD", os.fsdecode(name))) in lookup, it works for this particular error. But it would break if the archive used NFC instead.

We could also normalize the encodings before writing them to self.contents. This would work in both cases. But what encoding is supposed to be used to begin with?

ThomasWaldmann commented 4 years ago

Collecting some facts / questions (correct me in case I am wrong):

on macOS HFS+, file names are NFD-normalized
on Linux filesystems, it could be anything, but NFC is the usual thing
borg currently just archives whatever byte representation it gets from the fs and does not do any normalization
when typing some special char on macOS terminal it is NFC or NFC normalized or can it be both?
when typing some special char on Linux terminal it is NFC or NFC normalized or can it be both?

ThomasWaldmann commented 4 years ago

Hmm, guess on macOS, typing stuff into terminal creates NFC and in the archive we have NFD, so there is no match.

But if one would copy and paste the filename from ls output to the cat command, it should work, right? @aljungberg

Also, using some gui filemanager which just uses the same filename for opening as it shows in a directory listing should also work, I guess.

ThomasWaldmann commented 4 years ago

Considering unicode not normalized vs. 2 different forms NFC / NFD of normalization:

If we:

don't know what we have in the archive (could have been made on macOS, on Linux, on ...)
don't know what we get (e.g. due to being typed in, on macOS, on Linux, on ...)

Then I only see 2 ways:

just expect / demand same byte representation (== not borg's fault if you do it wrong)
do a triple lookup:
- first try: use whatever byte representation we get
- miss: normalize to NFC, lookup again,
- miss: normalize to NFD, lookup again.

The triple lookup would only work easily for FUSE mount. For borg extract and pattern matching, I guess it would be quite a pain.

aljungberg commented 4 years ago

But if one would copy and paste the filename from ls output to the cat command, it should work, right?

Correct. And like you say you can view the file in Finder too. But it doesn't work with autocomplete in the shell funnily enough, so it's by no means universal. Some tools are broken by the difference.

just expect / demand same byte representation (== not borg's fault if you do it wrong)

As I understand it on the Mac, HFS+ requires NFD normalisation. So if you extract a borg archive on a Mac and then archive it again, it'll all turn into NFD even if you wanted to standardise on NFC. So it's not within the user's control exactly.

I created the original archive upon which I discovered this error on a Mac as well, but it was with attic, I think. No idea how it ended up with some NFC files. But it does seem possible to get this on a single platform if you're a little unlucky.

I think the easiest way to resolve this is to pick one normalisation scheme and then use it everywhere. Ideally even when creating the archive, but since that train has left the station now I guess, we could just do it in the FUSE layer.

I don't see any immediate problem with your solution. Looks reasonable.

But you could also just normalise ahead of time.

So let's say we pick NFC. Then we can just do the equivalent of self.contents[parent_inode] = {unicodedata.normalize("NFC", k): v for k, v in self.contents[parent_inode].items() (except we'd never do that, just do it right away when building the dict). Now when we want to look up an inode we call self.contents[parent_inode].get(unicodedata.normalize("NFC", name)).

This will work except if you have two files with the same name in the same folder, differing only in their encoding. But that situation is caused by not normalising the names to begin with when creating the archive and it's too late to try to fix it when accessing the files. It's a bit of a pathological situation, not sure what would even happen if you tried to extract such an archive on a Mac.

pgerber commented 4 years ago

I'd be very careful changing the encoding in any way. I've seen mixed encoding (some using NFC and some NFD) when the files where created on Windows in particular. I'm not sure why exactly, could be that some tools just encoding one way or the other. I'm sure changing the encoding while storing or restoring the backups would break some of our systems. In particular, we have some customer-supplied HTML files containing things like \ encoded either as NFC or NFD. Changing the file name will break this reference. Also, I'd not be surprised if some other applications could no longer find files after the encoding changed.

pgerber commented 4 years ago

I just remembered, I've also seen NFD encoded file names on Linux when the file wasn't copied from another platform. This can happen when you copy and paste the parts of the file name. For instance, when you copy it from a web page encoded in NFD.

pgerber commented 4 years ago

I tried to figure out how others dealt with the situation and the more I research the more confusing the station gets. If I understand this blog post correctly the NFD normalisation only happens with the HFS+ filesystem but not with the apparently newer APFS. Also, Python appears to do some weird transcoding of filenames, at least on Macs.

ThomasWaldmann commented 3 years ago

I had (again) a look into this (I have a macbook air M1 now, so that motivated me). My current understanding is:

it's the HFS+ NFD normalisation that causes the trouble, the more modern and nowadays used APFS does not do that.
cli argument file names (when command is typed in), python console input file names (when typed in), file name seen on APFS via os.listdir are all in sync, no problem there.
borg, as a backup tool, is expected to exactly reproduce any names given to it (references to files via their names could break otherwise [and borg does not control all references, it's not just about symlinks, but could be also a file name inside a database, for example])

History repeats somehow every time some system automagically tries to "fix" or "deal with" some character representation.

For example, on windows filesystems like FAT or NTFS, ages ago somebody thought case-insensitive matching was a good idea. In the end, this only caused lots of trouble and just not doing that would have been way easier and less error prone. "A" is just something different than "a".

I see the NFC vs NFD as a similar, just more evil case: somebody thought HFS+ auto-normalising to NFD was a good idea, but in the end, it's just causing lots of troubles (esp. since everybody else uses NFC). Like "A" <> "a" (see above), here we have NFD(ä) <> NFC(ä) and, because A<>a was way too easy and visibly different, we also have NFD(ä) looking like NFC(ä) [both looks like "ä"] to make things more interesting. Haha, gotcha!

So I tend to close this with "wontfix" because there is nothing reasonable to do here. borg needs to preserve things precisely and also expects the user to be precise (when looking up things).

ThomasWaldmann commented 3 years ago

To show the issue with the original example from top post:

>>> import os
>>> s = os.fsdecode(b'l\xc3\xa4.txt')
>>> S = os.fsdecode(b'la\xcc\x88.txt')

>>> s
'lä.txt'
>>> S
'lä.txt'

# ^^^ so both s and S **look like** they are the same!

>>> s == S
False

# ^^^ but they are **NOT THE SAME**

>>> s.encode().hex()
'6cc3a42e747874'
>>> S.encode().hex()
'6c61cc882e747874'

# ^^^ the difference gets visible in the hex representation of s and S. s is NFC and S is NFD.

ThomasWaldmann commented 3 years ago

BTW, this is not saying there is no practical problem, it's just saying we should not try to "solve" it in borg.

Workarounds:

borg mount: use a file manager (like midnight commander mc [console] or macOS finder [GUI]): it will use the actual file name representation (NFC or NFD) to access a file.
borg extract: use borg list REPO::ARCHIVE to get the actual representation of a file/folder name and give that to borg extract.

In general: just typing in a name that looks like the same might not be good enough

textshell commented 3 years ago

According to unicode borg behaves wrong (as unicode expects applications to respect canonical equivalance). But on the other hand borg never claims unicode conformance. But maybe it could be useful to have explicit options for renormalize when doing matching.

For the extraction case there are external tools for reencoding filenames. So borg can reasonable suggest users to use such tools to postprocess extracted directories, but for matching that would be to late, so matching is where there is a gap in capability so matching would be where borg could offer value.

Worst case could be 2 filenames looking the same, but not being the same. so a match would match both and extract them using the unprocessed filename (as stored in the archive).

For fuse things are a bit more complicated and likely a byte exact match should be done before a renormalited match. I think the usual "extract via fuse" methods would not ask for file names that are not in the archives. So only manually typed file names would generate different file names than what is in the archive (which should be ok when the user explicitly opted in).

ThomasWaldmann commented 3 years ago

Reopening so the suggestion of @textshell can be worked on, thanks for the feedback!

borgbackup / borg