cryptomator / cryptofs

Java Filesystem Provider with integrated encryption
GNU Affero General Public License v3.0

Increase threshold before long filenames get inflated #60

Closed infeo closed 5 years ago

infeo commented 5 years ago

Version: 1.8.6

Short Description

Increase the performance and compatibility of cryptofs by raising the threshold at which ciphertext filenames are shortened.

Description

To stay compatible with certain operating systems (e.g. Windows), cryptofs shortens the names of ciphertext files if their base32 encoding exceeds a certain threshold. The current threshold is defined at https://github.com/cryptomator/cryptofs/blob/851f44090db4b068c7ac7fe27adcecd4c32767e5/src/main/java/org/cryptomator/cryptofs/Constants.java#L18 .

If the threshold is increased, filename shortening occurs less often, which makes cryptofs more robust against race conditions, improves performance through direct file access, and makes it more compatible with certain cloud syncing software (e.g. Google Backup and Sync).

The suggested new value is 254, computed as follows:

  1. 248 (a multiple of 8, as required by the base32 encoding, that stays below the Windows MAX_PATH limit of 260)
  2. +2 (additional prefix indicating symbolic links or directories)
  3. +4 (file extension)

overheadhunter commented 5 years ago

For the record: the 4-char file extension is merely reserved for future use, so that we don't need to migrate filenames if we decide to add an extension (see #54).

overheadhunter commented 5 years ago

For the record: 248 base32 chars give us a maximum ciphertext length of 155 bytes, consisting of a 16-byte IV and 139 bytes of payload.

The maximum cleartext filename before shortening happens (⎣ 248 * ⅝ ⎦ - 16 = 139 bytes) is this long:

e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855e3b0c44.txt

The previous maximum cleartext filename length was ⎣ 129 * ⅝ ⎦ - 16 = 64, i.e. not even half as long:

e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852.txt
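
The two limits above can be reproduced with a few lines of Java. This is only a sketch of the arithmetic from this comment; the class and method names are made up for illustration and are not part of cryptofs.

```java
// Sketch only: names are hypothetical, not taken from cryptofs.
class Base32NameBudget {

    // Size of the IV that is stored in front of the encrypted name (from this thread).
    static final int IV_BYTES = 16;

    // Base32 packs 5 bytes into 8 characters; incomplete groups are rounded down.
    static int ciphertextBytes(int base32Chars) {
        return base32Chars * 5 / 8;
    }

    // Bytes that remain for the cleartext name once the IV is subtracted.
    static int maxCleartextBytes(int base32Chars) {
        return ciphertextBytes(base32Chars) - IV_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(maxCleartextBytes(248)); // 139, proposed limit
        System.out.println(maxCleartextBytes(129)); // 64, previous limit
    }
}
```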


When looking at filename length distribution functions (such as on page 26 in this paper), we can clearly see that short filenames are far more common than long ones. In the examined "PDL-Home", for example, only a fraction of 0.99 of all filenames was shorter than 63 chars, whereas 0.9999 were shorter than 135 chars.

While I have no clue what kind of files are stored on PDL-Home, this was the "worst case scenario" in the study, and the general idea seems reasonable: the number of files declines drastically with the length of their names.

What does this mean for us? By doubling the threshold we have reduced the likelihood of name shortening happening by 2-3 orders of magnitude.
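
As a rough cross-check of that estimate, the PDL-Home fractions quoted above can be turned into orders of magnitude. The sketch below assumes that the 63 and 135 char marks from the paper are close enough to the old and new limits to stand in for them; the class and method names are hypothetical.

```java
// Sketch only: uses the PDL-Home fractions quoted in this comment as approximations.
class ShorteningLikelihood {

    // Fraction of names that would be shortened, given the fraction below the limit.
    static double fractionShortened(double fractionBelowLimit) {
        return 1.0 - fractionBelowLimit;
    }

    // How many powers of ten lie between the two fractions.
    static double ordersOfMagnitude(double before, double after) {
        return Math.log10(before / after);
    }

    public static void main(String[] args) {
        double before = fractionShortened(0.99);   // roughly 1 in 100 names
        double after = fractionShortened(0.9999);  // roughly 1 in 10,000 names
        System.out.printf("%.1f orders of magnitude%n", ordersOfMagnitude(before, after));
    }
}
```

For PDL-Home this comes out at about two orders of magnitude, the lower end of the range given above.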

overheadhunter commented 5 years ago

Filenames must be encrypted deterministically, otherwise we break directory listings. By changing the shortening threshold we break this rule.

Therefore this change requires a new vault format version, otherwise we'd break compatibility with other applications (such as our mobile apps or @iterate-ch's Cyberduck and Mountain Duck).

For this reason we defer this issue to a different minor version.
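
To illustrate why the threshold is part of the format, here is a minimal sketch. It does not use the cryptofs API: `storedName`, the hex-encoded SHA-1 digest and the `.short` extension are hypothetical stand-ins for whatever the real shortening scheme does. The point is only that the same ciphertext name maps to different on-disk names under different thresholds, so two applications that disagree on the threshold cannot find each other's files.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch only: a hypothetical stand-in for name shortening, not the cryptofs implementation.
class ShorteningThresholdDemo {

    // Returns the name as stored on disk: unchanged if it fits the threshold,
    // otherwise replaced by a deterministic digest plus a marker extension.
    static String storedName(String ciphertextName, int threshold) {
        if (ciphertextName.length() <= threshold) {
            return ciphertextName;
        }
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(ciphertextName.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.append(".short").toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 should be available on every Java platform", e);
        }
    }

    public static void main(String[] args) {
        String name = "A".repeat(200); // placeholder for a 200-char encoded ciphertext name
        System.out.println(storedName(name, 129)); // shortened under the old threshold
        System.out.println(storedName(name, 254)); // kept as-is under the proposed threshold
    }
}
```

Each threshold on its own is deterministic, but an application using 129 would look for the digest name while one using 254 would look for the full name.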

overheadhunter commented 5 years ago

The maximum filename length on many file systems and/or clouds is 255. But: we need to leave some space for a cloud service to add a suffix such as (conflict 2018-08-21 00-22-09). Let's say we want to reserve 35 chars.

This leaves us with 220 usable chars for prefix (2 chars), encoded ciphertext (216 chars) and extension (4 chars). 216 chars in base32 encoding is equivalent to 135 bytes of ciphertext. Subtracting the IV this gives us 119 cleartext chars.


If we have to migrate every filename anyway (see #64), we might want to switch to base64. With the same 220 usable chars our 216 encoded ciphertext chars are now equivalent to 162 ciphertext bytes or 146 cleartext bytes.
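
The base64 figure can be checked against `java.util.Base64`, and the base32 figure with the usual 5-bytes-to-8-chars rule. This is a sketch under assumptions: the URL-safe alphabet is assumed here, the payload is a zero-filled placeholder instead of real ciphertext, and the class and method names are hypothetical.

```java
import java.util.Base64;

// Sketch only: compares encoded name lengths, not actual cryptofs output.
class Base64NameBudget {

    // Size of the IV that is stored in front of the encrypted name (from this thread).
    static final int IV_BYTES = 16;

    // Length of the base64 encoding of IV + encrypted name (URL-safe alphabet assumed).
    static int base64Length(int cleartextBytes) {
        byte[] ciphertext = new byte[IV_BYTES + cleartextBytes]; // zero-filled placeholder
        return Base64.getUrlEncoder().encodeToString(ciphertext).length();
    }

    // Length of the padded base32 encoding: every started 5-byte group becomes 8 chars.
    static int base32Length(int cleartextBytes) {
        int ciphertextBytes = IV_BYTES + cleartextBytes;
        return (ciphertextBytes + 4) / 5 * 8;
    }

    public static void main(String[] args) {
        System.out.println(base32Length(119)); // 216
        System.out.println(base64Length(146)); // 216
        System.out.println(base32Length(146)); // 264, the same name in base32
    }
}
```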

On the aforementioned PDL-Home this reduces the likelihood of name shortening even further: only 0.001% of all files would have been affected by shortening.

overheadhunter commented 5 years ago

We asked our users about their average name length here. The results are published here.

overheadhunter commented 5 years ago

Syncing with OneDrive ✅

  1. Created new vault "format7" inside OneDrive on Windows 10 1803
  2. Created new file with 146 chars filename, resulting in a 220 char ciphertext name. Full path is C:\Users\Sebastian\OneDrive\format7\d\IX\LPPQAPWUNCQTWROJY734VIT7YTQEGM\xr3BnomfWSZLqb13EBi1zwXu34xDwQQtPTkqmSYvtBW6Qg9ae4FIDHP7ByJFdKSJfqwFfWiUojKIlHxCwD8a5U6yojKfAPftXWiAYIo9dQthCC16M3uxkIzaPrDET6-2yuHCX8gECd0LdbMC-qDi1LOxj4koqQbfsAGnhoQ6_SOgKdn3dQWaInAo1AUx7aMP5soJ0Xai1OKpykak3vgB4QK7.c9r.
  3. Synced from Windows using OneDrive 19.152.0801.0008 to macOS using OneDrive 19.152.0801.0007