chrissimpkins / crypto

Simple symmetric GPG file encryption and decryption
http://chrissimpkins.github.io/crypto
MIT License

File name encryption #2

Open chrisidefix opened 9 years ago

chrisidefix commented 9 years ago

An important aspect of encrypting files before syncing them to cloud storage is file name encryption. It would be great to see an option to encrypt/decrypt file names.

This might not be terribly important for generic image file names (IMG_00123.jpg doesn't really reveal a lot), but it becomes a lot more interesting when encrypting personal documents. File names can already reveal a lot of information and often already contain personal information (e.g. BankXYZStatement.pdf tells me that you are probably using Bank XYZ).

Of course, it also makes it difficult to find files again: if I have 30 phone bills in a folder, I won't know which one was for March. That's why it would be great to be able to switch file-name encryption ON or OFF with another command-line option (e.g. --filename).

chrissimpkins commented 9 years ago

Interesting idea, though this would cost the user valuable information. Decrypting every file in the directory to find the one you need can take a significant amount of time, depending on the number and size of the files, not to mention that you would then need to open and review each one. We could map the filename to a hash digest of the file contents, with the user keeping that mapping locally. If the contents don't change, it would be possible to get the original filename back when the file is decrypted. Let me give this some thought.
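
The digest idea above could be sketched roughly as follows. This is only an illustration and not code from crypto: the JSON index format and the helper names (`remember_filename`, `lookup_filename`) are assumptions made up for this example, and SHA-256 is just one possible choice of digest.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def remember_filename(path, index_path):
    """Record digest -> original filename in a local JSON index (hypothetical format)."""
    index_file = Path(index_path)
    index = json.loads(index_file.read_text()) if index_file.exists() else {}
    digest = sha256_of_file(path)
    index[digest] = Path(path).name
    index_file.write_text(json.dumps(index, indent=2))
    return digest


def lookup_filename(decrypted_path, index_path):
    """Return the original filename for a decrypted file, or None if it is unknown."""
    index = json.loads(Path(index_path).read_text())
    return index.get(sha256_of_file(decrypted_path))
```

The lookup only works if the decrypted contents are byte-for-byte identical to the original, and the index itself reveals the filenames, so it would have to stay on the local machine.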

chrisidefix commented 9 years ago

Yes, if you don't know which file is which, it is of course not very user-friendly. Here are some thoughts I had regarding this:

1 - Ideally, the file name could be encrypted with the same passphrase as the file and thus be decrypted just as easily. This, unfortunately, doesn't work, as the encryption creates lots of characters you usually don't want in file names (e.g. '/') and would also make the file name unusably long (many cloud storage providers limit the file name length). So, unless there is a robust encryption algorithm that guarantees these requirements, it's probably not going to happen.

2 - Use some random UUID (e.g. uuid.uuid4() in Python) to generate new filenames. You need to find a way to map these to the original file names. Right now I just store the mapping of each UUID to the original file name in a separate file. Of course this means you will need the second file to figure out which file is which. You could then just encrypt this (very small) file with the same passcode for each encrypted file or create a general "database" file containing all mappings for a directory.
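
A minimal sketch of the mapping-file approach from point 2, assuming a single JSON "database" per directory. The function names and the JSON layout are hypothetical and only meant to show the idea; the mapping file would still need to be encrypted or kept off the cloud storage.

```python
import json
import uuid
from pathlib import Path


def anonymize(path, mapping_path):
    """Rename a file to a random UUID and record the mapping in a JSON file."""
    source = Path(path)
    mapping_file = Path(mapping_path)
    mapping = json.loads(mapping_file.read_text()) if mapping_file.exists() else {}
    new_name = str(uuid.uuid4())
    mapping[new_name] = source.name
    # Save the mapping before renaming, so the original name is never lost.
    mapping_file.write_text(json.dumps(mapping, indent=2))
    source.rename(source.with_name(new_name))
    return new_name


def restore(directory, mapping_path):
    """Rename every known UUID-named file in `directory` back to its original name."""
    mapping = json.loads(Path(mapping_path).read_text())
    restored = []
    for new_name, original in mapping.items():
        candidate = Path(directory) / new_name
        if candidate.exists():
            candidate.rename(candidate.with_name(original))
            restored.append(original)
    return restored
```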

To be sure you won't lose all your file names, even if you misplace this mapping file, I wrap all files inside a tarball (uncompressed) before encryption. This way I just have to unpack the archive and don't have to rename any files.

For encryption it basically looks like this (tarfile then crypto):

SecretFileName.pdf -> 1234-5678-abcd.tar -> 1234-5678-abcd.tar.crypt

This way the file name (as well as creation time, etc.) of the original has never been changed and will be restored during decryption and unpacking, which is great if you want to keep your image creation dates.
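
The tar step could look something like the sketch below, using Python's standard tarfile module. The helper names are made up for illustration, the encryption step itself is left out, and note that tar records the modification time of the member rather than a separate creation time.

```python
import os
import tarfile
import uuid


def wrap_in_tar(path, out_dir):
    """Store one file in an uncompressed tar archive named after a random UUID."""
    archive = os.path.join(out_dir, str(uuid.uuid4()) + ".tar")
    # Mode "w" writes a plain tar archive without compression.
    with tarfile.open(archive, "w") as tar:
        # arcname keeps only the base name, so no directory path leaks into the archive.
        tar.add(path, arcname=os.path.basename(path))
    return archive


def unwrap_tar(archive, out_dir):
    """Extract the archive and return the member names that were restored."""
    with tarfile.open(archive, "r") as tar:
        names = tar.getnames()
        # Only extract archives you created yourself; tar members can carry unsafe paths.
        tar.extractall(out_dir)
    return names
```

After decryption, calling `unwrap_tar` on the resulting .tar file brings back the file under its original name, with no mapping file needed.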

One last note - Boxcryptor supports file name encryption, but I have no idea how they implemented it.

chrissimpkins commented 9 years ago

thanks for these suggestions. I will toss this around over the course of the week and see what I come up with. really appreciate it.

-C

chrisidefix commented 9 years ago

I just realized that there already are command-line options for gpg to encrypt / restore file names: https://www.gnupg.org/gph/en/manual/r1172.html

--set-filename STRING to store a chosen filename encrypted inside a message

--use-embedded-filename to extract the original filename for decryption (likely should be used with --overwrite enabled)
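
For illustration, here is roughly how those two gpg options could be put on a command line from Python. This is a sketch only: the helper names are hypothetical, the output name is simply passed in, and how crypto itself invokes gpg is not shown in this thread. The decryption command deliberately names no explicit gpg command, on the assumption that gpg's default action on an encrypted file is to decrypt it.

```python
import os


def build_encrypt_command(path, output_path):
    """Build a gpg argv that symmetrically encrypts `path` and embeds its base name."""
    return [
        "gpg",
        "--set-filename", os.path.basename(path),  # name stored inside the message
        "--output", output_path,                   # name of the encrypted file on disk
        "--symmetric",
        path,
    ]


def build_decrypt_command(encrypted_path):
    """Build a gpg argv that decrypts to the filename embedded in the message."""
    return ["gpg", "--use-embedded-filename", encrypted_path]
```

Either list could then be run with something like `subprocess.run(cmd, check=True)`, with gpg prompting for the passphrase. Since --use-embedded-filename may create or overwrite files in the working directory, it should be used with care.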

The remaining question is what to name the encrypted output file. Should this be chosen by the user, or (as I suggested earlier) should we simply pick a random UUID?

chrissimpkins commented 9 years ago

If I understand this correctly it embeds the real filename in the encrypted message. If so, this is perfect and eliminates the need for users to maintain a key to the encrypted files that have filename anonymity.

I like the UUID naming approach. Could use uuid.uuid4() with the uuid.hex property (removes the dash chars).

import uuid

filename = uuid.uuid4().hex

Next question is whether we eliminate the original file extensions as well. I would favor eliminating them, provided they are embedded along with the base filename in the encrypted file, and leaving the file with only the .crypt extension. This would generate anonymous filenames like this:

f25af43f1df04ed293e022f2b44ac942.crypt

chrissimpkins commented 9 years ago

The --tar switch code is now merged into the development branch. I am going to add unpacking of tar files on the decryption side so that this is automated for the tar files that are created with this new approach.

If you decide to work on the filename code, please rebase on the updated development branch. I made some very minor modifications. This looks great. Thanks again for this contribution.

Leaving this issue open as we work on the filename anonymizer feature.

Task list for this release is in Issue #5.

chrisidefix commented 9 years ago

Yes, I started some implementation on the filename encryption. Using the uuid.hex option is a good idea, and I would remove the file name extensions as well, since they are stored with the filename embedded in the encrypted message.

chrissimpkins commented 9 years ago

Bumped this and the passphrase issue to v1.5.0. We'll push the new tar/untar features and some other minor code refactoring in v1.4.0. Task list for v1.5.0 is in Issue #12.