elasticdog / transcrypt

transparently encrypt files within a git repository
MIT License
1.43k stars 102 forks source link

Handle quotes and colon in filenames #173

Open andreineculau opened 9 months ago

andreineculau commented 9 months ago

Somewhere we don't escape filenames properly, ending up with errors like

fatal: ambiguous argument ':"Workspace/ETL/Jobb/Generalist Marketplace/ismdou - Know the identifiers of the table \"FOO\".py"': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'
fatal: path 'Workspace/ETL/Jobb/Generalist Marketplace/ismdou - Service Finder Analysis (stakeholder ' does not exist (neither on disk nor in the index)
fatal: path 'Workspace/Users/username@example.com/Accelerators/Wide&Deep Recommender/MR 01' does not exist (neither on disk nor in the index)
fatal: path 'Workspace/Users/username@example.com/Accelerators/Wide&Deep Recommender/MR 02' does not exist (neither on disk nor in the index)
fatal: path 'Workspace/Users/username@example.com/Accelerators/Wide&Deep Recommender/MR 03' does not exist (neither on disk nor in the index)

The first fatal is due to double quotes in the filename. The rest are due to colon in the filename (not visible, but right after the path in the error message; filename is truncated in the error message).

PS: The worst of it is that the commit goes through, so you end up with a commit that should have encrypted files, but instead files are in plain text.

jmurty commented 8 months ago

Hi, I have done some testing and the problem is triggered by the presence of quote characters (") in filenames, how these quotes are output by Git commands and then fed into follow-on commands within Transcrypt.

At multiple places in the Transcrypt script we use a command like this to find encrypted files (tracked files to which the crypt filter is applied):

git -c core.quotePath=false ls-files | git -c core.quotePath=false check-attr --stdin filter | awk 'BEGIN { FS = ":" }; /crypt$/{ print $1 }'

These commands try to avoid problems with special/unusual characters in filenames by disabling Git's core.quotePath config setting, but unfortunately per the documentation:

Double-quotes, backslash and control characters are always escaped regardless of the setting of this variable.

Running the above command on a repo containing files with quote characters produces output like this:

"ismdou - Know the identifiers of the table \"FOO\".py.secret"
sensitive_file

It is because the ismdou… file gets quoted and quote-escaped in this way that follow-up commands fail.

Specifically for your reported issue, the pre-commit hook is failing because it's passing the quoted file path/name to a command git show :"${secret_file}" to determine whether the file is properly encrypted, but this gets interpreted as the following which doesn't work: git show :'"ismdou - Know the identifiers of the table \"FOO\".py.secret"'

Although the pre-commit hook is the culprit it this case, the same problem manifests for the --list command (shows over-quoted output) and the --show-raw command (will not work if you name the file, and also fails if you use the wildcard --show-raw=* option):

./transcrypt --show-raw=*
==> "ismdou - Know the identifiers of the table \"FOO\".py.secret" <==
fatal: path '"ismdou - Know the identifiers of the table \"FOO\".py.secret"' does not exist in 'HEAD'

So I can tell what the problem is, but how to fix it is less clear. We need to do one of:

I have experimented with using the -z option instead of -c core.quotePath=false to identify encrypted files via ls-files and check-attr Git commands. The -z option use NUL to delimit filenames, instead of newlines, but importantly avoids quoting file names at all (emphasis mine):

Without the -z option, pathnames with "unusual" characters are quoted as explained for the configuration variable core.quotePath (see git-config[1]). Using -z the filename is output verbatim and the line is terminated by a NUL byte.

The trick will be figuring out how to string together the commands to work with NUL-delimited outputs, especially given that the command sequence relies on adding – then removing – the suffix : filter: crypt to the same line as the original filename of encrypted files.

So far I've gotten as far as the following, which avoids the unwanted filename quoting using -z with the first ls-files command, but then unfortunately adds it back in with the following check-attr command:

git ls-files -z | awk 'BEGIN { RS = "\0" }; { print $0 }' | git check-attr --stdin filter

I suspect a proper fix will require replacing the single-line piped commands with a function instead that will iterate over unquoted filenames (thanks to the -z option), run the check-attr command on each one individually to identify encrypted files, then add the unquoted filename of just the encrypted files to an output string.

jmurty commented 8 months ago

I've started working towards a fix for this in PR #174

In my manual testing the PR works for most situations, but tests are failing for a few use-cases that have been broken by my changes.

jmurty commented 8 months ago

The potential fix on branch 173-handle-quotes-in-filenames is now passing all unit tests and works for my manual testing of file names containing double-quotes. Can you try it and see if it works for you @andreineculau?