aws / aws-cli

Universal Command Line Interface for Amazon Web Services

Provide an option to perform unicode normalization on local file names #1639

Open aaugustin opened 8 years ago

aaugustin commented 8 years ago

Summary

aws s3 sync doesn't play well with HFS+ unicode normalization on OS X. I suggest adding an option that normalizes file names read from the local filesystem to normal form C before doing anything with them.

Reproduction steps

  1. Create a file on S3 whose name contains an accented character. For reasons that will become apparent later, do this on a Linux system.

    (linux) % echo test > test/café.txt
    (linux) % aws s3 sync test s3://<test-bucket>/test
  2. Synchronize that file on a Mac.

    (OS X) % aws s3 sync s3://<test-bucket>/test test
    download: s3://<test-bucket>/test/café.txt to test/café.txt
  3. Synchronize it back to S3.

    (OS X) % aws s3 sync test s3://<test-bucket>/test
    upload: test/café.txt to s3://<test-bucket>/test/café.txt
    • Expected result: no upload, because the file is identical locally and on S3: it was just sync'd!
    • Actual result: the file is uploaded again.

At this point the file shows up twice in S3!

Why this happens

Unicode defines several normal forms. Two of them, NFC (composed) and NFD (decomposed), represent some characters differently, typically the accented characters that are common in Western European languages and even occur in English.

The documentation of unicodedata.normalize, the Python function that converts a string to a given normal form, has a good explanation.

A quick illustration:

>>> import unicodedata
>>> "café".encode('utf-8')
b'caf\xc3\xa9'
>>> unicodedata.normalize('NFC', "café").encode('utf-8')
b'caf\xc3\xa9'
>>> unicodedata.normalize('NFD', "café").encode('utf-8')
b'cafe\xcc\x81'

The default filesystem of OS X, HFS+, enforces something that resembles NFD. (Let's say I haven't encountered the difference yet.)

Pretty much everything else, including typing on a keyboard on Linux or OS X, uses NFC. I'm not sure about Windows.
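
To make the consequence concrete, here is a minimal sketch of why a comparison by name sees two different files. The `names_missing_remotely` helper is hypothetical: it only stands in for whatever comparison aws s3 sync performs internally and is not aws-cli code.

```python
import unicodedata

# The same visible name in the two normal forms.
nfc_name = unicodedata.normalize("NFC", "caf\u00e9.txt")  # "é" as one code point
nfd_name = unicodedata.normalize("NFD", "caf\u00e9.txt")  # "e" + combining acute accent


def names_missing_remotely(local_names, remote_keys):
    """Hypothetical helper: local names with no exactly matching remote key."""
    remote = set(remote_keys)
    return [name for name in local_names if name not in remote]


# The key was uploaded from Linux (NFC); the Mac lists the file in NFD.
to_upload = names_missing_remotely([nfd_name], [nfc_name])

# Normalizing the local name first makes the comparison succeed.
to_upload_normalized = names_missing_remotely(
    [unicodedata.normalize("NFC", nfd_name)], [nfc_name]
)
```

Here `to_upload` still contains the NFD name, which matches the unexpected upload in step 3, while `to_upload_normalized` is empty.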

Of course this is entirely HFS+'s fault, but since OS X is a popular system among your target audience, I hope you may have some interest in providing a solution to this problem.

What you can do about it

I think a --normalize-unicode option (possibly with a better name) for aws s3 sync would be useful. It would normalize file names read from the local filesystem with unicodedata.normalize('NFKC', filepath).

Its primary purpose would be to interact with S3 on OS X and have file names in NFC form on S3, which is what the rest of the world expects and will cause the least amount of problems.
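
The stated goal is NFC on S3, while the proposed call passes 'NFKC'. The two forms agree on accented letters, but NFKC additionally folds compatibility characters. The short check below only illustrates the standard library's behaviour with made-up names; it does not settle which form the option should use.

```python
import unicodedata

decomposed = "cafe\u0301"   # "café" as stored in NFD
ligature = "\ufb01le.txt"   # starts with U+FB01, the "fi" ligature

# Both forms recompose the accent into a single code point.
nfc_accent = unicodedata.normalize("NFC", decomposed)
nfkc_accent = unicodedata.normalize("NFKC", decomposed)

# Only NFKC rewrites the compatibility character into plain "fi".
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)
```

With NFKC, a local file whose name starts with the ligature would be stored under the key file.txt, so NFC may be the less surprising choice if only the HFS+ round trip matters.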

I don't know aws cli well enough to tell which other parts could use this option. I just encountered the problem when trying to replace "rsync to file server" with "aws s3 sync to S3".

FWIW rsync provides a solution to this problem with the --iconv option. A common idiom is --iconv=UTF8-MAC,UTF8 when rsync'ing from OS X to Linux and --iconv=UTF8,UTF8-MAC when rsync'ing from Linux to OS X. UTF8-MAC is what rsync calls the encoding of file names on HFS+.

However, this isn't a good API for tackling the specific problem I'm raising here. This API is about the encoding of file names. The bug is related to Unicode normalization. These are different concepts. UTF8-MAC mixes them.

Thanks!

aaugustin commented 8 years ago

For what it's worth, the following patch solves my problem:

diff -r -u -x '*.pyc' -w awscli.orig/customizations/s3/filegenerator.py awscli/customizations/s3/filegenerator.py
--- awscli.orig/customizations/s3/filegenerator.py  2015-11-15 08:50:45.000000000 +0100
+++ awscli/customizations/s3/filegenerator.py   2015-11-15 18:56:31.000000000 +0100
@@ -13,6 +13,7 @@
 import os
 import sys
 import stat
+import unicodedata

 from dateutil.parser import parse
 from dateutil.tz import tzlocal
@@ -116,11 +117,12 @@
     ``FileInfo`` objects to send to a ``Comparator`` or ``S3Handler``.
     """
     def __init__(self, client, operation_name, follow_symlinks=True,
-                 page_size=None, result_queue=None):
+                 page_size=None, normalize_unicode=False, result_queue=None):
         self._client = client
         self.operation_name = operation_name
         self.follow_symlinks = follow_symlinks
         self.page_size = page_size
+        self.normalize_unicode = normalize_unicode
         self.result_queue = result_queue
         if not result_queue:
             self.result_queue = queue.Queue()
@@ -167,6 +169,8 @@
         """
         join, isdir, isfile = os.path.join, os.path.isdir, os.path.isfile
         error, listdir = os.error, os.listdir
+        if self.normalize_unicode:
+            path = unicodedata.normalize('NFKC', path)
         if not self.should_ignore_file(path):
             if not dir_op:
                 size, last_update = get_file_stat(path)
@@ -185,6 +189,8 @@
                 listdir_names = listdir(path)
                 names = []
                 for name in listdir_names:
+                    if self.normalize_unicode:
+                        name = unicodedata.normalize('NFKC', name)
                     if not self.should_ignore_file_with_decoding_warnings(
                             path, name):
                         file_path = join(path, name)
diff -r -u -x '*.pyc' -w awscli.orig/customizations/s3/s3handler.py awscli/customizations/s3/s3handler.py
--- awscli.orig/customizations/s3/s3handler.py  2015-11-15 08:50:45.000000000 +0100
+++ awscli/customizations/s3/s3handler.py   2015-11-15 09:25:54.000000000 +0100
@@ -64,7 +64,8 @@
                        'grants': None, 'only_show_errors': False,
                        'is_stream': False, 'paths_type': None,
                        'expected_size': None, 'metadata_directive': None,
-                       'ignore_glacier_warnings': False}
+                       'ignore_glacier_warnings': False,
+                       'normalize_unicode': False}
         self.params['region'] = params['region']
         for key in self.params.keys():
             if key in params:
diff -r -u -x '*.pyc' -w awscli.orig/customizations/s3/subcommands.py awscli/customizations/s3/subcommands.py
--- awscli.orig/customizations/s3/subcommands.py    2015-11-15 08:50:45.000000000 +0100
+++ awscli/customizations/s3/subcommands.py 2015-11-15 18:18:23.000000000 +0100
@@ -301,12 +301,21 @@
 }

+NORMALIZE_UNICODE = {
+    'name': 'normalize-unicode', 'action': 'store_true',
+    'help_text': (
+        'Normalizes file names read from the local filesystem in unicode '
+        'normal form KC. This is mainly useful when running on OS X.'
+    )
+}
+
+
 TRANSFER_ARGS = [DRYRUN, QUIET, INCLUDE, EXCLUDE, ACL,
                  FOLLOW_SYMLINKS, NO_FOLLOW_SYMLINKS, NO_GUESS_MIME_TYPE,
                  SSE, STORAGE_CLASS, GRANTS, WEBSITE_REDIRECT, CONTENT_TYPE,
                  CACHE_CONTROL, CONTENT_DISPOSITION, CONTENT_ENCODING,
                  CONTENT_LANGUAGE, EXPIRES, SOURCE_REGION, ONLY_SHOW_ERRORS,
-                 PAGE_SIZE, IGNORE_GLACIER_WARNINGS]
+                 PAGE_SIZE, IGNORE_GLACIER_WARNINGS, NORMALIZE_UNICODE]

 def get_client(session, region, endpoint_url, verify):
@@ -770,10 +779,12 @@
                                        operation_name,
                                        self.parameters['follow_symlinks'],
                                        self.parameters['page_size'],
+                                       self.parameters['normalize_unicode'],
                                        result_queue=result_queue)
         rev_generator = FileGenerator(self._client, '',
                                       self.parameters['follow_symlinks'],
                                       self.parameters['page_size'],
+                                      self.parameters['normalize_unicode'],
                                       result_queue=result_queue)
         taskinfo = [TaskInfo(src=files['src']['path'],
                              src_type='s3',

I'm not submitting it as a PR because it's missing at least tests and documentation. I'm mostly leaving it here in case others find it helpful.

Of course, feel free to use it as a starting point for fixing this issue if my approach doesn't seem too off base.

EDIT: just updated the patch to apply unicode normalization before sorting file names.
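
Normalizing before sorting matters because the two forms do not sort the same way. The sketch below uses two made-up file names and plain code-point ordering (Python's default `sorted`). That the sync comparison expects the local and remote listings in the same order is an assumption here, not something confirmed in this thread.

```python
import unicodedata

names_nfc = ["caf\u00e9.txt", "cafz.txt"]
names_nfd = [unicodedata.normalize("NFD", n) for n in names_nfc]

# In NFD the fourth character of the accented name is a plain "e" (U+0065),
# which sorts before "z"; in NFC it is "é" (U+00E9), which sorts after "z".
sorted_as_listed = sorted(names_nfd)
sorted_after_normalizing = sorted(
    unicodedata.normalize("NFC", n) for n in names_nfd
)
```

The accented name comes first in `sorted_as_listed` and last in `sorted_after_normalizing`, so sorting first and normalizing afterwards would leave the names in an order that no longer matches an NFC listing.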

JordonPhillips commented 8 years ago

Wow, nice work! We'll look into it

aaugustin commented 8 years ago

I created a branch and opened a pull request in order to make it easier to maintain the patch -- the recent release broke it.

aaugustin commented 7 years ago

Here's a new version of the patch, recreated against the latest release.

In case someone else uses it:

diff -r -u -x '*.pyc' -w awscli.orig/customizations/s3/filegenerator.py awscli/customizations/s3/filegenerator.py
--- awscli.orig/customizations/s3/filegenerator.py  2015-11-15 08:50:45.000000000 +0100
+++ awscli/customizations/s3/filegenerator.py   2015-11-15 18:56:31.000000000 +0100
@@ -13,6 +13,7 @@
 import os
 import sys
 import stat
+import unicodedata

 from dateutil.parser import parse
 from dateutil.tz import tzlocal
@@ -116,11 +117,12 @@
     ``FileInfo`` objects to send to a ``Comparator`` or ``S3Handler``.
     """
     def __init__(self, client, operation_name, follow_symlinks=True,
-                 page_size=None, result_queue=None):
+                 page_size=None, normalize_unicode=False, result_queue=None):
         self._client = client
         self.operation_name = operation_name
         self.follow_symlinks = follow_symlinks
         self.page_size = page_size
+        self.normalize_unicode = normalize_unicode
         self.result_queue = result_queue
         if not result_queue:
             self.result_queue = queue.Queue()
@@ -167,6 +169,8 @@
         """
         join, isdir, isfile = os.path.join, os.path.isdir, os.path.isfile
         error, listdir = os.error, os.listdir
+        if self.normalize_unicode:
+            path = unicodedata.normalize('NFKC', path)
         if not self.should_ignore_file(path):
             if not dir_op:
                 size, last_update = get_file_stat(path)
@@ -185,6 +189,8 @@
                 listdir_names = listdir(path)
                 names = []
                 for name in listdir_names:
+                    if self.normalize_unicode:
+                        name = unicodedata.normalize('NFKC', name)
                     if not self.should_ignore_file_with_decoding_warnings(
                             path, name):
                         file_path = join(path, name)
diff -r -u -x '*.pyc' -w awscli.orig/customizations/s3/s3handler.py awscli/customizations/s3/s3handler.py
--- awscli.orig/customizations/s3/s3handler.py  2015-11-15 08:50:45.000000000 +0100
+++ awscli/customizations/s3/s3handler.py   2015-11-15 09:25:54.000000000 +0100
@@ -64,7 +64,8 @@
                        'grants': None, 'only_show_errors': False,
                        'is_stream': False, 'paths_type': None,
                        'expected_size': None, 'metadata_directive': None,
-                       'ignore_glacier_warnings': False}
+                       'ignore_glacier_warnings': False,
+                       'normalize_unicode': False}
         self.params['region'] = params['region']
         for key in self.params.keys():
             if key in params:
diff -r -u -x '*.pyc' -w awscli.orig/customizations/s3/subcommands.py awscli/customizations/s3/subcommands.py
--- awscli.orig/customizations/s3/subcommands.py    2015-11-15 08:50:45.000000000 +0100
+++ awscli/customizations/s3/subcommands.py 2015-11-15 18:18:23.000000000 +0100
@@ -301,12 +301,21 @@
 }

+NORMALIZE_UNICODE = {
+    'name': 'normalize-unicode', 'action': 'store_true',
+    'help_text': (
+        'Normalizes file names read from the local filesystem in unicode '
+        'normal form KC. This is mainly useful when running on OS X.'
+    )
+}
+
+
 TRANSFER_ARGS = [DRYRUN, QUIET, INCLUDE, EXCLUDE, ACL,
                  FOLLOW_SYMLINKS, NO_FOLLOW_SYMLINKS, NO_GUESS_MIME_TYPE,
                  SSE, STORAGE_CLASS, GRANTS, WEBSITE_REDIRECT, CONTENT_TYPE,
                  CACHE_CONTROL, CONTENT_DISPOSITION, CONTENT_ENCODING,
                  CONTENT_LANGUAGE, EXPIRES, SOURCE_REGION, ONLY_SHOW_ERRORS,
-                 PAGE_SIZE, IGNORE_GLACIER_WARNINGS]
+                 PAGE_SIZE, IGNORE_GLACIER_WARNINGS, NORMALIZE_UNICODE]

 def get_client(session, region, endpoint_url, verify):
@@ -770,10 +779,12 @@
                                        operation_name,
                                        self.parameters['follow_symlinks'],
                                        self.parameters['page_size'],
+                                       self.parameters['normalize_unicode'],
                                        result_queue=result_queue)
         rev_generator = FileGenerator(self._client, '',
                                       self.parameters['follow_symlinks'],
                                       self.parameters['page_size'],
+                                      self.parameters['normalize_unicode'],
                                       result_queue=result_queue)
         taskinfo = [TaskInfo(src=files['src']['path'],
                             src_type='s3',

BenAbineriBubble commented 7 years ago

Thanks for the excellent analysis Aymeric, this is exactly the issue I'm experiencing and it was difficult to track down.

I hope somebody from AWS can help us here.

aaugustin commented 7 years ago

Updated version of the patch against the latest release.

commit 78640c7f7a345fb3740b72c239007470a5709caf
Author: Aymeric Augustin
Date:   Tue Dec 20 23:05:49 2016 +0100

    Add an option to normalize file names.

diff --git a/awscli/customizations/s3/filegenerator.py b/awscli/customizations/s3/filegenerator.py
index d33b77f..13a7f1d 100644
--- a/awscli/customizations/s3/filegenerator.py
+++ b/awscli/customizations/s3/filegenerator.py
@@ -13,6 +13,7 @@
 import os
 import sys
 import stat
+import unicodedata

 from dateutil.parser import parse
 from dateutil.tz import tzlocal
@@ -116,7 +117,8 @@ class FileGenerator(object):
     ``FileInfo`` objects to send to a ``Comparator`` or ``S3Handler``.
     """
     def __init__(self, client, operation_name, follow_symlinks=True,
-                 page_size=None, result_queue=None, request_parameters=None):
+                 page_size=None, result_queue=None, request_parameters=None,
+                 normalize_unicode=False):
         self._client = client
         self.operation_name = operation_name
         self.follow_symlinks = follow_symlinks
@@ -127,6 +129,7 @@ class FileGenerator(object):
         self.request_parameters = {}
         if request_parameters is not None:
             self.request_parameters = request_parameters
+        self.normalize_unicode = normalize_unicode

     def call(self, files):
         """
@@ -170,6 +173,8 @@ class FileGenerator(object):
         """
         join, isdir, isfile = os.path.join, os.path.isdir, os.path.isfile
         error, listdir = os.error, os.listdir
+        if self.normalize_unicode:
+            path = unicodedata.normalize('NFKC', path)
         if not self.should_ignore_file(path):
             if not dir_op:
                 size, last_update = get_file_stat(path)
@@ -189,6 +194,8 @@ class FileGenerator(object):
                 listdir_names = listdir(path)
                 names = []
                 for name in listdir_names:
+                    if self.normalize_unicode:
+                        name = unicodedata.normalize('NFKC', name)
                     if not self.should_ignore_file_with_decoding_warnings(
                             path, name):
                         file_path = join(path, name)
diff --git a/awscli/customizations/s3/subcommands.py b/awscli/customizations/s3/subcommands.py
index 4bc7398..04afe3f 100644
--- a/awscli/customizations/s3/subcommands.py
+++ b/awscli/customizations/s3/subcommands.py
@@ -417,6 +417,14 @@ REQUEST_PAYER = {
     )
 }

+NORMALIZE_UNICODE = {
+    'name': 'normalize-unicode', 'action': 'store_true',
+    'help_text': (
+        'Normalizes file names read from the local filesystem in unicode '
+        'normal form KC. This is mainly useful when running on OS X.'
+    )
+}
+
 TRANSFER_ARGS = [DRYRUN, QUIET, INCLUDE, EXCLUDE, ACL,
                  FOLLOW_SYMLINKS, NO_FOLLOW_SYMLINKS, NO_GUESS_MIME_TYPE,
                  SSE, SSE_C, SSE_C_KEY, SSE_KMS_KEY_ID, SSE_C_COPY_SOURCE,
@@ -424,7 +432,8 @@ TRANSFER_ARGS = [DRYRUN, QUIET, INCLUDE, EXCLUDE, ACL,
                  WEBSITE_REDIRECT, CONTENT_TYPE, CACHE_CONTROL,
                  CONTENT_DISPOSITION, CONTENT_ENCODING, CONTENT_LANGUAGE,
                  EXPIRES, SOURCE_REGION, ONLY_SHOW_ERRORS,
-                 PAGE_SIZE, IGNORE_GLACIER_WARNINGS, FORCE_GLACIER_TRANSFER]
+                 PAGE_SIZE, IGNORE_GLACIER_WARNINGS, FORCE_GLACIER_TRANSFER,
+                 NORMALIZE_UNICODE]

 def get_client(session, region, endpoint_url, verify, config=None):
@@ -963,12 +972,14 @@ class CommandArchitecture(object):
             'client': self._source_client, 'operation_name': operation_name,
             'follow_symlinks': self.parameters['follow_symlinks'],
             'page_size': self.parameters['page_size'],
+            'normalize_unicode': self.parameters['normalize_unicode'],
             'result_queue': result_queue,
         }
         rgen_kwargs = {
             'client': self._client, 'operation_name': '',
             'follow_symlinks': self.parameters['follow_symlinks'],
             'page_size': self.parameters['page_size'],
+            'normalize_unicode': self.parameters['normalize_unicode'],
             'result_queue': result_queue,
        }

ishikawa commented 7 years ago

This patch is perfect for me, thanks. 👍

aaugustin commented 7 years ago

Patch rebased on top of develop.

commit c5466f2191b073303edef62d531761591e7e6c90
Author: Aymeric Augustin <aymeric.augustin@m4x.org>
Date:   Tue Dec 20 23:05:49 2016 +0100

    Add an option to normalize file names.

diff --git a/awscli/customizations/s3/filegenerator.py b/awscli/customizations/s3/filegenerator.py
index f24ca187..70a17581 100644
--- a/awscli/customizations/s3/filegenerator.py
+++ b/awscli/customizations/s3/filegenerator.py
@@ -13,6 +13,7 @@
 import os
 import sys
 import stat
+import unicodedata

 from dateutil.parser import parse
 from dateutil.tz import tzlocal
@@ -116,7 +117,8 @@ class FileGenerator(object):
     ``FileInfo`` objects to send to a ``Comparator`` or ``S3Handler``.
     """
     def __init__(self, client, operation_name, follow_symlinks=True,
-                 page_size=None, result_queue=None, request_parameters=None):
+                 page_size=None, result_queue=None, request_parameters=None,
+                 normalize_unicode=False):
         self._client = client
         self.operation_name = operation_name
         self.follow_symlinks = follow_symlinks
@@ -127,6 +129,7 @@ class FileGenerator(object):
         self.request_parameters = {}
         if request_parameters is not None:
             self.request_parameters = request_parameters
+        self.normalize_unicode = normalize_unicode

     def call(self, files):
         """
@@ -170,6 +173,8 @@ class FileGenerator(object):
         """
         join, isdir, isfile = os.path.join, os.path.isdir, os.path.isfile
         error, listdir = os.error, os.listdir
+        if self.normalize_unicode:
+            path = unicodedata.normalize('NFKC', path)
         if not self.should_ignore_file(path):
             if not dir_op:
                 stats = self._safely_get_file_stats(path)
@@ -188,6 +193,8 @@ class FileGenerator(object):
                 listdir_names = listdir(path)
                 names = []
                 for name in listdir_names:
+                    if self.normalize_unicode:
+                        name = unicodedata.normalize('NFKC', name)
                     if not self.should_ignore_file_with_decoding_warnings(
                             path, name):
                         file_path = join(path, name)
diff --git a/awscli/customizations/s3/subcommands.py b/awscli/customizations/s3/subcommands.py
index 02d591ea..b9b1d6c9 100644
--- a/awscli/customizations/s3/subcommands.py
+++ b/awscli/customizations/s3/subcommands.py
@@ -418,6 +418,14 @@ REQUEST_PAYER = {
     )
 }

+NORMALIZE_UNICODE = {
+    'name': 'normalize-unicode', 'action': 'store_true',
+    'help_text': (
+        'Normalizes file names read from the local filesystem in unicode '
+        'normal form KC. This is mainly useful when running on OS X.'
+    )
+}
+
 TRANSFER_ARGS = [DRYRUN, QUIET, INCLUDE, EXCLUDE, ACL,
                  FOLLOW_SYMLINKS, NO_FOLLOW_SYMLINKS, NO_GUESS_MIME_TYPE,
                  SSE, SSE_C, SSE_C_KEY, SSE_KMS_KEY_ID, SSE_C_COPY_SOURCE,
@@ -425,7 +433,8 @@ TRANSFER_ARGS = [DRYRUN, QUIET, INCLUDE, EXCLUDE, ACL,
                  WEBSITE_REDIRECT, CONTENT_TYPE, CACHE_CONTROL,
                  CONTENT_DISPOSITION, CONTENT_ENCODING, CONTENT_LANGUAGE,
                  EXPIRES, SOURCE_REGION, ONLY_SHOW_ERRORS,
-                 PAGE_SIZE, IGNORE_GLACIER_WARNINGS, FORCE_GLACIER_TRANSFER]
+                 PAGE_SIZE, IGNORE_GLACIER_WARNINGS, FORCE_GLACIER_TRANSFER,
+                 NORMALIZE_UNICODE]

 def get_client(session, region, endpoint_url, verify, config=None):
@@ -964,12 +973,14 @@ class CommandArchitecture(object):
             'client': self._source_client, 'operation_name': operation_name,
             'follow_symlinks': self.parameters['follow_symlinks'],
             'page_size': self.parameters['page_size'],
+            'normalize_unicode': self.parameters['normalize_unicode'],
             'result_queue': result_queue,
         }
         rgen_kwargs = {
             'client': self._client, 'operation_name': '',
             'follow_symlinks': self.parameters['follow_symlinks'],
             'page_size': self.parameters['page_size'],
+            'normalize_unicode': self.parameters['normalize_unicode'],
             'result_queue': result_queue,
        }

JordonPhillips commented 6 years ago

I had a bit of free time this morning so I took a look at this. It doesn't look like this will work since we will need to operate on those files down the line and having the altered path will break that. I think the changes necessary to fully support this feature would need to be more invasive.
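
One way to picture the concern: once a name has been normalized, it may no longer match what is on disk, so later file operations still need the original path. Below is a minimal sketch of carrying both forms along. `LocalEntry` and `list_local_entries` are hypothetical names used for illustration and are not part of aws-cli.

```python
import os
import unicodedata
from typing import Iterator, NamedTuple


class LocalEntry(NamedTuple):
    disk_path: str  # exactly as returned by the filesystem; use for open()/stat()
    key_name: str   # NFC-normalized; use for comparing with and naming S3 keys


def list_local_entries(directory: str) -> Iterator[LocalEntry]:
    """Yield each regular file in `directory` with both forms of its name."""
    for name in os.listdir(directory):
        disk_path = os.path.join(directory, name)
        if os.path.isfile(disk_path):
            yield LocalEntry(disk_path, unicodedata.normalize("NFC", name))
```

On a filesystem that does not normalize names, opening a path built from `key_name` instead of `disk_path` could fail, which is presumably the kind of breakage described above.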

ASayre commented 6 years ago

Good Morning!

We're closing this issue here on GitHub, as part of our migration to UserVoice for feature requests involving the AWS CLI.

This will let us get the most important features to you, by making it easier to search for and show support for the features you care the most about, without diluting the conversation with bug reports.

As a quick UserVoice primer (if not already familiar): after an idea is posted, people can vote on the ideas, and the product team will be responding directly to the most popular suggestions.

We’ve imported existing feature requests from GitHub - Search for this issue there!

And don't worry, this issue will still exist on GitHub for posterity's sake. As it’s a text-only import of the original post into UserVoice, we’ll still be keeping in mind the comments and discussion that already exist here on the GitHub issue.

GitHub will remain the channel for reporting bugs.

Once again, this issue can now be found by searching for the title on: https://aws.uservoice.com/forums/598381-aws-command-line-interface

-The AWS SDKs & Tools Team

This entry can specifically be found on UserVoice at: https://aws.uservoice.com/forums/598381-aws-command-line-interface/suggestions/33168379-provide-an-option-to-perfom-unicode-normalization

salmanwaheed commented 6 years ago

This message was created automatically by mail delivery software.

A message that you sent could not be delivered to one or more of its recipients. This is a temporary error. The following address(es) deferred:

mkdirenv@gmail.com Domain salmanwaheed.info has exceeded the max emails per hour (163/150 (108%)) allowed. Message will be reattempted later

------- This is a copy of the message, including all the headers. ------ Received: from github-smtp2-ext1.iad.github.net ([192.30.252.192]:34761 helo=github-smtp2a-ext-cp1-prd.iad.github.net) by box1177.bluehost.com with esmtps (TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384:256) (Exim 4.89_1) (envelope-from noreply@github.com) id 1ej0Pc-001aoJ-Eq for hello@salmanwaheed.info; Tue, 06 Feb 2018 03:23:40 -0700 Date: Tue, 06 Feb 2018 02:23:29 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=github.com; s=pf2014; t=1517912609; bh=s25/ZHjWhyhYV9V97C8YTJNZ5BORhSs5xPzdklFZIKk=; h=From:Reply-To:To:Cc:In-Reply-To:References:Subject:List-ID: List-Archive:List-Post:List-Unsubscribe:From; b=Z5vLfuztlKa3gUlFxh+rQiu6Swt+G7hinUV/cSIOkbzYfAWamnhD0ULyBqsv52peJ stwTFQoWt4in2Tf4AhG9ZXAivaotPW0i81bIOZjiXnFd8vfgaVj0s3bxRpwx4Tj/6r FuFEFp5+1eaUj88/4+viBqt+X152syrZ3YEkGWjo= From: Andre Sayre notifications@github.com Reply-To: aws/aws-cli reply@reply.github.com To: aws/aws-cli aws-cli@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Message-ID: aws/aws-cli/issue/1639/issue_event/1459789997@github.com In-Reply-To: aws/aws-cli/issues/1639@github.com References: aws/aws-cli/issues/1639@github.com Subject: Re: [aws/aws-cli] Provide an option to perfom unicode normalization on local file names (#1639) Mime-Version: 1.0 Content-Type: multipart/alternative; boundary="--==_mimepart_5a798221e024c_1fee2ad9784d8ed44348e1"; charset=UTF-8 Content-Transfer-Encoding: 7bit Precedence: list X-GitHub-Sender: ASayre X-GitHub-Recipient: salmanwaheed X-GitHub-Reason: subscribed List-ID: aws/aws-cli List-Archive: https://github.com/aws/aws-cli List-Post: mailto:reply@reply.github.com List-Unsubscribe: mailto:unsub+00ef1b3886c2f355df86ecca0a66fe83b63582510a0cc5b792cf000000011691442192a169ce06f89887@reply.github.com, https://github.com/notifications/unsubscribe/AO8bOM9ETFXf7BbCu4Gt-bci8Pk4jmUHks5tSCghgaJpZM4Gibvq X-Auto-Response-Suppress: All X-GitHub-Recipient-Address: hello@salmanwaheed.info 
X-Spam-Status: No, score=0.5 X-Spam-Score: 5 X-Spam-Bar: / X-Ham-Report: Spam detection software, running on the system "box1177.bluehost.com", has NOT identified this incoming email as spam. The original message has been attached to this so you can view it or label similar future email. If you have any questions, see root\@localhost for details.

Content preview: Closed #1639. -- You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub: https://github.com/aws/aws-cli/issues/1639#event-1459789997 Closed #1639. [...]

Content analysis details: (0.5 points, 5.0 required)

pts rule name description


0.0 URIBL_BLOCKED ADMINISTRATOR NOTICE: The query to URIBL was blocked. See http://wiki.apache.org/spamassassin/DnsBlocklists#dnsbl-block for more information. [URIs: github.com] -0.5 SPF_PASS SPF: sender matches SPF record 0.0 HTML_MESSAGE BODY: HTML included in message 0.7 HTML_IMAGE_ONLY_20 BODY: HTML: images with 1600-2000 bytes of words -0.1 DKIM_VALID Message has at least one valid DKIM or DK signature 2.5 DCC_CHECK No description available. -0.1 DKIM_VALID_AU Message has a valid DKIM or DK signature from author's domain 0.1 DKIM_SIGNED Message has a DKIM or DK signature, not necessarily valid -2.1 AWL AWL: Adjusted score from AWL reputation of From: address X-Spam-Flag: NO

----==_mimepart_5a798221e024c_1fee2ad9784d8ed44348e1 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit

Closed #1639.

-- You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub: https://github.com/aws/aws-cli/issues/1639#event-1459789997 ----==_mimepart_5a798221e024c_1fee2ad9784d8ed44348e1 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: 7bit

Closed #1639.


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub, or mute the thread.

----==_mimepart_5a798221e024c_1fee2ad9784d8ed44348e1--

salmanwaheed commented 6 years ago

This message was created automatically by mail delivery software.

A message that you sent could not be delivered to one or more of its recipients. This is a temporary error. The following address(es) deferred:

mkdirenv@gmail.com Domain salmanwaheed.info has exceeded the max emails per hour (162/150 (108%)) allowed. Message will be reattempted later

------- This is a copy of the message, including all the headers. ------ ------ The body of the message is 6170 characters long; only the first ------ 5000 or so are included here. Received: from github-smtp2-ext1.iad.github.net ([192.30.252.192]:34195 helo=github-smtp2a-ext-cp1-prd.iad.github.net) by box1177.bluehost.com with esmtps (TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384:256) (Exim 4.89_1) (envelope-from noreply@github.com) id 1ej0Pb-001aoA-8m for hello@salmanwaheed.info; Tue, 06 Feb 2018 03:23:39 -0700 Date: Tue, 06 Feb 2018 02:23:28 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=github.com; s=pf2014; t=1517912608; bh=Y/hd9JmoeMXxH6KcRXvfPyHL6nLfCP0pkkFmBhdNXcw=; h=From:Reply-To:To:Cc:In-Reply-To:References:Subject:List-ID: List-Archive:List-Post:List-Unsubscribe:From; b=cAiSo4/7KEkv8Y09Jc9toFjiBRsftUbnU6o4wAN3r99MK75KQdvfWNMs47IuPeIUc iLCjtWYRi66OiNWPx41icZ/f1wzH67rnKH4BuzQh6wgR//S+gtQfFyNCEHUh7Y+fHN bzgdujckmQC6NeZe79OADG6IM+i3wW0Cx/+8B6sw= From: Andre Sayre notifications@github.com Reply-To: aws/aws-cli reply@reply.github.com To: aws/aws-cli aws-cli@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Message-ID: aws/aws-cli/issues/1639/363377996@github.com In-Reply-To: aws/aws-cli/issues/1639@github.com References: aws/aws-cli/issues/1639@github.com Subject: Re: [aws/aws-cli] Provide an option to perfom unicode normalization on local file names (#1639) Mime-Version: 1.0 Content-Type: multipart/alternative; boundary="--==_mimepart_5a7982208e1ea_167e2aec7b20eecc215875"; charset=UTF-8 Content-Transfer-Encoding: 7bit Precedence: list X-GitHub-Sender: ASayre X-GitHub-Recipient: salmanwaheed X-GitHub-Reason: subscribed List-ID: aws/aws-cli List-Archive: https://github.com/aws/aws-cli List-Post: mailto:reply@reply.github.com List-Unsubscribe: mailto:unsub+00ef1b3846cf8d2c826fcd2da1df396c9316499bb49bdbe792cf000000011691442092a169ce06f89887@reply.github.com, 
https://github.com/notifications/unsubscribe/AO8bOGxOP_4Qx_TAGx-UXBEgDiRQuEKBks5tSCgggaJpZM4Gibvq X-Auto-Response-Suppress: All X-GitHub-Recipient-Address: hello@salmanwaheed.info X-Spam-Status: No, score=-1.1 X-Spam-Score: -10 X-Spam-Bar: - X-Ham-Report: Spam detection software, running on the system "box1177.bluehost.com", has NOT identified this incoming email as spam. The original message has been attached to this so you can view it or label similar future email. If you have any questions, see root\@localhost for details.

Content preview: Good Morning! We're closing this issue here on GitHub, as part of our migration to UserVoice for feature requests involving the AWS CLI. [...]

Content analysis details: (-1.1 points, 5.0 required)

pts rule name description


0.0 URIBL_BLOCKED ADMINISTRATOR NOTICE: The query to URIBL was blocked. See http://wiki.apache.org/spamassassin/DnsBlocklists#dnsbl-block for more information. [URIs: github.com] -0.5 SPF_PASS SPF: sender matches SPF record 0.0 HTML_MESSAGE BODY: HTML included in message -0.1 DKIM_VALID Message has at least one valid DKIM or DK signature -0.1 DKIM_VALID_AU Message has a valid DKIM or DK signature from author's domain 0.1 DKIM_SIGNED Message has a DKIM or DK signature, not necessarily valid -0.5 AWL AWL: Adjusted score from AWL reputation of From: address X-Spam-Flag: NO

----==_mimepart_5a7982208e1ea_167e2aec7b20eecc215875 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable

Good Morning!

We're closing this issue here on GitHub, as part of our migration to UserVoice for feature requests involving the AWS CLI.

This will let us get the most important features to you, by making it easier to search for and show support for the features you care the most about, without diluting the conversation with bug reports.

As a quick UserVoice primer (if not already familiar): after an idea is posted, people can vote on the ideas, and the product team will be responding directly to the most popular suggestions.

We’ve imported existing feature requests from GitHub - Search for this issue there!

And don't worry, this issue will still exist on GitHub for posterity's sake. As it’s a text-only import of the original post into UserVoice, we’ll still be keeping in mind the comments and discussion that already exist here on the GitHub issue.

GitHub will remain the channel for reporting bugs.

Once again, this issue can now be found by searching for the title on: https://aws.uservoice.com/forums/598381-aws-command-line-interface

-The AWS SDKs & Tools Team


aaugustin commented 6 years ago

Patch updated (again).

diff -Naur awscli.orig/customizations/s3/filegenerator.py awscli/customizations/s3/filegenerator.py
--- awscli.orig/customizations/s3/filegenerator.py    2018-03-04 21:29:37.000000000 +0100
+++ awscli/customizations/s3/filegenerator.py 2018-03-04 21:31:07.000000000 +0100
@@ -13,6 +13,7 @@
 import os
 import sys
 import stat
+import unicodedata

 from dateutil.parser import parse
 from dateutil.tz import tzlocal
@@ -116,7 +117,8 @@
     ``FileInfo`` objects to send to a ``Comparator`` or ``S3Handler``.
     """
     def __init__(self, client, operation_name, follow_symlinks=True,
-                 page_size=None, result_queue=None, request_parameters=None):
+                 page_size=None, result_queue=None, request_parameters=None,
+                 normalize_unicode=False):
         self._client = client
         self.operation_name = operation_name
         self.follow_symlinks = follow_symlinks
@@ -127,6 +129,7 @@
         self.request_parameters = {}
         if request_parameters is not None:
             self.request_parameters = request_parameters
+        self.normalize_unicode = normalize_unicode

     def call(self, files):
         """
@@ -170,6 +173,8 @@
         """
         join, isdir, isfile = os.path.join, os.path.isdir, os.path.isfile
         error, listdir = os.error, os.listdir
+        if self.normalize_unicode:
+            path = unicodedata.normalize('NFKC', path)
         if not self.should_ignore_file(path):
             if not dir_op:
                 stats = self._safely_get_file_stats(path)
@@ -188,6 +193,8 @@
                 listdir_names = listdir(path)
                 names = []
                 for name in listdir_names:
+                    if self.normalize_unicode:
+                        name = unicodedata.normalize('NFKC', name)
                     if not self.should_ignore_file_with_decoding_warnings(
                             path, name):
                         file_path = join(path, name)
diff -Naur awscli.orig/customizations/s3/subcommands.py awscli/customizations/s3/subcommands.py
--- awscli.orig/customizations/s3/subcommands.py  2018-03-04 21:29:37.000000000 +0100
+++ awscli/customizations/s3/subcommands.py   2018-03-04 21:33:41.000000000 +0100
@@ -427,6 +427,15 @@
     )
 }

+NORMALIZE_UNICODE = {
+    'name': 'normalize-unicode', 'action': 'store_true',
+    'help_text': (
+        'Normalizes file names read from the local filesystem in unicode '
+        'normal form KC. This is mainly useful when running on macOS.'
+    )
+}
+
+
 TRANSFER_ARGS = [DRYRUN, QUIET, INCLUDE, EXCLUDE, ACL,
                  FOLLOW_SYMLINKS, NO_FOLLOW_SYMLINKS, NO_GUESS_MIME_TYPE,
                  SSE, SSE_C, SSE_C_KEY, SSE_KMS_KEY_ID, SSE_C_COPY_SOURCE,
@@ -435,7 +444,7 @@
                  CONTENT_DISPOSITION, CONTENT_ENCODING, CONTENT_LANGUAGE,
                  EXPIRES, SOURCE_REGION, ONLY_SHOW_ERRORS, NO_PROGRESS,
                  PAGE_SIZE, IGNORE_GLACIER_WARNINGS, FORCE_GLACIER_TRANSFER,
-                 REQUEST_PAYER]
+                 REQUEST_PAYER, NORMALIZE_UNICODE]

 def get_client(session, region, endpoint_url, verify, config=None):
@@ -978,12 +987,14 @@
             'follow_symlinks': self.parameters['follow_symlinks'],
             'page_size': self.parameters['page_size'],
             'result_queue': result_queue,
+            'normalize_unicode': self.parameters['normalize_unicode'],
         }
         rgen_kwargs = {
             'client': self._client, 'operation_name': '',
             'follow_symlinks': self.parameters['follow_symlinks'],
             'page_size': self.parameters['page_size'],
             'result_queue': result_queue,
+            'normalize_unicode': self.parameters['normalize_unicode'],
         }

         fgen_request_parameters = \
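
For anyone wondering what the two `unicodedata.normalize('NFKC', ...)` calls in the patch do to a name, here is a rough standalone sketch. `normalize_names` is a hypothetical helper written only for this illustration; it is not part of awscli, and the file names are made up.

```python
import unicodedata


def normalize_names(names, form="NFKC"):
    """Return the names converted to the given Unicode normal form.

    Hypothetical helper for illustration only; not part of awscli.
    """
    return [unicodedata.normalize(form, name) for name in names]


# Decomposed spelling, as HFS+ typically returns it:
# a plain "e" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "cafe\u0301.txt"
# Precomposed spelling, as usually typed on Linux: a single U+00E9.
composed = "caf\u00e9.txt"

# The two spellings look the same but compare unequal as strings.
assert decomposed != composed
# After normalization the local name matches the precomposed one.
assert normalize_names([decomposed]) == [composed]

# NFKC also folds compatibility characters, which NFC leaves alone.
ligature = "\ufb01le.txt"  # starts with U+FB01 LATIN SMALL LIGATURE FI
assert unicodedata.normalize("NFC", ligature) == ligature
assert unicodedata.normalize("NFKC", ligature) == "file.txt"
```

The last two assertions show the one place where NFKC and NFC differ for file names: NFKC rewrites compatibility characters such as ligatures, so a name containing one would be compared under a different spelling than the one on disk. Whether that matters depends on the names actually in use.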
aaugustin commented 3 years ago

FTR the last version of the patch still works.