ansible / proposals

Repository for sharing and tracking progress on enhancement proposals for Ansible.
Creative Commons Zero v1.0 Universal

Encodings for file contents #121

Open abadger opened 6 years ago

abadger commented 6 years ago

Proposal: Encodings for file contents

Author: Toshio Kuratomi <@abadger> IRC: abadger1999

Date: 2018/06/24

Motivation

In Ansible, we default to UTF-8 as the encoding of any data that we get from outside systems which do not have an explicit encoding. We will likely never change this for things which are clearly part of Ansible's underlying structure (Playbooks, inventory, vars). However, there are also things outside of Ansible's structure where a different encoding may be needed.

Problems

What problems exist that this proposal will solve?

Solution proposal

Modules and plugins which read or modify file contents may add parameters to tell Ansible the encoding of these files. A module or plugin may add a parameter for input, a parameter for output, or both, one for input and one for output. A module or plugin must not add a single parameter that handles both input and output. The input parameter should be named input_encoding and the output parameter should be named output_encoding. When these parameters are not set, the module or plugin must default to utf-8.
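A minimal sketch of the read-modify-write flow these rules imply, assuming a module that edits a text file. The helper `replace_line` is hypothetical and not part of Ansible; only the parameter names `input_encoding` and `output_encoding` and the utf-8 default come from this proposal.

```python
def replace_line(raw, old, new, input_encoding="utf-8", output_encoding="utf-8"):
    """Decode raw bytes, swap one line, and re-encode the result.

    Hypothetical helper for illustration; not an Ansible API.
    """
    # Decode with the encoding the user declared for the existing file.
    text = raw.decode(input_encoding)
    lines = [new if line == old else line for line in text.split("\n")]
    # Encode with the separately declared output encoding.
    return "\n".join(lines).encode(output_encoding)


# A Latin-1 file is read and written back as UTF-8.
original = "name=José\nshell=/bin/sh\n".encode("iso-8859-1")
converted = replace_line(
    original,
    "shell=/bin/sh",
    "shell=/bin/bash",
    input_encoding="iso-8859-1",
    output_encoding="utf-8",
)
```

Keeping the two parameters separate is what lets the same call convert a file while editing it; a single combined parameter could not express that.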

As part of this proposal, the template module will grow an output_encoding parameter for the 2.7 release. Template files on the controller will still be encoded in utf-8 but the user will be able to choose what encoding the output file will be. Other modules and plugins may implement these parameters but that work is not being scoped as part of acceptance of this proposal.
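The asymmetry described here (template source always utf-8, only the output encoding selectable) can be sketched as follows. This is an illustration, not the template module's implementation: `render_to_bytes` is a hypothetical helper, and the standard library's `string.Template` stands in for Jinja2.

```python
from string import Template


def render_to_bytes(template_source, variables, output_encoding="utf-8"):
    """Render UTF-8 template bytes and encode the result for the target file.

    Hypothetical helper; string.Template stands in for Jinja2 here.
    """
    # Template files on the controller are always read as UTF-8.
    text = Template(template_source.decode("utf-8")).substitute(variables)
    # Only the rendered output honors the user's chosen encoding.
    return text.encode(output_encoding)


source = "Willkommen, $user! Größe: $size\n".encode("utf-8")
rendered = render_to_bytes(
    source, {"user": "Jürgen", "size": "10"}, output_encoding="iso-8859-1"
)
```

In this sketch, a character that the target encoding cannot represent (for example a euro sign with iso-8859-1) raises `UnicodeEncodeError`; how a real module should report that case is not specified by this proposal.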

Testing (optional)

The template module must grow integration tests for:

Documentation (optional)

Coding guidelines for modules and plugins need to be updated to state the rules about input and output encodings given in the solution proposal. The documentation should also include examples. Template, copy, and lineinfile would make good examples.

Anything else?

Problems not solved

Files with mixed encoding

At the Contributor Summit at AnsibleFest San Francisco 2017 and after, it was discussed whether Ansible should support textual files which have multiple encodings. (The specific example was an /etc/passwd file which had different encodings on every line.) Catering to this use case was rejected. Such files might happen to work (for instance, if Python 2 byte strings were used or if Python 3 with surrogateescape happened to decode the file fine), but this would not be considered a feature that we would keep working, nor a bug if it did not work. Textual files must be in a single encoding.

Users would be able to work around this by treating these as binary files with tools/modules/plugins built for operating on binary files (not built yet, but we'd accept modules for this).
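A small sketch of why mixed-encoding files can appear to work by accident, and of the binary workaround. The sample data is invented for illustration; the calls are plain Python 3 `bytes`/`str` methods.

```python
# Two lines in different encodings, as in the /etc/passwd example.
mixed = "alice:Zoë\n".encode("utf-8") + "bob:José\n".encode("iso-8859-1")

# Strict UTF-8 decoding rejects the Latin-1 line.
try:
    mixed.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape smuggles undecodable bytes through as lone surrogates,
# so the data happens to round-trip unchanged.
text = mixed.decode("utf-8", errors="surrogateescape")
round_tripped = text.encode("utf-8", errors="surrogateescape")

# Treating the file as binary sidesteps decoding entirely.
patched = mixed.replace(b"alice:", b"carol:")
```

The round trip succeeds only because no code inspected or re-encoded the escaped characters strictly, which is why this behavior is not something to rely on.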

Non-utf8 Filesystem paths

UNIX filesystems treat file paths as strings of bytes. As such, a file path may be composed of text in any encoding or even byte values with no intended text representation. A single path (even a single filename) may not be decodable in utf-8 or even any single encoding. This proposal does not attempt to address this problem.
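To illustrate the problem being left out of scope, here is a sketch assuming a POSIX system, where Python's `os.fsdecode` and `os.fsencode` use the surrogateescape error handler. The filename is made up and no file is created.

```python
import os

# A filename in Latin-1: the byte 0xE9 is not valid UTF-8 on its own.
raw_name = b"caf\xe9.txt"

try:
    raw_name.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False

# os.fsdecode/os.fsencode round-trip arbitrary path bytes on POSIX.
as_text = os.fsdecode(raw_name)
restored = os.fsencode(as_text)
```

The bytes survive the round trip, but the intermediate string is not necessarily meaningful text, so it cannot safely be displayed or re-encoded in another encoding.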

Vars files, inventory, playbooks, and other Ansible resources

This proposal makes no attempt to change the format of files and resources whose format Ansible itself defines. These are still always encoded in utf-8.

webknjaz commented 6 years ago

I like this proposal.

abadger commented 6 years ago

At today's meeting, this was approved.

pjmcquade commented 6 years ago

Thank you, we were one of the groups impacted by this at work. We had to use 'iconv' as a kludge to fix this. For the sake of others reading this thread:

- name: get the list of files in our playbook templates directory and subdirectories
  shell: |
    find templates -type f
  register: file_list

- debug: msg="file list {{ file_list.stdout_lines }}"

- name: obtaining file stat info on each file in our templates directory
  stat:
    path: "{{ item }}"
  register: f
  with_items: "{{ file_list.stdout_lines }}"

- debug:
    msg: "file {{ item.stat.path }} charset is {{ item.stat.charset }}"
  with_items: "{{ f.results }}"

# Once these files are converted, this code will never match again on files of
# type iso-8859-1 as the following converts them into UTF-8 in place.  You may
# want to back them up first....up to you.
# The following has only been tested on Red Hat.
- name: converting files in place to utf-8 format if they are of character set iso-8859-1
  shell: |
    iconv -f ISO-8859-1 -t UTF-8 {{ item.stat.path | quote }} -o {{ item.stat.path | quote }}
  register: iconv_status
  with_items: "{{ f.results }}"
  when: '"iso-8859-1" in item.stat.charset'
  failed_when: iconv_status.rc != 0

- name: now we look at our (hopefully) converted file list again
  stat:
    path: "{{ item }}"
  register: f
  with_items: "{{ file_list.stdout_lines }}"

- debug:
    msg: "file {{ item.stat.path }} charset is {{ item.stat.charset }}"
  with_items: "{{ f.results }}"

Hope this was somewhat useful.

abadger commented 5 years ago

Thanks to @birdypme, output_encoding has been implemented for the template module for 2.7: https://github.com/ansible/ansible/pull/42171