abadger commented 6 years ago

Proposal: Encodings for file contents

Author: Toshio Kuratomi <@abadger> IRC: abadger1999

Date: 2018/06/24

Status: Approved, implementation in progress
Proposal type: Plugin Design
Targeted Release: 2.7
Associated PR: <link to GH PR in ansible/proposals if PR was submitted>
Estimated time to implement:
- Add to guidelines: 3 weeks of sparse work (lots of waiting on other people to give feedback)
- Code for the template module: 1 week inclusive of testing [Done and merged]
- Other modules/plugins: outside the scope of this proposal. Others can use the guidelines if they want to do that.

Motivation

In Ansible, we default to UTF-8 as the encoding of any data that we get from outside systems which do not have an explicit encoding. We will likely never change this for things which are clearly part of Ansible's underlying structure (Playbooks, inventory, vars). However, there are also things outside of Ansible's structure where a different encoding may be needed.

Problems

What problems exist that this proposal will solve?

- A managed machine has users who use a non-utf8 encoding as their default locale (for instance, LANG=ja_JP.eucJP). The administrator wants to use Ansible to place a README file, written in Japanese and containing the rules for using the machine, into new users' home directories. These files need to be encoded in eucJP for the users to be able to read them sensibly.
- A managed machine has a program which uses latin-1 for all of its files. The administrator wants to add a configuration file which contains non-ascii, latin-1 characters.
- A managed machine has a program which uses latin-1 for all of its files. The administrator wants to add a line to a configuration file which already contains non-ascii, latin-1 characters.

Solution proposal

Modules and plugins which read or modify file contents may add parameters to tell Ansible the encoding of these files. A module or plugin may add a parameter for input, a parameter for output, or two parameters, one for input and one for output. A module or plugin must not add a single parameter that handles both input and output. The name of the input parameter should be input_encoding. The name of the output parameter should be output_encoding. If these parameters are not set, the module or plugin must default to utf-8.
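As a rough illustration of the parameter rules above, here is a minimal Python sketch. The helper name rewrite_file and the file app.conf are hypothetical and are not part of Ansible; only the parameter names input_encoding and output_encoding and the utf-8 default come from this proposal.

```python
import os
import tempfile


def rewrite_file(path, transform, input_encoding="utf-8", output_encoding="utf-8"):
    """Read path as text, apply transform, and write the result back.

    Hypothetical helper, not Ansible code. It only illustrates that input and
    output encodings are separate parameters which both default to utf-8.
    """
    # newline="" disables newline translation so the bytes stay predictable.
    with open(path, "r", encoding=input_encoding, newline="") as f:
        text = f.read()
    with open(path, "w", encoding=output_encoding, newline="") as f:
        f.write(transform(text))


# Usage: append a line to a latin-1 file and keep the file latin-1.
with tempfile.TemporaryDirectory() as tmp:
    conf = os.path.join(tmp, "app.conf")
    with open(conf, "wb") as f:
        f.write("name = café\n".encode("latin-1"))

    rewrite_file(
        conf,
        lambda text: text + "city = Zürich\n",
        input_encoding="latin-1",
        output_encoding="latin-1",
    )

    with open(conf, "rb") as f:
        result = f.read()
```

Keeping the two parameters separate means a caller could also read latin-1 and write utf-8 with the same helper, which a single combined parameter could not express.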

As part of this proposal, the template module will grow an output_encoding parameter for the 2.7 release. Template files on the controller will still be encoded in utf-8 but the user will be able to choose what encoding the output file will be. Other modules and plugins may implement these parameters but that work is not being scoped as part of acceptance of this proposal.

Testing (optional)

The template module must grow integration tests for:

- Output non-ascii to utf-8, the default
- Output non-ascii to latin-1 utilizing the output_encoding parameter
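The two cases above come down to byte-level checks on the rendered output. The following Python sketch shows the idea only; render_to_bytes is a hypothetical stand-in that uses str.format, whereas the real template module renders with Jinja2 and would be exercised through integration tests.

```python
def render_to_bytes(template_text, variables, output_encoding="utf-8"):
    """Fill a str.format-style template and encode the result.

    Hypothetical stand-in for the template module, for illustration only.
    """
    return template_text.format(**variables).encode(output_encoding)


TEMPLATE = "greeting = {greeting}\n"
VARIABLES = {"greeting": "café"}

# Case 1: non-ascii output with the default encoding (utf-8).
default_bytes = render_to_bytes(TEMPLATE, VARIABLES)
assert default_bytes == b"greeting = caf\xc3\xa9\n"

# Case 2: non-ascii output with output_encoding set to latin-1.
latin1_bytes = render_to_bytes(TEMPLATE, VARIABLES, output_encoding="latin-1")
assert latin1_bytes == b"greeting = caf\xe9\n"
```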

Documentation (optional)

Coding guidelines for modules and plugins need to be updated to state the rules about input and output encodings given in the solution proposal. The documentation should also include examples. Template, copy, and lineinfile would make good examples.

- Template reads files as utf-8, operates on the text inside of the template file with other variables that Ansible knows about, and then attempts to write the file on the remote machine with the user's given output_encoding.
- Copy reads a file as bytes and then writes the file as bytes on the remote machine. Copy never has to understand the contents of the file so it never needs an input or output encoding.
- lineinfile reads a file from disk and treats it as text which it then examines to decide where to place a new line of text. The text is then written back out to the file on the remote machine. lineinfile may provide both an input_encoding and output_encoding for users to work with the file contents.
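The difference between the template and copy cases can be sketched in Python as follows. The functions copy_like and template_like and all file names are hypothetical, and str.format stands in for Jinja2; this is not how the Ansible modules are implemented, only a picture of where decoding does and does not happen.

```python
import os
import shutil
import tempfile


def copy_like(src, dest):
    """Copy bytes verbatim. Nothing is decoded, so no encoding is needed."""
    with open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)


def template_like(src, dest, variables, output_encoding="utf-8"):
    """Read the template as utf-8, substitute variables, write in output_encoding."""
    with open(src, "r", encoding="utf-8", newline="") as fin:
        text = fin.read()
    with open(dest, "w", encoding=output_encoding, newline="") as fout:
        fout.write(text.format(**variables))


with tempfile.TemporaryDirectory() as tmp:
    # A README that is already eucJP on disk: copy_like never decodes it.
    readme_src = os.path.join(tmp, "README.eucjp")
    original = "利用規則\n".encode("euc_jp")
    with open(readme_src, "wb") as f:
        f.write(original)
    readme_dest = os.path.join(tmp, "README.copy")
    copy_like(readme_src, readme_dest)
    with open(readme_dest, "rb") as f:
        copied = f.read()

    # A utf-8 template rendered for users whose locale is eucJP.
    tmpl_src = os.path.join(tmp, "README.tmpl")
    with open(tmpl_src, "w", encoding="utf-8", newline="") as f:
        f.write("{host} の利用規則\n")
    tmpl_dest = os.path.join(tmp, "README.rendered")
    template_like(tmpl_src, tmpl_dest, {"host": "web01"}, output_encoding="euc_jp")
    with open(tmpl_dest, "rb") as f:
        rendered = f.read()
```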

Anything else?

Problems not solved

Files with mixed encoding

At the Contributor Summit at AnsibleFest San Francisco 2017 and after, it was discussed whether Ansible should support textual files which have multiple encodings. (The specific example was an /etc/passwd file which had different encodings on every line.) Catering to this use case was rejected. This could happen accidentally (for instance, if python2 byte strings were used or if python3 with surrogateescape happened to decode the file fine) but it would not be considered a feature that we would keep working or a bug if it did not work. Textual files must be a single encoding.
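To make the mixed-encoding case concrete, here is a small Python 3 sketch with made-up passwd-style entries. It shows why such a file cannot be treated as text in one encoding, and how surrogateescape can let the bytes pass through by accident.

```python
# Two passwd-style lines (made-up data), each in a different encoding.
mixed = (
    "alice:x:1000:1000:Alice Müller:/home/alice:/bin/sh\n".encode("utf-8")
    + "bob:x:1001:1001:Bob Lefèvre:/home/bob:/bin/sh\n".encode("latin-1")
)

# Strict utf-8 decoding fails on the latin-1 line.
try:
    mixed.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# latin-1 maps every byte to a character, so decoding "succeeds",
# but the utf-8 line turns into mojibake.
mojibake = mixed.decode("latin-1")

# surrogateescape carries the undecodable byte through as a lone surrogate,
# so the bytes round-trip even though part of the text is not meaningful.
text = mixed.decode("utf-8", errors="surrogateescape")
round_tripped = text.encode("utf-8", errors="surrogateescape")
```

The round trip working here is the kind of accident described above: the bytes survive, but nothing guarantees that text operations on the file give sensible results.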

Users would be able to work around this by treating these as binary files, using tools/modules/plugins built for operating on binary files (not built yet, but we'd accept modules for this).

Non-utf8 Filesystem paths

UNIX filesystems treat file paths as strings of bytes. As such, a file path may be composed of text in any encoding or even byte values with no intended text representation. A single path (even a single filename) may not be decodable in utf-8 or even any single encoding. This proposal does not attempt to address this problem.
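The following Python sketch illustrates the problem with a hypothetical path whose directory name is eucJP and whose filename is latin-1. It is only a demonstration of why such paths are best kept as bytes; it is not a proposed solution.

```python
import posixpath

# Hypothetical path: the directory name is eucJP, the filename is latin-1.
directory = "文書".encode("euc_jp")
filename = "résumé.txt".encode("latin-1")
path = posixpath.join(b"/home/user", directory, filename)


def decodable(data, encoding):
    """Return True if data decodes cleanly in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Neither encoding used by the individual components decodes the whole path.
whole_path_as_utf8 = decodable(path, "utf-8")
whole_path_as_eucjp = decodable(path, "euc_jp")

# Path operations still work on bytes, and each component decodes on its own.
base = posixpath.basename(path)
```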

Vars files, inventory, playbooks, and other Ansible resources

This proposal makes no attempt to change the format of files and resources whose format is Ansible's to define. These are still always encoded in utf-8.

webknjaz commented 6 years ago

I like this proposal.

abadger commented 6 years ago

At today's meeting, this was approved.

pjmcquade commented 6 years ago

Thank you, we were one of the groups impacted by this at work. We had to use 'iconv' as a kludge to fix this. For the sake of others reading this thread:

- name: get the list of files in our playbook templates directory and subdirectories
  shell: |
    find templates -type f
  register: file_list

- debug: msg="file list {{ file_list.stdout_lines }}"

- name: obtaining file stat info on each file in our templates directory
  stat:
    path: "{{ item }}"
  register: f
  with_items: "{{ file_list.stdout_lines }}"

- debug:
    msg: "file {{ item.stat.path }} charset is {{ item.stat.charset }}"
  with_items: "{{ f.results }}"

# Once these files are converted, this code will never match again on files of
# type iso-8859-1 as the following converts them into UTF-8 in place.  You may
# want to back them up first....up to you.
# The following has only been tested on Red Hat.
- name: converting files in place to utf-8 format if they are of character set iso-8859-1
  shell: |
    iconv -f ISO-8859-1 -t UTF-8 {{ item.stat.path | quote }} -o {{ item.stat.path | quote }}
  register: iconv_status
  with_items: "{{ f.results }}"
  when: '"iso-8859-1" in item.stat.charset'
  failed_when: iconv_status.rc != 0

- name: now we look at our (hopefully) converted file list again
  stat:
    path: "{{ item }}"
  register: f
  with_items: "{{ file_list.stdout_lines }}"

- debug:
    msg: "file {{ item.stat.path }} charset is {{ item.stat.charset }}"
  with_items: "{{ f.results }}"

Hope this was somewhat useful.

abadger commented 5 years ago

Thanks to @birdypme, output_encoding has been implemented for the template module for 2.7: https://github.com/ansible/ansible/pull/42171

ansible / proposals

Encodings for file contents #121