Implement charset conversion on recording

Scribery / tlog

Terminal I/O logger

GNU General Public License v2.0

As users can use any one of a multitude of character encodings and JSON only supports UTF-8, tlog-rec needs to convert the received data accordingly.

Invoking "locale charmap" returns the name of the encoding in use, but running an external command might be too burdensome, so a lighter way may need to be found.
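One lighter option, offered here only as a sketch, is to ask the C library directly with the POSIX calls setlocale() and nl_langinfo(CODESET), which report the same codeset name without spawning a process. The helper name get_terminal_charset() is hypothetical and not part of tlog, and whether the process locale really matches the terminal encoding is an assumption.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/*
 * Hypothetical helper (not tlog code): copy the codeset name of the
 * current LC_CTYPE locale into buf.
 * Returns 0 on success, -1 if the name does not fit into buf.
 */
int get_terminal_charset(char *buf, size_t size)
{
    const char *codeset;

    assert(buf != NULL);

    /* Adopt LC_CTYPE from the environment (LC_ALL, LC_CTYPE, LANG).
     * If this fails, the previous locale (normally "C") stays active. */
    (void)setlocale(LC_CTYPE, "");

    /* Same information "locale charmap" prints, e.g. "UTF-8" */
    codeset = nl_langinfo(CODESET);
    if (strlen(codeset) >= size) {
        return -1;
    }
    strcpy(buf, codeset);
    return 0;
}
```

The caller would still need to decide what to do when the reported codeset is not one iconv knows about.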

We can use iconv for conversion, but we need a way to detect and extract invalid characters for separate, binary storage. Iconv seems to be able to stop at the first invalid byte, which can be used to implement that. Alternatively, we could use two iconv descriptors, one normal and another discarding invalid characters (with the "//IGNORE" encoding suffix), and somehow extract the invalid bytes by comparing their results. However, we need to check how far back all the required functionality goes and whether it's available in RHEL6.
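A minimal sketch of the "stop at the first invalid byte" variant follows, using a single descriptor rather than the two-descriptor "//IGNORE" idea. It relies on iconv() failing with EILSEQ and leaving the input pointer at the offending byte. The function name split_convert() and its buffer layout are hypothetical, and the behaviour on older systems such as RHEL6 is not verified here.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/*
 * Hypothetical helper (not tlog code): convert `in` from `charset` to
 * UTF-8 into `txt`, and copy every byte iconv rejects into `bin`.
 * An incomplete sequence at the end of input is left unconsumed; its
 * length is stored in *rem so the caller can retry with more data.
 * Returns 0 on success, -1 on unknown charset or a full output buffer.
 */
int split_convert(const char *charset,
                  const char *in, size_t in_len,
                  char *txt, size_t txt_size, size_t *txt_len,
                  char *bin, size_t bin_size, size_t *bin_len,
                  size_t *rem)
{
    iconv_t cd;
    char *ip = (char *)in;   /* iconv() takes a non-const input pointer */
    char *op = txt;
    size_t il = in_len;
    size_t ol = txt_size;
    int rc = 0;

    assert(in != NULL && txt != NULL && bin != NULL);

    *txt_len = 0;
    *bin_len = 0;
    *rem = 0;

    cd = iconv_open("UTF-8", charset);
    if (cd == (iconv_t)-1) {
        return -1;
    }

    while (il > 0) {
        if (iconv(cd, &ip, &il, &op, &ol) != (size_t)-1) {
            break;                      /* everything was converted */
        }
        if (errno == EILSEQ) {
            /* ip points at the invalid byte: store it separately */
            if (*bin_len >= bin_size) {
                rc = -1;
                break;
            }
            bin[(*bin_len)++] = *ip++;
            il--;
            /* Return the descriptor to its initial state */
            iconv(cd, NULL, NULL, NULL, NULL);
        } else if (errno == EINVAL) {
            break;                      /* incomplete trailing sequence */
        } else {
            rc = -1;                    /* E2BIG: txt is full */
            break;
        }
    }

    *txt_len = txt_size - ol;
    *rem = il;
    iconv_close(cd);
    return rc;
}
```

This sketch skips one byte per EILSEQ, which is the simplest policy; a real implementation would also need to remember where in the text each binary byte belonged, and to flush shift state for stateful target encodings.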

This problem is similar to #10 in that converting anything in the recorded data would make the recording inexact and would greatly complicate recovery of binary data.

OTOH, we need to have the text in UTF-8 so that Elasticsearch can index and search it. So, to have both, i.e. to have Elasticsearch index the data and to preserve the recorded data exactly, we should either limit the terminal charset to UTF-8, or log both the binary recording (e.g. in base64) and a cleaned-up, converted text recording.

Perhaps the latter is what we should do. The latter will also require recording the original terminal charset.
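To make the "record both" option concrete, here is a rough sketch of one message carrying the original charset, the converted text and the base64-encoded raw bytes. The field names "charset", "text" and "raw", the RAW_MAX limit and both function names are hypothetical and do not describe the tlog format; the text argument is assumed to be valid UTF-8 that has already been JSON-escaped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define RAW_MAX 48  /* hypothetical per-message limit on raw bytes */

static const char b64_tbl[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Standard base64; out must hold 4 * ((len + 2) / 3) + 1 bytes */
size_t base64_encode(const unsigned char *in, size_t len, char *out)
{
    size_t i, o = 0;

    /* Full 3-byte groups become 4 characters each */
    for (i = 0; i + 2 < len; i += 3) {
        out[o++] = b64_tbl[in[i] >> 2];
        out[o++] = b64_tbl[((in[i] & 0x03) << 4) | (in[i + 1] >> 4)];
        out[o++] = b64_tbl[((in[i + 1] & 0x0f) << 2) | (in[i + 2] >> 6)];
        out[o++] = b64_tbl[in[i + 2] & 0x3f];
    }
    /* One or two leftover bytes are padded with '=' */
    if (i < len) {
        out[o++] = b64_tbl[in[i] >> 2];
        if (i + 1 < len) {
            out[o++] = b64_tbl[((in[i] & 0x03) << 4) | (in[i + 1] >> 4)];
            out[o++] = b64_tbl[(in[i + 1] & 0x0f) << 2];
        } else {
            out[o++] = b64_tbl[(in[i] & 0x03) << 4];
            out[o++] = '=';
        }
        out[o++] = '=';
    }
    out[o] = '\0';
    return o;
}

/*
 * Hypothetical record builder: `text` must already be JSON-escaped UTF-8.
 * Returns 0 on success, -1 if raw is too long or out is too small.
 */
int build_record(const char *charset, const char *text,
                 const unsigned char *raw, size_t raw_len,
                 char *out, size_t out_size)
{
    char b64[4 * ((RAW_MAX + 2) / 3) + 1];
    int n;

    assert(charset != NULL && text != NULL && out != NULL);
    if (raw_len > RAW_MAX) {
        return -1;
    }
    base64_encode(raw, raw_len, b64);
    n = snprintf(out, out_size,
                 "{\"charset\":\"%s\",\"text\":\"%s\",\"raw\":\"%s\"}",
                 charset, text, b64);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

With a record like this, Elasticsearch would index the text field while playback could decode the raw field using the stored charset; the cost is that each message is stored roughly twice.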

In the meantime, we can stay with the existing format, limit the encoding to UTF-8, add a format version according to #15, and later implement recording both.

Implement charset conversion on recording #7