Thanks for your suggestion. Regarding your first point, the documentation currently has this explanation:
"By using csvquote, you temporarily replace the special characters inside quoted fields with harmless nonprinting characters that can be processed as data by regular text tools. At the end of processing the text, these nonprinting characters are restored to their previous values."
I didn't add an example of the intermediate sanitized state of the data, but I could do that in the future. The pipeline should always have a "| csvquote -u" at the end, so I wouldn't expect people to need to know about this intermediate state.
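For reference, the intended usage is to wrap the whole pipeline, something like this (the file name and column number below are just placeholders):

```
csvquote data.csv | cut -d, -f2 | sort | uniq -c | csvquote -u
```

The text tools in the middle only ever see the sanitized bytes; the trailing `csvquote -u` restores the original characters.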
Regarding your second note, the source code has this comment on line 14: "TODO: verify that it handles multi-byte characters and unicode and utf-8 etc". Are you seeing any problems working with UTF-8 data? If so, please share the data file and describe the expected behavior.
After reading only the README, I wasn't able to tell what the sanitized output would be for the given examples.
Could you document that sanitize mode converts field separators (commas by default) inside quoted fields into character 0x1F (US, Unit Separator), converts record separators (newlines by default) inside quoted fields into 0x1E (RS, Record Separator), and that restore mode converts them back? (A rough demo is sketched at the end of this comment.)
Note: If the original file already contains 0x1F or 0x1E characters, they'll be replaced with commas and newlines, respectively, when the file is converted back.
Note^2: The program operates on bytes, so some UTF-8 characters may get mangled.
Note^3: After reading the other (closed) issues, I see that others have asked the same question. This issue still stands, though: if it's documented, people will stop asking. :-)
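If it helps, here is a rough, untested sketch of the kind of round-trip demo that could go in the README (the sample data is made up):

```
printf '1,"a,b\nc",2\n' > sample.csv
csvquote sample.csv | od -c        # the embedded comma should show up as 037 (0x1F)
                                   # and the embedded newline as 036 (0x1E)
csvquote sample.csv | csvquote -u  # should print the original CSV unchanged
```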