dbro / csvquote

Enables common unix utilities like cut, awk, wc, head to work correctly with CSV data containing delimiters and newlines
MIT License

Escaped quotes quoted #12

Closed bmcgough closed 7 years ago

bmcgough commented 7 years ago

Not sure how best to solve this. We probably need to implement the escape somehow.

Test string (two lines) is:

```
1,2,3,"\"hello",4,5,6
7,8,9
```

The csvquote code today will translate and un-translate properly, but the interim string is not usable by standard unix tools.

What I think is happening is that csvquote isn't treating the backslash as an escape, so it interprets the escaped double quote as the end of the quoted string, and then the last double quote as the beginning of a new one. The output I get has the rest of the buffer with no spaces, commas, or even a newline (it looks like `1,2,3,"\"hello"456789`). The nonprinting characters must be in there, as csvquote -u restores the string to the original.
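A quick way to reproduce this and see the substituted bytes (a sketch; `cat -v` just makes nonprinting characters visible):

```sh
# The \" is read as close-quote then open-quote, so everything after the
# final " is treated as one unterminated quoted field, and its commas and
# newline get replaced with nonprinting bytes.
printf '1,2,3,"\\"hello",4,5,6\n7,8,9\n' | csvquote | cat -v
```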

CSV escaping (typically with backslash) is not standardized, but is common.

Using double quotes rather than a backslash escape seems to work.

I will see about making a patch. I'm not very good with C, but a colleague is and he may be interested in helping.

dbro commented 7 years ago

Hi Ben - thanks for the question. I'll try to explain what csvquote is doing with this input data.

The file is read byte-by-byte, and the first thing that csvquote notices as being interesting is the " character at position 7 in the first line, which tells csvquote that the following characters are all part of one quoted field. The backslash at position 8 passes through as ordinary field content, and then the " character at position 9 is interpreted as the end of the quoted field. So csvquote goes back to passing through all characters until it finds another " at position 15. This character is again interpreted as meaning that the following characters are all part of a single quoted field, which runs from position 16 until the end of the file. This field is never terminated with another " character, so csvquote just ends its output in the middle of what it thinks is a quoted field.

In summary, the input data is in an unexpected format that is not consistent with the (admittedly unofficial) CSV definition: https://tools.ietf.org/html/rfc4180

Perhaps the input format could be corrected to be as follows:

```
1,2,3,"""hello",4,5,6
7,8,9
```

The triple-doublequote means that the 4th field in the first line should be considered to have a value which includes one of the special delimiter characters, in this case the double quote itself: the 4th field value is `"hello`
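A quick way to see that this form behaves (a sketch, assuming csvquote is on your PATH):

```sh
# With RFC 4180 doubling, csvquote passes the field through untouched and
# awk splits on exactly the commas it should.
printf '1,2,3,"""hello",4,5,6\n7,8,9\n' | csvquote | awk -F, '{print $4}'
# first line of output: """hello"
```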

Do you have a real-world data set that has this kind of unterminated quoted string? What is the context of the test you created?

Dan

bmcgough commented 7 years ago

I do have a real-world dataset: hundreds of millions of filenames, in a format over which I have no control. It is a challenge, as I have learned over the course of this project that when dealing with POSIX file systems across clients (Linux, macOS, Windows), there is essentially only one disallowed character in a file name, and that is '/'. I even have filenames containing bytes that do not correspond to any character set - I have to use uconv/iconv to proxy those.

I have encountered backslash-escaped CSV more than once in the past, and the file system crawling tool I am using (https://github.com/fizwit/filesystem-reporting-tools) does just this. I had not found the IETF standard, but I know the owner of pwalk, so I will talk with him about conforming to it. I did find the Wikipedia page (https://en.wikipedia.org/wiki/Comma-separated_values), which talks about backslash escapes.

I submitted a pull request with what I think is a fix. The logic extends your character checking to treat a backslash (or any other specified character) as an escape for the double quote that follows it: both characters are simply passed through, and isQuoteInEffect is left unchanged. I believe that will work, since no character other than the double quote needs escaping - everything else is already protected by the quoting.

I will absolutely understand if you don't want to merge the commit, and I do have a longer-term solution by modifying pwalk. If you are interested, I implemented this change to csvquote and have run close to a billion filenames through it at this point with no further character issues. I am running both sort and awk after csvquote.
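The pipeline is roughly the usual csvquote sandwich (a sketch; the file name and the sort/awk fields here are hypothetical):

```sh
# Substitute the troublesome bytes, run the ordinary tools, then restore.
csvquote < filenames.csv | sort -t, -k1,1 | awk -F, '{ print $1 }' | csvquote -u
```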

Thanks for an extremely useful tool!

dbro commented 7 years ago

Thanks Ben. Sounds like an interesting data set with a lot of challenges. If you are able to assume that the backslashes always and only occur inside doublequoted fields, and always have a doublequote character after them, then you could replace all instances of backslash with doublequote and then the data set would be in the CSV format expected by the standard version of csvquote.

Here's a quick way to check that assumption. It's a bit rough, in that it may not work properly if there is more than one backslash per line. It might be better to use sed or awk to do a more thorough check - there's a sketch of one at the end of this comment.

```sh
grep '\\' inputfile | grep -c -v '\\"'
```

if it returns a count of zero, then you can proceed:

```sh
cat infile | tr '\\' '"' | csvquote [...]
```
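And the more thorough version of the check, handling multiple backslashes per line (a sketch; assumes GNU grep's -o):

```sh
# Emit each backslash plus its following character, then count the pairs
# that are not backslash-doublequote. Zero means the assumption holds.
# (A backslash at the very end of a line would still slip past this.)
grep -o '\\.' inputfile | grep -c -v '\\"'
```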

bmcgough commented 7 years ago

(Updated my first post as I neglected to put my code in code tags so it was mangled.)

The data I am using is produced by pwalk (linked above). It currently produces backslash-escaped quoted CSV fields. I just queried my database (see below) and found over 2k files with legitimate backslashes in their names. I also found a new challenge today: I am using iconv to skip illegal characters to keep the whole chain sane, and I have found more than a handful of files whose names become identical once the illegal/non-printing characters are removed. Fun.
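The iconv step looks roughly like this (a sketch; the -c flag discards characters that cannot be converted, and the file names are hypothetical):

```sh
# Strip byte sequences that are not valid UTF-8 before the rest of the chain.
iconv -c -f UTF-8 -t UTF-8 < names.raw > names.clean
```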

I am ultimately putting this file data into a PostgreSQL database, and found that the CSV format for the PostgreSQL COPY command can use either backslash or double quote as the escape character, so the backslash format may be more widespread than Sybase (according to Wikipedia). My path forward may be to modify pwalk to escape with a double quote and use that down the chain, since csvquote and PostgreSQL can both work that way.
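For what it's worth, loading that way looks roughly like this (a sketch; the table and file names are hypothetical, and in CSV mode COPY's ESCAPE defaults to the QUOTE character but can be set to backslash):

```sh
# Load backslash-escaped CSV into PostgreSQL via psql's \copy.
psql -c "\\copy files FROM 'pwalk-output.csv' WITH (FORMAT csv, ESCAPE '\\')"
```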

dbro commented 7 years ago

Hi Ben - I'm going to close this request now. Please re-open if you think it's necessary. Good luck with your data set.

Dan