cldf / csvw

CSV on the web
Apache License 2.0
37 stars 6 forks source link

Dialect.as_python_formatting_parameters() escapechar not in line with CSVW spec #7

Closed xflr6 closed 6 years ago

xflr6 commented 6 years ago

Currently, we set Python escapechar to \\ whenever CSVW quoteChar is set:

https://github.com/cldf/csvw/blob/f0869818eb8c34949809f1b24a7e1d30cd0adb51/csvw/dsv_dialects.py#L112

However, the spec says:

escape character The string that is used to escape the quote character within escaped cells, or null, set by the doubleQuote property of a dialect description. The default is " (such that "" is used to escape " within an escaped cell).

Python csv is in line with this (apart from what variable controls what) :

Dialect.escapechar A one-character string used by the writer to escape the delimiter if quoting is set to QUOTE_NONE and the quotechar if doublequote is False.

IIUC, QUOTE_NONE is not allowed by CSVW at all and Python escapechar should not be set with doublequote=True (i.e. Python doublequote=True, escapechar=None corresponds to CSVW doubleQuote=True, which sets escape character to ").

In other words, the line above would need to be changed to:

'escapechar': self.escape_character if not self.doubleQuote else None,

In this case the following test cases would be ill-formed:

https://github.com/cldf/csvw/blob/f0869818eb8c34949809f1b24a7e1d30cd0adb51/tests/test_metadata.py#L57

https://github.com/cldf/csvw/blob/f0869818eb8c34949809f1b24a7e1d30cd0adb51/tests/test_dsv.py#L147

Side note: AFAIU, neither Python nor CSVW escape the delimiter, so even with doubleQuote=False, the writer would not produce y\\,x (escaped delimiter) but rather "y,x" (quoted field) for cells containing the delimiter, although the former is parsed correctly by Python csv:

In [1]: import io, csv
    ...: f = io.StringIO('spam\\,eggs,spam', newline='')
    ...: reader = csv.reader(f, doublequote=False, escapechar='\\')
    ...: next(reader)
Out[1]: ['spam,eggs', 'spam']
xflr6 commented 6 years ago

https://www.w3.org/TR/tabular-data-model/#parsing

Otherwise, if the string starts with the escape character and the escape character is not the same as the quote character, append the character following the escape character to the current cell value and move on to process the string following that character.

xrotwang commented 6 years ago

But CSVW does allow for a quote character of null. Wouldn't this translate to quoting=csv.QUOTE_NONE?

Apart from this, I tend to agree with your proposed change; but I vaguely remember changing the behaviour based on a particular real-life case I encountered.

xrotwang commented 6 years ago

Even so, I'd say we change the code to be more in line with the spec (and python's docunented behaviour), and wait for any problems to re-surface.