CS3099JH2017 / cs3099jh

CS3099 Junior Honours Project Protocol and Discussion Central Repo

CSV Format Specification #12

Closed db213 closed 7 years ago

db213 commented 7 years ago

CSV does not have a formal definition, but it's important that BE and ML agree on a specific, formal format.

My proposal is to use the CSV specification given in RFC 4180 (yes, I'm ripping this from Wikipedia) with a few amendments:

A CSV is a plain text file using the UTF-8 character set that:

  1. consists of one record per line,
  2. with the records divided into fields separated by a comma,
  3. where every record has the same number and sequence of fields.

Additionally, the first row of the CSV must give the names/labels of each field. If a comma is part of a field, then the field must be surrounded by double quotes ("). If a double quote is part of a field, then it must be escaped with a backslash (\).
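As a rough illustration of how ML might read this format, here is a minimal sketch using Python's standard `csv` module. The function name `read_records` and the sample data are made up for this example. It assumes the backslash escape maps onto the module's `escapechar` option with `doublequote` turned off, which differs from RFC 4180's doubled-quote convention.

```python
import csv
import io


def read_records(text):
    """Parse CSV text in the proposed format (hypothetical helper).

    Returns the header row and the list of remaining records.
    """
    reader = csv.reader(
        io.StringIO(text),
        delimiter=",",
        quotechar='"',
        escapechar="\\",    # a double quote inside a field is written as \"
        doublequote=False,  # do not treat "" as an escaped quote
        strict=True,        # raise csv.Error on bad CSV input
    )
    rows = list(reader)
    if not rows:
        raise ValueError("empty CSV: the first row must name the fields")
    return rows[0], rows[1:]


# Sample data invented for illustration only.
sample = 'name,comment\nalice,"hello, world"\nbob,"she said \\"hi\\""\n'
header, records = read_records(sample)
print(header)
print(records)
```

The same two options (`escapechar` and `doublequote`) are also accepted by `pandas.read_csv`, so a pandas-based reader could presumably be configured the same way.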

I thought this was reasonable as it's easy to parse with any CSV processing package (e.g. pandas). If this is accepted, this specification should probably be added to the ML and BE specifications: ML should expect to deal with CSVs of this format, and BE should only pass CSVs of this format to ML. If any other format is passed to ML, it is valid behaviour to throw an error.
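The "throw an error" behaviour could look something like the sketch below. It is only one possible reading of the proposal: the name `validate_csv` and the choice of `ValueError` are assumptions for illustration, not part of either specification. It checks that a header row exists and that every record has the same number of fields as the header.

```python
import csv
import io


def validate_csv(text):
    """Raise ValueError if the text breaks the proposed rules (hypothetical helper).

    Returns the header row when the input looks valid.
    """
    reader = csv.reader(
        io.StringIO(text),
        escapechar="\\",
        doublequote=False,
        strict=True,  # bad CSV input raises csv.Error
    )
    header = next(reader, None)
    if header is None:
        raise ValueError("empty CSV: missing header row")
    # Records are numbered from 2 because the header is record 1.
    for record_no, record in enumerate(reader, start=2):
        if len(record) != len(header):
            raise ValueError(
                f"record {record_no} has {len(record)} fields, "
                f"expected {len(header)}"
            )
    return header
```

A caller on the ML side might wrap this in a `try`/`except` and reject the upload, while BE could run the same check before passing a file on.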

magnostherobot commented 7 years ago

In your proposal:

> If a double quote is part of a field, then it must be escaped with a backslash (\).

Does this include unquoted fields? In other words, should it be:

testing, "1, \"quoted\"", help "me"

or:

testing, "1, \"quoted\"", help \"me\"

db213 commented 7 years ago

Yes, this includes unquoted fields, so the second example is correct.
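For the BE side, a minimal sketch of writing a record in this style with Python's standard `csv` module follows. The helper name `write_record` is hypothetical. It assumes `doublequote=False` together with an `escapechar`, which makes the writer backslash-escape double quotes rather than doubling them; fields are only wrapped in quotes when they contain a comma or line break.

```python
import csv
import io


def write_record(fields):
    """Serialise one record in the proposed format (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        quotechar='"',
        escapechar="\\",            # emit \" for a double quote in a field
        doublequote=False,          # never emit "" for a double quote
        quoting=csv.QUOTE_MINIMAL,  # quote a field only when it needs it
        lineterminator="\n",
    )
    writer.writerow(fields)
    return buffer.getvalue()


# The three fields from the example above, without the spaces after the commas.
line = write_record(["testing", '1, "quoted"', 'help "me"'])
print(line, end="")
```

Note that the writer does not put a space after each comma. A reader that has to accept the spaced form shown in the examples would need something like the `csv` module's `skipinitialspace=True` option.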