CSV with tab-chars in a cell: Whole row is parsed as single cell

SheetJS / sheetjs

📗 SheetJS Spreadsheet Data Toolkit -- New home https://git.sheetjs.com/SheetJS/sheetjs

https://sheetjs.com/

Apache License 2.0

35.13k stars 8k forks source link

CSV with tab-chars in a cell: Whole row is parsed as single cell #732

Closed nknapp closed 7 years ago

nknapp commented 7 years ago

When parsing a CSV-file that contains a tab-character (in this case within a cell), the whole row is used as value for the first cell in the row, even though comas and quotes exist.

I have attached the example file ('test.csv') as test.csv.txt.

In order to reproduce this, please rename the file to 'test.csv' and then run the following snippet

var xlsx = require('xlsx')
console.log(JSON.stringify(xlsx.readFile('test.csv'),0,2))

In my case (with version 0.10.8) the (incorrect) output is

{
  "SheetNames": [
    "Sheet1"
  ],
  "Sheets": {
    "Sheet1": {
      "A1": {
        "t": "s",
        "v": "cellA1\",\"celA2"
      },
      "A2": {
        "t": "s",
        "v": "value with\ttab\",\"value without tab"
      },
      "!ref": "A1:A2"
    }
  }
}

We have a (rather hacky) workaround for this:

Call read once
Check if it is CSV (only Sheets and SheetNames appear as keys in the result)
Check if the data contains a tab character
Remove the tabs and call read again

This works short-term, but it is pretty ugly. I just wanted to say that there is no hard pressure since we have a temporary solution to the problem.

SlyW commented 7 years ago

Looking at the code, it seems the dsv_to_sheet_str function does a check for determining the field separator - and it checks for the existence of any tab (\t) character before checking for a comma (,). As such, it finds your embedded tab and assumes a tab-delimited file.

A quick search for other methods for determining the delimiter seem to settle on interrogating the first X lines of the file and using which ever delimiter occurs the most frequently across all lines. In your file's case, it would be a comma (,).

I will see about putting together a prototype solution and post it as a pull request.

Edit - Just noticed the following on the main github page:

Format	Test
HTML	starts with <html
XML	starts with <
DSV	starts with /sep=.$/, separator is the specified character
TSV	one of the first 1024 characters is a tab char "\t"
CSV	one of the first 1024 characters is a comma char ","
PRN	(default)

Guess this is "by design" behavior. Still, willing to give it a shot to build a "delimiter determiner" routine.

SheetJSDev commented 7 years ago

@SlyW @nknapp the issue was correctly diagnosed -- the current approach scans the leading characters and assumes it is tab-separated if the tab character appears in the first 1024 chars. That probably should be changed.

@SlyW if you are willing to explore this further, the "delimiter determiner" routine would need to disregard characters within quotes:

a,b\tc",,,,,\t\t\t\t\t"\td,e
      |-- disregard --|

Note: If you can control the origin, the best approach is to force the delimiter by prepending with a sep=, header.

SlyW commented 7 years ago

@SheetJSDev - concur that sep= at the top of the file is the best method. Unfortunately, complete control over the file's origin is elusive at best.

I have a terribly naive approach which I will adjust to account for the quoted delimiters. Is it safe to (naively) split the incoming file in \n\r for the purpose of identifying "rows" of data to scan? Technically, a delimited file could support embedded CRLF values (quoted of course) ... however, it seems there are other places where the naive row splitting approach was used.

SheetJSDev commented 7 years ago

DSV is read character-by-character -- there's a boolean instr which keeps track of whether the current character is in a quoted string.

Other formats use the naive approach because its simpler:

DIF doesn't support newline characters (it dates back to visicalc and the format never envisioned the freedoms of modern utilities).
SYLK explicitly uses newline to separate records, encoding the newline character in a specific way.
PRN is ambiguous (there is not enough information in the file to determine whether a cell stack is really a multi-line cell)

SlyW commented 7 years ago

Thank you @SheetJSDev. It seems, then, it would be prudent to spin through the whole of the file twice. The first time to ascertain the delimiters, the second to actually process the file. I can still use a limiting row count to only scan the first "n" rows...

SheetJSDev commented 7 years ago

We changed the resolution in lines with this discussion. Based on the first 1024 bytes, the parser counts the number of unquoted commas and tabs and assumes the more-frequent character is the delimiter.

nknapp commented 7 years ago

Thanks for the fix. I'll try it out next week. By "prepending sep=," you mean just literally prepending it to the to the file contents? This might be feasible for us as well, but I have to check.

SheetJSDev commented 7 years ago

It's an Excel feature not covered in RFC4180. The hint tells Excel to use the specified delimiter, and Excel knows to skip that first "row". It even works with other delimiters! Here's an example with a pipe character:

sep=|
1|2|3
4|5|6

If you have control over the origin of the data, the sep metadata is really useful when dealing with international users.