Closed nknapp closed 7 years ago
Looking at the code, it seems the `dsv_to_sheet_str` function determines the field separator by checking for the existence of any tab (`\t`) character before checking for a comma (`,`). As such, it finds your embedded tab and assumes a tab-delimited file.
A quick search for other methods of determining the delimiter seems to settle on interrogating the first X lines of the file and using whichever delimiter occurs most frequently across all lines. In your file's case, that would be a comma (`,`).
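A minimal sketch of that frequency-based approach (the function name and the candidate list are mine for illustration, not from the xlsx codebase):

```javascript
// Naive delimiter determiner: scan the first N lines and pick the
// candidate delimiter that occurs most often in total.
// (Hypothetical helper -- not part of the xlsx library.)
function guessDelimiter(text, maxLines = 10) {
  const candidates = [",", "\t", ";", "|"];
  const lines = text.split(/\r\n|\n|\r/).slice(0, maxLines);
  const counts = {};
  for (const cand of candidates) counts[cand] = 0;
  for (const line of lines) {
    for (const cand of candidates) {
      counts[cand] += line.split(cand).length - 1;
    }
  }
  // Pick the most frequent candidate; ties fall back to comma.
  return candidates.reduce(
    (best, cand) => (counts[cand] > counts[best] ? cand : best), ",");
}

// One embedded tab, but commas dominate, so comma wins.
console.log(guessDelimiter("a,b\tc,d\ne,f,g\n"));  // ","
```

Note this naive version still counts delimiters inside quoted cells, which is addressed further down the thread.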
I will see about putting together a prototype solution and post it as a pull request.
Edit - Just noticed the following on the main github page:
| Format | Test |
|---|---|
| HTML | starts with `<html` |
| XML | starts with `<` |
| DSV | starts with `/sep=.$/`, separator is the specified character |
| TSV | one of the first 1024 characters is a tab char `"\t"` |
| CSV | one of the first 1024 characters is a comma char `","` |
| PRN | (default) |
Guess this is "by design" behavior. Still, willing to give it a shot to build a "delimiter determiner" routine.
@SlyW @nknapp the issue was correctly diagnosed -- the current approach scans the leading characters and assumes it is tab-separated if the tab character appears in the first 1024 chars. That probably should be changed.
@SlyW if you are willing to explore this further, the "delimiter determiner" routine would need to disregard characters within quotes:
```
a,b\tc",,,,,\t\t\t\t\t"\td,e
      |----- disregard -----|
```
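One way to honor that, sketched with hypothetical names: toggle an in-quote flag while scanning and only count delimiters seen outside quotes. Applied to the example above (with real tab characters), it counts 2 unquoted commas and 2 unquoted tabs instead of 7 of each:

```javascript
// Count candidate delimiter characters, disregarding anything inside
// double quotes. Illustrative sketch, not the library's implementation.
function countUnquoted(text, candidates) {
  const counts = {};
  for (const cand of candidates) counts[cand] = 0;
  let inQuotes = false;
  for (const ch of text) {
    if (ch === '"') inQuotes = !inQuotes;           // toggle quote state
    else if (!inQuotes && ch in counts) counts[ch]++;
  }
  return counts;
}

// Raw text has 7 commas and 7 tabs, but only 2 of each are unquoted.
const sample = "a,b\tc\",,,,,\t\t\t\t\t\"\td,e";
console.log(countUnquoted(sample, [",", "\t"]));
```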
Note: if you can control the origin, the best approach is to force the delimiter by prepending a `sep=,` header line.
@SheetJSDev - concur that `sep=` at the top of the file is the best method. Unfortunately, complete control over the file's origin is elusive at best.
I have a terribly naive approach which I will adjust to account for the quoted delimiters. Is it safe to (naively) split the incoming file on `\r\n` for the purpose of identifying "rows" of data to scan? Technically, a delimited file could contain embedded CRLF values (quoted, of course) ... however, it seems there are other places where the naive row-splitting approach is used.
DSV is read character-by-character -- there's a boolean `instr` which keeps track of whether the current character is in a quoted string. Other formats use the naive approach because it's simpler.
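A stripped-down sketch of that character-by-character pattern, using the same `instr` idea (the real xlsx parser does more, e.g. type coercion; the function name is mine):

```javascript
// Split one row into cells using an "in string" flag, in the spirit of
// the DSV parser's instr boolean. Illustrative only.
function splitRow(row, delim) {
  const cells = [];
  let cell = "";
  let instr = false;                       // inside a quoted string?
  for (let i = 0; i < row.length; ++i) {
    const ch = row[i];
    if (ch === '"') {
      if (instr && row[i + 1] === '"') { cell += '"'; ++i; }  // escaped ""
      else instr = !instr;
    } else if (ch === delim && !instr) {
      cells.push(cell); cell = "";         // unquoted delimiter: end cell
    } else cell += ch;
  }
  cells.push(cell);
  return cells;
}

console.log(splitRow('a,"b,c",d', ","));   // [ 'a', 'b,c', 'd' ]
```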
Thank you @SheetJSDev. It seems, then, it would be prudent to spin through the whole file twice: the first pass to ascertain the delimiter, the second to actually process the file. I can still use a limiting row count to only scan the first "n" rows...
We changed the resolution in line with this discussion: based on the first 1024 bytes, the parser counts the number of unquoted commas and tabs and assumes the more frequent character is the delimiter.
Thanks for the fix. I'll try it out next week. By "prepending `sep=,`" you mean just literally prepending it to the file contents?
This might be feasible for us as well, but I have to check.
It's an Excel feature not covered in RFC4180. The hint tells Excel to use the specified delimiter, and Excel knows to skip that first "row". It even works with other delimiters! Here's an example with a pipe character:
```
sep=|
1|2|3
4|5|6
```
If you have control over the origin of the data, the sep metadata is really useful when dealing with international users.
When parsing a CSV file that contains a tab character (in this case within a cell), the whole row is used as the value for the first cell in the row, even though commas and quotes exist.
I have attached the example file ('test.csv') as test.csv.txt.
In order to reproduce this, please rename the file to 'test.csv' and then run the following snippet
In my case (with version `0.10.8`) the (incorrect) output is:

We have a (rather hacky) workaround for this:

- `read` once (`Sheets` and `SheetNames` appear as keys in the result)
- `read` again

This works short-term, but it is pretty ugly. I just wanted to say that there is no hard pressure since we have a temporary solution to the problem.