Open smhk opened 1 month ago
Tackling the common example first :-
I created a new test with the following code
let test = "a,b,c,d,e,f,g;a,b,c,d,e,f,g;a,b,c,d,e,f,g;a,b,c,d,e,f,g;a,b,c,d,e,f,g"
let parsed = TinyCSV.Coder().decode(text: test, delimiter: ";")
Swift.print(parsed.records)
This prints out
[["a,b,c,d,e,f,g", "a,b,c,d,e,f,g", "a,b,c,d,e,f,g", "a,b,c,d,e,f,g", "a,b,c,d,e,f,g"]]
which seems to be working as expected? A single row of 5 cells?
@smhk Okay. I think this is happening because the string you've provided is not valid CSV:
"ABC",123,123,"","ABC","",""
With ; as the delimiter, this string represents the contents of a single cell. Its first character is ", which marks the cell as a quoted cell containing ABC. Hence, the cell content is ABC as per the CSV spec.
The CSV spec (as it stands) is very ambiguous, and unfortunately this means different tools generate CSV that doesn't really match. The code takes the stance that everything outside of a quoted cell is discarded, since the spec provides no information as to how to handle this.
In your example, maybe you could split the text on the ; character first using Swift's string routines, then use the CSV parser on the comma-separated sub-cells?
For example :-
let chunks = test.split(separator: ";")
chunks.forEach { cellText in
    let parsed = TinyCSV.Coder().decode(text: String(cellText), delimiter: .comma)
    Swift.print(parsed.records)
}
This produces :-
[["ABC", "123", "123", "", "ABC", "", ""]]
What do you think?
(I've added tests to test this scenario)
Of course! You're brilliant !! I didn't think about it.
between semicolons (or any divider), this is no good:
"ABC;;;",123,123,"","ABC;;;","","" ; "ABC",123,123,"","ABC","","" ; "ABC",123,123,"","ABC","",""
it would have to be
" "ABC;;;",123,123,"","ABC;;;","","" " ; and so on
ie the overall substring would have to be quoted
(You know, the simple workaround you mention of splitting manually on ; is no good, as it would totally miss parsing quoted strings etc.)
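To make the objection concrete, here is a minimal sketch in plain Swift of why a naive split on ; breaks quoted cells, and what a quote-aware split would need to do. The helper `splitOutsideQuotes` is hypothetical (it is not part of TinyCSV), and it assumes quotes are only ever escaped by doubling.

```swift
// Hypothetical helper (not a TinyCSV API): split `text` on `delimiter`,
// ignoring delimiters that appear inside double-quoted runs.
func splitOutsideQuotes(_ text: String, on delimiter: Character) -> [String] {
    var pieces: [String] = []
    var current = ""
    var insideQuotes = false
    for ch in text {
        if ch == "\"" {
            // A doubled quote ("") toggles twice, leaving the state unchanged.
            insideQuotes.toggle()
            current.append(ch)
        } else if ch == delimiter && !insideQuotes {
            pieces.append(current)
            current = ""
        } else {
            current.append(ch)
        }
    }
    pieces.append(current)
    return pieces
}

let line = #"1,"ABC;;;",2;3,"x",4"#

// The naive split also cuts at the semicolons inside the quoted cell.
let naive = line.split(separator: ";", omittingEmptySubsequences: false).map(String.init)

// The quote-aware split cuts only at the one semicolon outside quotes.
let aware = splitOutsideQuotes(line, on: ";")

print(naive.count)  // 5
print(aware.count)  // 2
```

Each piece returned by the quote-aware split still carries its quotes, so it could then be handed to a CSV parser with a comma delimiter.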
Brilliant! Great thinking!
Yeah, that format is beyond the scope of the CSV spec, unfortunately.
it would have to be
" "ABC;;;",123,123,"","ABC;;;","","" " ; and so on
ie the overall substring would have to be quoted
It's not that easy unfortunately. The first character is a quote, and then 2 characters later is another quote, which would close the quoted field. You'd have to CSV quote each embedded quote which would be painful to be sure.
like " ""ABC;;;"",123,123,"""",""ABC;;;"","""","""" "
But even then, the spec is completely unclear about handling some of these odd encodings
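As a rough illustration of the quoting described above, the following sketch wraps a string in double quotes and doubles every embedded quote. `csvQuoted` is a hypothetical helper written for this example, not a TinyCSV function.

```swift
// Hypothetical helper (not a TinyCSV API): wrap a field in double quotes
// and double every embedded quote so the field survives as one cell.
func csvQuoted(_ field: String) -> String {
    var result = "\""
    for ch in field {
        if ch == "\"" {
            result.append("\"\"")   // embedded quote becomes two quotes
        } else {
            result.append(ch)
        }
    }
    result.append("\"")
    return result
}

// A sub-row that itself contains quoted cells and semicolons.
let inner = "\"ABC;;;\",123,\"\""
let quotedInner = csvQuoted(inner)
print(quotedInner)
```

The doubled quotes are what make the result hard to read and write by hand, which is the pain point mentioned above.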
@smhk Actually, if you were to escape both the embedded quotes and the semicolons (;), it produces results somewhat like what you're wanting?
let text = #" \"ABC\;\;\;\",123,123,\"\",\"ABC\;\;\;\",\"\",\"\" ; \"ABC\",123,123,\"\",\"ABC\",\"\",\"\" ; \"ABC\",123,123,\"\",\"ABC\",\"\",\"\" "#
let parsed = TinyCSV.Coder().decode(text: text, delimiter: .semicolon, fieldEscapeCharacter: "\\")
produces
[["\"ABC;;;\",123,123,\"\",\"ABC;;;\",\"\",\"\" ", "\"ABC\",123,123,\"\",\"ABC\",\"\",\"\" ", "\"ABC\",123,123,\"\",\"ABC\",\"\",\"\" "]]
Which is somewhat what you're expecting maybe?
Then, you can parse the first cell like
let parsed2 = TinyCSV.Coder().decode(text: parsed.records[0][0], fieldEscapeCharacter: "\\")
Swift.print(parsed2.records)
which produces
[["ABC;;;", "123", "123", "", "ABC;;;", "", ""]]
Indeed, great point !
@dagronf - my God. I just realized that so long as the subgroups are perfectly escaped, you can just use literal csv (with a comma) for such nested issues!
Hence,
" A ", 42, "dog", " B ", 666, "John"
The long strings A and B could in fact be csv strings, so long as they are perfectly escaped
I just never thought of that ...
There is no need (and no difference) using a different divider (say, semicolon) as you nest upwards.
Astonishing!
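The nesting idea can be sketched as a round trip in plain Swift. `encodeRow` and `decodeRow` below are hypothetical helpers written for illustration (they are not TinyCSV APIs), and they assume a single line, comma delimiters, and quote-doubling as the only escape.

```swift
// Hypothetical helpers for illustration; none of these are TinyCSV APIs.

// Quote every field, doubling embedded quotes, then join with commas.
func encodeRow(_ fields: [String]) -> String {
    var cells: [String] = []
    for field in fields {
        var cell = "\""
        for ch in field {
            if ch == "\"" { cell.append("\"\"") } else { cell.append(ch) }
        }
        cell.append("\"")
        cells.append(cell)
    }
    return cells.joined(separator: ",")
}

// Parse one comma-separated line whose fields may be quoted.
func decodeRow(_ line: String) -> [String] {
    var fields: [String] = []
    var current = ""
    var insideQuotes = false
    var index = line.startIndex
    while index < line.endIndex {
        let ch = line[index]
        let next = line.index(after: index)
        if insideQuotes {
            if ch == "\"" {
                if next < line.endIndex && line[next] == "\"" {
                    current.append("\"")          // doubled quote: literal quote
                    index = line.index(after: next)
                    continue
                }
                insideQuotes = false              // closing quote
            } else {
                current.append(ch)
            }
        } else if ch == "\"" {
            insideQuotes = true                   // opening quote
        } else if ch == "," {
            fields.append(current)
            current = ""
        } else {
            current.append(ch)
        }
        index = next
    }
    fields.append(current)
    return fields
}

// The inner row becomes one escaped cell of the outer row, same delimiter.
let inner = encodeRow(["ABC,;", "123", ""])
let outer = encodeRow([inner, "42", "dog"])

let outerFields = decodeRow(outer)
let innerFields = decodeRow(outerFields[0])
print(innerFields)  // ["ABC,;", "123", ""]
```

Because each level is fully escaped before it is embedded, the same comma delimiter works at every depth; the cost is that quotes double again at each level of nesting.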
result
result should just be the whole input string, since it has no splits.
A common type of single line csv is ...
a,b,c,d,e,f,g;a,b,c,d,e,f,g;a,b,c,d,e,f,g;a,b,c,d,e,f,g;a,b,c,d,e,f,g
One splits on ; (perhaps later splitting on ,).
Splitting the above on ; should result in five of the "a,b,c,d,e,f,g"
In any event, splitting on semicolon is not working. It sometimes fails and returns a count of 1, and in any case it gives only the first string or item, seemingly cut at a comma break.
I could not immediately find the problem !