dagronf / TinyCSV

A tiny Swift CSV decoder/encoder library, conforming to RFC 4180
MIT License

Found a bug! #3

Open smhk opened 1 month ago

smhk commented 1 month ago
let __csvParser = TinyCSV.Coder()

let test = "\"ABC\",123,123,\"\",\"ABC\",\"\",\"\""

let t3 = __csvParser.decode(text: test, delimiter: ";").records.first ?? []
print("t3", t3)
print("t3[0]", t3[0])

Result:

t3 ["ABC"]
t3[0] ABC

The result should just be the whole input string, since it contains no semicolons to split on.

A common type of single-line CSV is ...

a,b,c,d,e,f,g;a,b,c,d,e,f,g;a,b,c,d,e,f,g;a,b,c,d,e,f,g;a,b,c,d,e,f,g

One splits on ; (perhaps later splitting on ,).

The above, split on ;, should result in five of the "a,b,c,d,e,f,g".

In any event, splitting on semicolon is not working. It sometimes fails and returns a count of 1, and in any case it gives only the first string or item, seemingly cut at a comma break.

I could not immediately find the problem!

dagronf commented 1 month ago

Tackling the common example first :-

I created a new test with the following code

let test = "a,b,c,d,e,f,g;a,b,c,d,e,f,g;a,b,c,d,e,f,g;a,b,c,d,e,f,g;a,b,c,d,e,f,g"
let parsed = TinyCSV.Coder().decode(text: test, delimiter: ";")
Swift.print(parsed.records)

This prints out

[["a,b,c,d,e,f,g", "a,b,c,d,e,f,g", "a,b,c,d,e,f,g", "a,b,c,d,e,f,g", "a,b,c,d,e,f,g"]]

which seems to be working as expected? A single row of 5 cells?

dagronf commented 1 month ago

@smhk Okay. I think this is happening because the string you've provided is not valid CSV

"ABC",123,123,"","ABC","",""

Hence, the cell content is ABC as per the CSV spec.

The CSV spec (as it stands) is very ambiguous, and unfortunately because of this different tools generate CSV that doesn't really match. The code takes the stance that everything outside of a quoted cell is discarded; the spec provides no information on how to handle this.

In your example, maybe you could split the text on the ; character first using Swift's string routines then use the CSV parser with the comma separated sub-cells?

For example :-

let chunks = test.split(separator: ";")
chunks.forEach { cellText in
   let parsed = TinyCSV.Coder().decode(text: String(cellText), delimiter: .comma)
   Swift.print(parsed.records)
}

This produces :-

[["ABC", "123", "123", "", "ABC", "", ""]]

What do you think?

(I've added tests to test this scenario)
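
To make the "discarded" behaviour concrete, here is a minimal sketch that does not use TinyCSV at all. `splitRecord` is a hypothetical helper written for illustration, not the library's implementation; it only assumes the policy described above (once a quoted cell closes, everything up to the next delimiter is dropped).

```swift
// Hypothetical helper, NOT part of TinyCSV: splits one record on `delimiter`.
// Assumption: text between a closing quote and the next delimiter is dropped.
func splitRecord(_ line: String, delimiter: Character) -> [String] {
    let chars = Array(line)
    var fields: [String] = []
    var current = ""
    var inQuotes = false      // currently inside a quoted cell
    var afterQuoted = false   // a quoted cell has closed; skip to next delimiter
    var i = 0
    while i < chars.count {
        let c = chars[i]
        if inQuotes {
            if c == "\"" {
                if i + 1 < chars.count, chars[i + 1] == "\"" {
                    current.append("\"")   // doubled quote is one literal quote
                    i += 1
                } else {
                    inQuotes = false
                    afterQuoted = true
                }
            } else {
                current.append(c)
            }
        } else if c == delimiter {
            fields.append(current)
            current = ""
            afterQuoted = false
        } else if afterQuoted {
            // Text after a closing quote is dropped until the next delimiter.
        } else if c == "\"" && current.isEmpty {
            inQuotes = true
        } else {
            current.append(c)
        }
        i += 1
    }
    fields.append(current)
    return fields
}

let sample = "\"ABC\",123,123,\"\",\"ABC\",\"\",\"\""
print(splitRecord(sample, delimiter: ";"))   // ["ABC"]
print(splitRecord(sample, delimiter: ","))   // ["ABC", "123", "123", "", "ABC", "", ""]
```

With ; as the delimiter the record opens with a quoted cell, so everything after the closing quote is skipped and only ABC survives, which matches what was reported.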

smhk commented 1 month ago

Of course! You're brilliant !! I didn't think about it.

Between semicolons (or any divider), this is no good:

"ABC;;;",123,123,"","ABC;;;","","" ; "ABC",123,123,"","ABC","","" ; "ABC",123,123,"","ABC","",""

it would have to be

" "ABC;;;",123,123,"","ABC;;;","","" " ; and so on

ie the overall substring would have to be quoted

(You know, the simple workaround you mention of splitting manually on ; is no good, as it would totally miss parsing quoted strings etc.)

Brilliant! Great thinking!
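
On the point that a plain split on ; would miss quoted strings, a rough sketch of the difference, assuming only RFC 4180 style quoting (doubled quotes, no backslash escapes). `splitOutsideQuotes` is a hypothetical helper, not something TinyCSV provides.

```swift
// Hypothetical helper, NOT part of TinyCSV: splits on `separator` only when
// the separator sits outside a double-quoted run. Chunks are returned
// verbatim (quotes kept) so a CSV parser can decode each one afterwards.
func splitOutsideQuotes(_ text: String, separator: Character) -> [String] {
    var chunks: [String] = []
    var current = ""
    var inQuotes = false
    for c in text {
        if c == "\"" {
            // A doubled quote toggles twice, so the state is unchanged after it.
            inQuotes.toggle()
            current.append(c)
        } else if c == separator && !inQuotes {
            chunks.append(current)
            current = ""
        } else {
            current.append(c)
        }
    }
    chunks.append(current)
    return chunks
}

let nested = "\"ABC;;;\",123;\"ABC\",456"
print(nested.split(separator: ";").count)                 // 3: the quoted cell is torn apart
print(splitOutsideQuotes(nested, separator: ";").count)   // 2
```

The quote-aware version keeps "ABC;;;" intact, whereas the plain split cuts through it (and silently drops the empty pieces between consecutive semicolons).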

dagronf commented 1 month ago

Yeah, that format is beyond the scope of the CSV spec unfortunately.

> it would have to be
>
> " "ABC;;;",123,123,"","ABC;;;","","" " ; and so on
>
> ie the overall substring would have to be quoted

It's not that easy unfortunately. The first character is a quote, and then 2 characters later is another quote, which would close the quoted field. You'd have to CSV-quote (that is, double) each embedded quote, which would be painful to be sure.

like " ""ABC;;;"",123,123,"""",""ABC;;;"","""","""" "

But even then, the spec is completely unclear about handling some of these odd encodings
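
The doubling does not have to be done by hand. A minimal sketch, assuming plain RFC 4180 quoting; `csvQuoted` is a hypothetical helper, not a TinyCSV function:

```swift
// Hypothetical helper, NOT part of TinyCSV: wraps a string in double quotes
// and doubles every embedded quote, as RFC 4180 quoting requires.
func csvQuoted(_ field: String) -> String {
    var out = "\""
    for c in field {
        if c == "\"" {
            out += "\"\""   // one embedded quote becomes two
        } else {
            out.append(c)
        }
    }
    out += "\""
    return out
}

// The group from above, written as a raw string so the quotes stay readable.
let group = #" "ABC;;;",123,123,"","ABC;;;","","" "#
print(csvQuoted(group))
```

For this input it prints the same string as the hand-quoted one above.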

dagronf commented 1 month ago

@smhk Actually, if you were to escape both the embedded quotes and semicolons ; it produces results somewhat like what you're wanting?

let text = #" \"ABC\;\;\;\",123,123,\"\",\"ABC\;\;\;\",\"\",\"\" ; \"ABC\",123,123,\"\",\"ABC\",\"\",\"\" ; \"ABC\",123,123,\"\",\"ABC\",\"\",\"\" "#
let parsed = TinyCSV.Coder().decode(text: text, delimiter: .semicolon, fieldEscapeCharacter: "\\")

produces

[["\"ABC;;;\",123,123,\"\",\"ABC;;;\",\"\",\"\" ", "\"ABC\",123,123,\"\",\"ABC\",\"\",\"\" ", "\"ABC\",123,123,\"\",\"ABC\",\"\",\"\" "]]

Which is somewhat what you're expecting maybe?

Then, you can parse the first cell like

let parsed2 = TinyCSV.Coder().decode(text: parsed.records[0][0], fieldEscapeCharacter: "\\")
Swift.print(parsed2.records)

which produces

[["ABC;;;", "123", "123", "", "ABC;;;", "", ""]]

smhk commented 1 month ago

Indeed, great point !

smhk commented 1 month ago

@dagronf - my God. I just realized that so long as the subgroups are perfectly escaped, you can just use literal csv (with a comma) for such nested issues!

Hence,

" A ", 42, "dog", " B ", 666, "John"

The long strings A and B could in fact be csv strings, so long as they are perfectly escaped

I just never thought of that ...

There is no need for (and no difference in) using a different divider (say, semicolon) as you nest upwards.

Astonishing!
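
A small round-trip sketch of that nesting idea, assuming every field is quoted on the way out and quotes are escaped by doubling. `quoteField`, `encodeRow` and `decodeRow` are hypothetical helpers for illustration only; in practice a real CSV encoder/decoder would take their place.

```swift
// Hypothetical helpers, NOT TinyCSV API. Comma-delimited, RFC 4180 style.
func quoteField(_ field: String) -> String {
    var out = "\""
    for c in field {
        if c == "\"" { out += "\"\"" } else { out.append(c) }
    }
    out += "\""
    return out
}

func encodeRow(_ fields: [String]) -> String {
    return fields.map(quoteField).joined(separator: ",")
}

func decodeRow(_ line: String) -> [String] {
    let chars = Array(line)
    var fields: [String] = []
    var current = ""
    var inQuotes = false
    var i = 0
    while i < chars.count {
        let c = chars[i]
        if inQuotes {
            if c != "\"" {
                current.append(c)
            } else if i + 1 < chars.count, chars[i + 1] == "\"" {
                current.append(c)   // "" inside quotes is one literal quote
                i += 1
            } else {
                inQuotes = false    // closing quote
            }
        } else if c == "\"" {
            inQuotes = true
        } else if c == "," {
            fields.append(current)
            current = ""
        } else {
            current.append(c)
        }
        i += 1
    }
    fields.append(current)
    return fields
}

// The inner row becomes a single, fully escaped field of the outer row.
let innerRow = encodeRow(["ABC;;;", "say \"hi\"", ""])
let outerRow = encodeRow([innerRow, "42", "dog"])

let outerFields = decodeRow(outerRow)        // 3 fields; the first is innerRow
let innerFields = decodeRow(outerFields[0])  // back to the 3 inner fields
print(outerFields.count, innerFields)
```

Both levels use a comma; the escaping alone keeps them apart. The cost is that the quotes double again at every level of nesting.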