Inist-CNRS / node-csv-string

CSV Strings & Streams for Javascript since 2012
Other
86 stars 18 forks source link

Fails to detect delimiter #45

Open ghost opened 4 years ago

ghost commented 4 years ago

I'm trying to use the detect function to determine the delimiter used for data which can be either tab or comma delimited, however it always returns the delimiter as , even if there are no commas in the data.

touv commented 4 years ago

As you can see, the "detect" function is pretty basic, https://github.com/Inist-CNRS/node-csv-string/blob/master/src/CSV.ts#L57-L65

comma is the default value !

ghost commented 4 years ago

comma is the default value !

I understand that, however shouldn't it change if it detects a different delimiter (tab in my case)?

pbrunisholz commented 4 years ago

Hello, I noticed that problem too.

I tried to detect the best separator in that test string using the detect function : a,b|c;d In a first time I wasn't expecting something, I was only waiting to see the result in order to understand the logic behind the selection of the "best" separator. The result was a comma , so I guessed it was selected as default because there was too many different separators.

Then I made another test : "a,b";c;d This time I expected a semi-colon ; as result since I escaped the content containing the comma with double-quotes but I was disapointed to get a comma , as result once more. So I guessed it was using the first character matching with the array of declared delimiters.

So I made a last test : "a;b"|c|d Since I removed the comma and put the semi-colon between double-quotes I was expecting the pipe | as result but I got the semi-colon ; which confirms my second hypothesis now.

Since the issue has been notified, do you plan to fix that or do you consider it as a correct behaviour ? If you consider it as correct, can you please explain why ?

Thank you

EDIT : I checked the detect function and indeed it was like I expected as second hypothesis.

The function defines the "best" delimiter as the first character found in the input which is also included in the array of accepted separators without considering the escaped content.

So for now I created my own detect function which use the same array of allowed delimiter except that I do not consider matching separators within a string escaped either by double or single quotes and it's not the first encounter but the most recurring one that is returned. Since it can quickly become a time consuming process on large files, I only read the first row (generally the headers one) to detect, not the best, but the most probable delimiter used in the file.

Let me know if you're interested in my function, I'd be glad to share it with you. Have a good day !

hossam-magdy commented 3 years ago

@pbrunisholz Thank you for spending time in this.

Your suggestion of the detect function seems interesting to me!

As you already implemented it, why don't you create a pull request with the changes? if you like! ... perhaps saving some time. Also it might be easier to discuss in a PR.