ben-strasser / fast-cpp-csv-parser

fast-cpp-csv-parser
BSD 3-Clause "New" or "Revised" License

Fails to parse string with four consecutive quotes! #3

Closed niranjan92 closed 8 years ago

niranjan92 commented 9 years ago

Fails to parse the following input:

"Id","Title","Body","Tags"^M 
"1","How to check if an uploaded file is an image without mime type?","<p><img src="http://i.stack.imgur.com/bA7Tz.jpg" alt=""""></p>
","php image-processing file-upload upload mime-types"^M 

The output for the Body column should be:

<p><img src="http://i.stack.imgur.com/bA7Tz.jpg" alt=""></p>

But the output is:

<p><img src="http://i.stack.imgur.com/bA7Tz.jpg" alt="></p>

gcflymoto commented 9 years ago

how do you expect that to be legal? shouldn't quotes be escaped?

niranjan92 commented 9 years ago

It's not an outright error, but the library doesn't support that functionality.

I am using data released by stackoverflow.com for a tag prediction competition on Kaggle. I think Excel-to-CSV conversion results in such data. http://stackoverflow.com/questions/25064422/excel-to-csv-to-txt-cells-with-double-quotes

As per wiki: http://en.wikipedia.org/wiki/Comma-separated_values

Embedded double quote characters may then be represented by a pair of consecutive double quotes,[9] or by prefixing an escape character such as a backslash (for example in Sybase Central).
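
The doubling convention quoted above can be sketched as a small helper. This is only an illustration and not part of fast-cpp-csv-parser; the name `quote_csv_field` is hypothetical.

```cpp
#include <string>

// Hypothetical helper (not part of fast-cpp-csv-parser): wraps a field in
// double quotes and doubles every embedded double quote.
std::string quote_csv_field(const std::string& field) {
    std::string out;
    out.reserve(field.size() + 2);
    out.push_back('"');
    for (char c : field) {
        if (c == '"') {
            out.push_back('"');  // escape by emitting the quote twice
        }
        out.push_back(c);
    }
    out.push_back('"');
    return out;
}
```

Under this convention a value such as `alt=""` is written as `"alt="""""`, which is how a run of four or more consecutive quotes can legitimately appear inside a quoted field.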

ben-strasser commented 9 years ago

Hi,

thanks for the bug report. I can confirm the problem. The erroneous code is in line 601.

    char*out = col_begin;
    for(char*in = col_begin; in!=col_end; ++in){
        if(*in == quote && *(in+1) == quote){
            continue;
        }
        *out = *in;
        ++out;
    }

When unescaping the string, the code drops every " that is followed by another ". This is correct in the case of "bla""foo""bar" but fails in the case of "foo""""bar", where the four consecutive quotes collapse to a single " instead of two.
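
To see the failure concretely, the loop can be wrapped in a small stand-alone function and run on both inputs. This is a sketch under assumptions: `unescape_buggy` is a hypothetical name, the field content is passed with its enclosing quotes already removed, and the `in + 1 != col_end` bounds check is added here so the sketch never reads past the end of the field.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-alone version of the erroneous loop.
// `field` is the column content without the enclosing quotes.
std::string unescape_buggy(std::string field, char quote = '"') {
    char* col_begin = &field[0];
    char* col_end = col_begin + field.size();
    char* out = col_begin;
    for (char* in = col_begin; in != col_end; ++in) {
        // Bug: every quote that is followed by another quote is dropped,
        // so a run of consecutive quotes collapses to a single quote.
        if (*in == quote && in + 1 != col_end && *(in + 1) == quote) {
            continue;
        }
        *out = *in;
        ++out;
    }
    field.resize(static_cast<std::size_t>(out - col_begin));
    return field;
}
```

With this sketch, `bla""foo""bar` becomes `bla"foo"bar` as intended, while `foo""""bar` becomes `foo"bar` rather than `foo""bar`.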

Does the following code work for you? (I currently do not have the time to test.)

    char*out = col_begin;
    for(char*in = col_begin; in!=col_end; ++in){
        if(*in == quote && *(in+1) == quote){
            ++in;
        }
        *out = *in;
        ++out;
    }
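
For anyone who wants to check the proposed loop in isolation, here is a stand-alone sketch. Assumptions: `unescape_fixed` is a hypothetical name, the field content is passed without its enclosing quotes, and the `in + 1 != col_end` bounds check is an addition that is not in the snippet above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-alone version of the proposed loop.
// `field` is the column content without the enclosing quotes.
std::string unescape_fixed(std::string field, char quote = '"') {
    char* col_begin = &field[0];
    char* col_end = col_begin + field.size();
    char* out = col_begin;
    for (char* in = col_begin; in != col_end; ++in) {
        // For a pair of quotes, skip the first one and copy the second,
        // so each escaped pair yields exactly one quote in the output.
        if (*in == quote && in + 1 != col_end && *(in + 1) == quote) {
            ++in;
        }
        *out = *in;
        ++out;
    }
    field.resize(static_cast<std::size_t>(out - col_begin));
    return field;
}
```

With this sketch, `foo""""bar` becomes `foo""bar`, and the `alt=""""` fragment from the report becomes `alt=""`.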

Best Regards, Ben Strasser

niranjan92 commented 9 years ago

yes, the given fix works! Thanks. :)