ben-strasser / fast-cpp-csv-parser

fast-cpp-csv-parser
BSD 3-Clause "New" or "Revised" License

Does not parse multiple lines in a column separated with '\n' #2

Closed niranjan92 closed 8 years ago

niranjan92 commented 9 years ago

Consider the following input:

"Id","Title","Body","Tags"^M 
"1","How to check if an uploaded file is an image without mime type?","<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>

<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>
","php image-processing file-upload upload mime-types"^M 

Here the Body column contains HTML text that uses \n as a line separator inside the quoted field. The end of each record is marked with \r\n (shown as ^M above). The attached file highlights the occurrence of all newline characters in the text. I can mail you the sample file in case there is any confusion; I was unable to format it correctly using markdown. The data is the Stack Overflow tag prediction data from Kaggle, which is openly available.

The parser throws the following error:

what():  Escaped string was not closed in line 2 in file "../data/pristine/Train_100.txt".
Aborted (core dumped)

I temporarily fixed this issue with a small hack in the next_line() function, at line 272.

    // Scan for '\r' instead of '\n'; check the bound before reading the buffer.
    while(line_end != data_end && buffer[line_end] != '\r'){
        ++line_end;
    }
    ++line_end; // on the assumption that \r is followed by \n

I am searching for \r instead of \n. This works for my case, but can we have something more robust that handles such input data?
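
One possible direction, as a minimal sketch only: track whether the scan is inside a double-quoted field, and accept a \n as the end of a record only when it appears outside quotes. The helper name `find_record_end` and the use of `std::string` are assumptions made for illustration and are not part of the library. The sketch also assumes that `"` is the quote character and that a literal quote inside a field is written as `""`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of fast-cpp-csv-parser.
// Returns the index one past the '\n' that ends the record starting at
// `start`, ignoring any '\n' that sits inside a double-quoted field.
// Returns std::string::npos if the buffer ends before the record does.
std::size_t find_record_end(const std::string& buffer, std::size_t start) {
    bool in_quotes = false;
    for (std::size_t i = start; i < buffer.size(); ++i) {
        const char c = buffer[i];
        if (c == '"') {
            // A doubled quote ("") toggles the flag twice, so the
            // quoted/unquoted state is unchanged after the pair.
            in_quotes = !in_quotes;
        } else if (c == '\n' && !in_quotes) {
            return i + 1;
        }
    }
    return std::string::npos;
}
```

With the sample above, a scan that starts at the beginning of line 2 would skip the newlines inside the Body field and stop after the \r\n that follows the Tags column. A real fix would still need to handle a record that spans more than one buffer refill, which this sketch does not attempt.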

PS: I am new to GitHub as well, and this is my first issue.

niranjan92 commented 9 years ago

Any updates on this issue?

Regards,

ben-strasser commented 9 years ago

Hi,

This problem has come up before, and I decided not to fix it at that time. The reasons were:

If you know of a good way to support such ill-formed CSV files, then I can add the change. If not, then this problem will stay.

Best Regards, Ben Strasser

On Sat, 06 Jun 2015 12:06:31 -0700 Niranjan Godbole notifications@github.com wrote:

Any updates on this issue?

Regards,

