Closed kirnhans closed 7 years ago
Hi,
thanks for the bug report but I do not understand what you mean.
ignore_missing_column is a constant and is by definition 2 and not 3. It is supposed to be used as a bit flag. The ignore_policy & ::io::ignore_missing_column condition checks whether the corresponding bit of ignore_policy is set.
Replacing the & by a && is not useful here, as "x && 2" is the same as "x", which is the same as "x != 0", which is definitely not the intended condition.
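A minimal sketch of the difference, for illustration only. The constant and function names below are hypothetical and not taken from csv.h. Only the value 2 for the missing-column flag comes from this thread; the value 1 for the other flag is an assumption.

```cpp
#include <cassert>

// Illustrative flag values (not copied from csv.h).
constexpr unsigned kIgnoreExtraColumn = 1;    // assumed: bit 0
constexpr unsigned kIgnoreMissingColumn = 2;  // from the thread: bit 1

// Bitwise test: true only if the flag's bit is set in policy.
constexpr bool has_flag_bitwise(unsigned policy, unsigned flag) {
    return (policy & flag) != 0;
}

// Logical test: true whenever both operands are nonzero,
// no matter which bits are actually set.
constexpr bool has_flag_logical(unsigned policy, unsigned flag) {
    return policy && flag;
}

// Walks through the cases discussed above.
inline void check_examples() {
    // Policy asks to ignore missing columns: both tests agree.
    assert(has_flag_bitwise(kIgnoreMissingColumn, kIgnoreMissingColumn));
    assert(has_flag_logical(kIgnoreMissingColumn, kIgnoreMissingColumn));

    // Policy only asks to ignore extra columns: the bitwise test correctly
    // says "missing columns are not ignored", the logical test does not.
    assert(!has_flag_bitwise(kIgnoreExtraColumn, kIgnoreMissingColumn));
    assert(has_flag_logical(kIgnoreExtraColumn, kIgnoreMissingColumn));
}
```

With the logical version, any nonzero policy would behave as if every flag were set, which is why the bitwise form is the one that matches a bit-flag design.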
Best Regards Ben Strasser
So I tried to use your parser, but it threw either a missing column or an extra column error, even though I was using it correctly. When I changed the type of "and" used, it was able to run, but it did not actually parse the row properly when I called read_row: it did not overwrite the variables I passed into the function. I'm not sure how to use this library at all.
Hi,
Without code or an example CSV, it is very hard to help you.
Best Regards Ben Strasser
https://raw.githubusercontent.com/kirnhans/15418-project/master/data/cancer/cancer_test.csv is our dataset.
https://github.com/kirnhans/15418-project/blob/master/ParallelRandomForest.cpp#L11 is our code for using your parser.
Right now next_line will return the correct ASCII values, so long as the parser uses logical ands, but read_row does not work.
Hi,
I cannot reproduce the problem. Consider this code:
#include "csv.h"
#include <iostream>
using namespace std;

int main(){
    io::CSVReader<1> in("cancer_test.csv");
    in.read_header(io::ignore_extra_column, "y");
    int y;
    cout << "y" << endl;
    while(in.read_row(y)){
        cout << y << endl;
    }
    return 0;
}
and the following shell code:
g++ main.cpp -std=c++11 -pthread -o foo
./foo > test.out
sed -E "s/^.*(.)$/\1/" < cancer_test.csv > sed.out
diff test.out sed.out
The diff has no output, i.e., the files are equal, i.e., the last column is correctly selected. The error seems to be in your code.
Best Regards Ben Strasser
Does the code still work when you use "in.set_file_line(index)"?
Sorry about that - it turns out that one of the files wasn't correctly formatted. When I use set_file_line and then read_row, it doesn't read the line which I used as the argument, but the line it was at previously. This is a problem because I want random accesses.
Hi,
> When I use set_file_line and then read_row, it doesn't read the line which I used as the argument, but the line it was at previously. This is a problem because I want random accesses.
set_file_line only sets the line used in the error reporting.
If you need random access, then read the whole file and store each row in a vector. CSV is not a format that supports efficient random access. One must always scan the whole file to figure out which line is which.
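A minimal sketch of that approach, using only the standard library rather than this parser's API. The helper name load_rows is hypothetical, and the sketch assumes a simple file with no quoted fields. The same loop shape should apply when the rows come from read_row instead of std::getline.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Reads every line of a simple CSV stream (no quoted fields) and stores the
// cells of each row, so that rows can later be accessed by index.
std::vector<std::vector<std::string>> load_rows(std::istream& in) {
    std::vector<std::vector<std::string>> rows;
    std::string line;
    while (std::getline(in, line)) {
        // Drop a trailing carriage return left over from CRLF line endings.
        if (!line.empty() && line.back() == '\r') line.pop_back();

        std::vector<std::string> cells;
        std::istringstream fields(line);
        std::string cell;
        while (std::getline(fields, cell, ',')) cells.push_back(cell);
        rows.push_back(std::move(cells));
    }
    return rows;
}
```

After one sequential pass, rows[i] gives the i-th line directly, at the cost of holding the whole file in memory.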
Best Regards Ben Strasser
The parser could generate a vector
Does this parser support quoted multiline csv cells?
No, 1 row = 1 line.
A CSV file whose string values contain unescaped line breaks is broken, and not fixing your files is begging for trouble.
(edited to change "standard" to "RFC" as it's just an RFC, not an actual standard)
Sorry for the late reply,
The RFC for CSV allows unescaped line breaks; they just need to be within quotes. https://www.ietf.org/rfc/rfc4180.txt See number 6:
Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. For example:
"aaa","b CRLF
bb","ccc" CRLF
zzz,yyy,xxx
Everyone has a different opinion of what a CSV format should be. It would be nice if everyone just followed one format...
My branch of this project supports the RFC correctly (to my knowledge): https://github.com/paulharris/cppcsv It uses a state machine to parse the format and seems pretty quick. I'm not sure if it's as fast as your library though; I haven't had a chance to compare.
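To illustrate the idea of a state machine that accepts quoted line breaks, here is a rough sketch. It is not the code from cppcsv or from csv.h. The names parse_csv, Row and State are made up for this example, and it makes simplifying assumptions: carriage returns outside quotes are dropped and malformed input is not reported.

```cpp
#include <cassert>
#include <string>
#include <vector>

using Row = std::vector<std::string>;

// Parses CSV text where quoted fields may contain commas, line breaks,
// and doubled quotes ("") standing for one literal quote.
std::vector<Row> parse_csv(const std::string& text) {
    enum class State { Unquoted, Quoted, QuoteInQuoted };
    std::vector<Row> rows;
    Row row;
    std::string cell;
    State state = State::Unquoted;
    bool pending = false;  // true once the current row has any content

    auto end_cell = [&] { row.push_back(cell); cell.clear(); };
    auto end_row = [&] {
        end_cell();
        rows.push_back(row);
        row.clear();
        pending = false;
    };

    for (char c : text) {
        switch (state) {
        case State::Unquoted:
            if (c == '"') { state = State::Quoted; pending = true; }
            else if (c == ',') { end_cell(); pending = true; }
            else if (c == '\n') { end_row(); }
            else if (c != '\r') { cell += c; pending = true; }
            break;
        case State::Quoted:
            // Inside quotes everything is literal, including line breaks.
            if (c == '"') state = State::QuoteInQuoted;
            else cell += c;
            break;
        case State::QuoteInQuoted:
            // A quote seen inside a quoted field: either an escaped quote
            // ("") or the end of the quoted section.
            if (c == '"') { cell += '"'; state = State::Quoted; }
            else {
                state = State::Unquoted;
                if (c == ',') end_cell();
                else if (c == '\n') end_row();
                else if (c != '\r') cell += c;
            }
            break;
        }
    }
    // Flush a final row that is not terminated by a line break.
    if (pending || !cell.empty() || !row.empty()) end_row();
    return rows;
}
```

Fed the RFC example above, this yields two rows, with the second cell of the first row keeping its embedded line break. A line-oriented parser would instead see three physical lines.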
Best regards, Paul
https://github.com/ben-strasser/fast-cpp-csv-parser/blob/master/csv.h#L900 This is a bitwise and instead of a logical and. This causes parsing to be incorrect for some cases, for example when ignore_policy = 2 and ignore_missing_column = 3. The same thing is true for line 909. To fix it, replace the & with an &&.