ben-strasser / fast-cpp-csv-parser

fast-cpp-csv-parser
BSD 3-Clause "New" or "Revised" License
2.12k stars 439 forks source link

Passing a vector of columns to be populated instead of names #31

Closed rapoth closed 7 years ago

rapoth commented 7 years ago

Hi,

Instead of explicitly typing out variables into which you'd parse the CSV file like this:

std::string id; std::string speed; std::string value; _reader->read_row(id, speed, value);

Is it possible to instead accept a vector containing columns with specific types to do this? Something like the following:

auto cols = new Columns{ ... three string cols ... };
_reader->read_row(cols);

will then populate cols[0], cols[1] and cols[2] as needed. Any suggestions on how this can be done?

helmingstay commented 7 years ago

This is (mostly) possible with a variadic template. First off, you must use an array, since the size of a vector is not known at compile time. Second, you need a helper template to take that array and "unroll" it into args.

It's a little tricky if you're new to TMP, but it's a common enough construct.

Edit - I've pasted in some snippets of my working code. It's far from a MWE, but it will (hopefully) provide you with a starting place.

// http://stackoverflow.com/questions/16834851/passing-stdarray-as-arguments-of-template-variadic-function // magic: generate integer sequence as function args, pass to variadic template namespace indices_t { template struct seq { }; template<int N, int... Is> struct gen_seq : gen_seq<N - 1, N - 1, Is...> { }; template struct gen_seq<0, Is...> : seq { }; }

// example calling function template // the analogy to a "column name array" here is row_text template void _read_row(reader_t & in_csv, indices_t::seq) { // ... while( in_csv.read_row( row_text[Is]...) ) { // do stuff } }

// call example - 
// the effect is that _read_row is indices_t generates one arg, a numeric index, for each element of the array 
    _read_row(in_csv, indices_t::gen_seq<Ndat>());

hth, -xian

rapoth commented 7 years ago

@helmingstay Thanks for your time Xian! This is a very interesting solution! However, wouldn't this require modifying the csv.h? In addition, I have two questions:

  1. I am the new function will only read the csv values as text (and does not deal with casting into types). Is this correct?
  2. Instead of this approach (which requires modifying csv.h), why not add one more signature like the following:

bool read_row(std::vector<std::string>& columns)

Would be interesting to see whether @ben-strasser has any plans of supporting this. If so, maybe @helmingstay can submit a pull request? I would highly prefer something that does not require me to maintain my own fork of the codebase.

helmingstay commented 7 years ago

Just to reiterate, the code I added wasn't meant as a complete example, or as a solution to your particular issue. It was simply to A) let you know that your request was possible and B) point you in the right direction.

To answer your questions:

A. " However, wouldn't this require modifying the csv.h?" I've simply added a few alias templates and helper classes that set policies, and mold the interface in csv.h to my liking. For example, I have a header with the following:

// convenience policy definition, trim spaces, comma sep, comments=#
template<size_t ncol>
using csv_t = io::CSVReader<ncol, io::trim_chars<' '>, io::no_quote_escape<','>, io::throw_on_overflow, io::single_line_comment<'#'>  >; 
  1. In my case, I'm reading everything as text and then manually parsing. As @ben-strasser notes in the README, "The builtin number parsers are pure convenience. If you need a slightly different syntax then use char* and do the parsing yourself."
  2. It is NOT POSSIBLE to determine the dimension of a vector at runtime. Thus, this code will not (and cannot be made to) compile. You could do this with a std::array containing char* (though be aware that, for the exact same reason, an array of std::strings is not possible).

I suggest spending a few days reading up on TMP and variadic templates in cpp, and then working through a few examples. Some of the questions that you're asking will be quickly clarified with some practice :)

ben-strasser commented 7 years ago

Is it possible to instead accept a vector containing columns with specific types to do this? Something like the following:

auto cols = new Columns{ ... three string cols ... };
_reader->read_row(cols);

will then populate cols[0], cols[1] and cols[2] as needed. Any suggestions on how this can be done?

You can do this:

vector<string>vec(3);
in.read_row(vec[0], vec[1], vec[2]);

This will get the data into a vector of strings without you having to declare a variable per column. Sounds to me like what you were asking for.

Instead of this approach (which requires modifying csv.h), why not add one more signature like the following: bool read_row(std::vector<std::string>& columns)

CSVReader takes the number of columns as compile-time constant. A vector has a dynamic size. What should happen if the vector's size and the number of columns do not coincide? There are many options: assert fails, exception thrown, vector resized, segfault, stack trashing. No option is intuitive. On the other hand the simple

in.read_row(vec[0], vec[1], vec[2]);

makes it absolutely clear that it is the responsibility of the caller to make the vector the right size. Further, not having an additional method makes the interface thinner. Less code -> less bugs, Smaller interface -> easier to learn.

@helmingstay Be aware that the docs says to read as char* and not as std::string. From a performance point of view, there is a significant difference between those two.

helmingstay commented 7 years ago

@ben-strasser "Be aware that the docs says to read as char* and not as std::string. From a performance point of view, there is a significant difference between those two."

Yes, thank you :) In my case I actually do use std::string, mainly out of laziness, but I was careful to remove any mention of it in my examples (and explicitly point to your README comment).

sshivaji commented 7 years ago

I respectfully think its quite useful to have dynamic fill for values in the in.read_row api. A common usecase is to process very different CSV files for analysis. How would one know in advance the number of columns and formulate that in the source code then? Is there another way to populate a variadic function?

helmingstay commented 7 years ago

On Wed, Dec 14, 2016 at 6:32 PM, Shivkumar Shivaji <notifications@github.com

wrote:

How would one know in advance the number of columns and formulate that in the source code then? Is there another way to populate a variadic function?

Be careful to distinguish between a variadic function ( http://en.cppreference.com/w/cpp/utility/variadic) and a variadic function template (used here, see http://en.cppreference.com/w/cpp/language/parameter_pack).

While I'm not Ben, the use of templates here is a fundamental design decision. And a template must be fully specified at compile time.

Maybe you could specify an upper bound of the number of columns, and then generate all possible instantiations. Maybe you could store these definitions in an std::array, and then use some sort of typedecl to select the right type at compile time? That doesn't solve the column name problem.

All told, it seems like this particular library isn't a very good fit for your use-case? best, Christian

-- A man, a plan, a cat, a ham, a yak, a yam, a hat, a canal – Panama! http://www.x14n.org

sshivaji commented 7 years ago

The speed of this library is very good. The limitation of the static column typing I think can be solved. If you have a branch or commit I can test, I will be more than happy. For now, I will look at your proposal and sample code above and see if I can fix it.

Thanks for pointing out the difference between a variadic template and a variadic function!

ben-strasser commented 7 years ago

Hi,

thanks for the feedback. :)

This has been asked several times. There are several important features that currently exist, that would conflict in an, as far as I can tell, non-resolvable manner. Especially the implicit column reordering is problematic. Look at the older issues for a more detailed rational.

If all you want is the speed, then you can also use LineReader and populate a vector<char*> by splitting the line at commas and replacing them by a 0-byte. For example:

cpp LineReader in(file_name); while(char*line = in.next_line()){ vector<char*>row; row.push_back(line); while(*line != '\0'){ if(*line == ','){ *line = '\0'; row.push_back(line+1); } ++line; } // do something with row } Warning: untested code.

Best Regards Ben Strasser