ben-strasser / fast-cpp-csv-parser

fast-cpp-csv-parser
BSD 3-Clause "New" or "Revised" License

How to parse less columns than in the line when using `set_header()` #18

Closed adishavit closed 8 years ago

adishavit commented 8 years ago

I have a line with, say, 15 columns, but I'm interested only in the first 5.
However, I don't have a header, so I use set_header() instead of read_header(). But set_header() does not take an ignore_policy.

How can I parse just the first 5 columns, without a bunch of dummies?

pallavagarwal07 commented 8 years ago

+1 @ben-strasser I seem to need the opposite of this. If set_header specifies more columns (say n) than the CSV file has (say m), then the last n-m columns should get default values, as with ignore_missing_columns.

ben-strasser commented 8 years ago

Hi,

I originally considered adding a more general set_header, but decided against it.

If you do not know the number of columns in the file, then how do you know which ones you need? You might say that you need the first x columns. But why not the last x, or every other column? When writing the parser, you cannot know how someone will modify the file format. Where will he add his new column? Will he remove columns? All these questions can be handled transparently, and missing columns detected, when the CSV file has a header. If it does not, this is not possible. I therefore argue that if the format of a headerless CSV file changes, the programmer will in any case have to check manually whether the parsing code still works. A set_header with an ignore_policy that only reads the first x columns is therefore a bug in the making.
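To illustrate why a header makes this robust, here is a minimal sketch of name-based column lookup. It is not the library's implementation: `split_header`, `locate_columns`, and `kMissing` are hypothetical names, and the splitter assumes no quoting and no trailing empty field.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Sentinel meaning "this column name does not occur in the header".
// Hypothetical constant, chosen for this sketch only.
const std::size_t kMissing = static_cast<std::size_t>(-1);

// Splits a header line on `sep`. Assumes no quoted fields.
inline std::vector<std::string> split_header(const std::string& header_line,
                                             char sep = ',') {
    std::vector<std::string> names;
    std::string name;
    std::istringstream in(header_line);
    while (std::getline(in, name, sep)) names.push_back(name);
    return names;
}

// For each wanted column name, returns its index in `header`,
// or kMissing if the file does not contain that column.
inline std::vector<std::size_t> locate_columns(
        const std::vector<std::string>& header,
        const std::vector<std::string>& wanted) {
    std::vector<std::size_t> positions;
    for (const std::string& w : wanted) {
        std::size_t pos = kMissing;
        for (std::size_t i = 0; i < header.size(); ++i) {
            if (header[i] == w) { pos = i; break; }
        }
        positions.push_back(pos);
    }
    assert(positions.size() == wanted.size());
    return positions;
}
```

Because the lookup is by name, adding, removing, or reordering columns in the file either keeps working or yields an explicit `kMissing`. Without a header there is nothing to look up, so the parser can only trust positions.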

If you know the CSV format and the number of columns but only want to read some of the columns, you can use dummy char* variables. These pointers point directly into the memory buffer, so there is nearly no overhead associated with them. You can argue that the interface is ugly for this use case, and you are right. However, I think this use case is sufficiently rare that we can live with the current interface, especially since I do not see how to design an interface that is both flexible and elegant. Using a complicated interface is no prettier than the current situation.
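The claim that char* columns cost almost nothing can be illustrated with a small in-place splitter. This is only a sketch of the general idea, not the library's code: `split_in_place` is a hypothetical helper, it does no quote handling, and it assumes `sep` is not `'\0'`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Splits `line` in place: each separator is overwritten with '\0' and
// a pointer to the start of each field is stored in `fields`.
// No field is copied; every pointer refers to memory inside `line`.
// Stores at most `max_fields` pointers and returns how many were stored.
inline std::size_t split_in_place(char* line, char sep,
                                  char** fields, std::size_t max_fields) {
    assert(line != nullptr && fields != nullptr);
    std::size_t count = 0;
    char* start = line;
    while (count < max_fields) {
        fields[count++] = start;
        char* next = std::strchr(start, sep);
        if (next == nullptr) break;  // that was the last field
        *next = '\0';                // terminate the field inside the buffer
        start = next + 1;
    }
    return count;
}
```

A caller that only needs some columns simply never looks at the other pointers. Those unused pointers play the role of the dummy char* variables: they are set, but nothing is parsed or allocated for them.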

Furthermore, having an ugly interface for CSV files without a header has its use: it pushes people towards adding headers, which will help them down the line when the CSV file format is updated.

Best Regards Ben Strasser

adishavit commented 8 years ago

I get what you're saying. For argument's sake, consider a C++ function with default argument values. You can specify only the first k<n arguments and the rest get the defaults. There is no syntactic option for passing just the last k, or for interleaving. I could argue the same here: if you want just the first k columns, it is a valid use case. Otherwise, use dummy variables. I ended up using dummies too.
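For readers less familiar with the analogy, a minimal sketch of C++ default arguments follows. `Row`, `make_row`, and `demo` are hypothetical names invented for the illustration and have nothing to do with the parser's API.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical record type, used only to show default arguments.
struct Row {
    int id;
    double x;
    double y;
    std::string label;
};

// Only trailing parameters can be defaulted, so a caller may pass the
// first k arguments and let the remaining ones take their defaults.
inline Row make_row(int id, double x = 0.0, double y = 0.0,
                    std::string label = "n/a") {
    return Row{id, x, y, std::move(label)};
}

// Passing the first two arguments; y and label fall back to defaults.
inline void demo() {
    Row r = make_row(7, 1.5);
    assert(r.y == 0.0);
    assert(r.label == "n/a");
}
```

There is no call syntax such as `make_row(7, , , "tag")`: skipping a middle argument means spelling out every argument before the one you want, which mirrors the dummy-variable workaround.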
