
[C++] CSV add convert option to attempt 32bit number inferences #28740

Open · asfimport opened this issue 3 years ago

asfimport commented 3 years ago

When the CSV reader infers column types, numbers always come out as 64-bit. For large data sets it could be better to use 32-bit types to reduce overall memory use. To do this, it would be useful to add an option to ConvertOptions to try 32-bit numbers before 64-bit. By default this option would be disabled.
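To make the current behavior concrete, here is a minimal reproduction against the existing C++ reader API (the inline CSV data is made up): both columns come back as int64 even though every value fits in int32.

```cpp
#include <arrow/api.h>
#include <arrow/csv/api.h>
#include <arrow/io/api.h>
#include <iostream>

// Demonstrates the premise of this issue: with default options, inferred
// integer columns come out as int64 even though every value fits in int32.
int main() {
  auto input = std::make_shared<arrow::io::BufferReader>(
      arrow::Buffer::FromString("a,b\n1,2\n3,4\n"));
  auto reader = arrow::csv::TableReader::Make(
                    arrow::io::default_io_context(), input,
                    arrow::csv::ReadOptions::Defaults(),
                    arrow::csv::ParseOptions::Defaults(),
                    arrow::csv::ConvertOptions::Defaults())
                    .ValueOrDie();
  auto table = reader->Read().ValueOrDie();
  std::cout << table->schema()->ToString() << std::endl;  // a: int64, b: int64
  return 0;
}
```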

Reporter: Nate Clark / @n3world

Related issues:

Note: This issue was originally created as ARROW-13028. Please see the migration documentation for further details.

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: I'm unsure how much flexibility we want to add to CSV type inference. You can of course pass column types explicitly if you want to optimize memory footprint.

@nealrichardson @jonkeane What do you think?
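For reference, the explicit route Antoine mentions uses the existing ConvertOptions::column_types map; a minimal sketch (the file path and column names are made up):

```cpp
#include <arrow/api.h>
#include <arrow/csv/api.h>
#include <arrow/io/api.h>

// Sketch: columns listed in ConvertOptions::column_types skip inference
// entirely; all other columns are still inferred as usual.
arrow::Result<std::shared_ptr<arrow::Table>> ReadWith32BitColumns(
    const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto input, arrow::io::ReadableFile::Open(path));

  auto convert_options = arrow::csv::ConvertOptions::Defaults();
  convert_options.column_types["id"] = arrow::int32();       // hypothetical column
  convert_options.column_types["score"] = arrow::float32();  // hypothetical column

  ARROW_ASSIGN_OR_RAISE(
      auto reader,
      arrow::csv::TableReader::Make(arrow::io::default_io_context(), input,
                                    arrow::csv::ReadOptions::Defaults(),
                                    arrow::csv::ParseOptions::Defaults(),
                                    convert_options));
  return reader->Read();
}
```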

asfimport commented 3 years ago

Jonathan Keane / @jonkeane: Yeah, I tend to agree that if one needs or wants to manage that conversion, explicit column types are the way to go (and that interface has the benefit of also allowing one to control the types of other columns).

This is an empirical question (and will almost certainly vary by the data), but what would a miss look like performance-wise for trying 32-bit and then having to change to 64-bit after the fact? That would involve some computation, correct? Or can we do that conversion for free / without rewriting the representation?

asfimport commented 3 years ago

Weston Pace / @westonpace: The problem is that the miss may not be detected until some number of blocks have been processed. The file-based CSV reader handles this by going back through all the already-processed blocks and upcasting to the looser type, so it can be a non-trivial performance hit. However, the streaming CSV reader (used by the datasets API) isn't so lenient: it infers types from the first block (default 1MB) of data alone, and the complexity of doing otherwise is pretty significant. I think this could cause an issue here: if the first >32-bit value doesn't appear until after the first block, you will get parsing errors.
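To make the cost concrete, here is a rough sketch of that upcast using the public compute API (this is not the reader's actual internal code): every already-decoded int32 chunk gets rewritten as int64 by the Cast kernel.

```cpp
#include <arrow/api.h>
#include <arrow/compute/api.h>

// Illustration only (not the reader's internal code): if a later block holds
// a value that does not fit in int32, every chunk decoded so far has to be
// upcast to int64 before reading can continue -- a real rewrite of the data.
arrow::Result<arrow::ArrayVector> UpcastProcessedChunks(
    const arrow::ArrayVector& int32_chunks) {
  arrow::ArrayVector upcast;
  upcast.reserve(int32_chunks.size());
  for (const auto& chunk : int32_chunks) {
    ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Array> widened,
                          arrow::compute::Cast(*chunk, arrow::int64()));
    upcast.push_back(widened);
  }
  return upcast;
}
```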

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: There would be a conversion from int32 to int64, indeed. The conversion is cheap compared to the actual CSV parsing, though, so it should be a minor concern.

asfimport commented 3 years ago

Nate Clark / @n3world: Ideally one would pass the column types if they are known, but for my use case I am relying on the reader's type inference to learn what the column types are. When relying on the reader for types, the only way to get 32-bit values would be to re-parse the CSV forcing the type to a 32-bit value, which will fail if any value doesn't fit in 32 bits.


It is true that if one of the later blocks did have a 64-bit number that would cause a parsing error, but the same would be true if the column was inferred as int but was in fact a float, or if the column is empty 40% of the time and the first block happens to have no data for it. This is more a limitation of the schema being determined by the first block and being unable to change after that.
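For context, here is roughly what consuming the streaming reader looks like (a sketch against the public arrow::csv::StreamingReader API; the wrapper function is made up). The schema is fixed once Make succeeds, so a later out-of-range value surfaces as a read error rather than a type upgrade:

```cpp
#include <arrow/api.h>
#include <arrow/csv/api.h>
#include <arrow/io/api.h>

// Sketch: the streaming reader infers its schema from the first decoded block
// and never revisits it; later blocks must parse under that schema.
arrow::Status ConsumeCsvStream(std::shared_ptr<arrow::io::InputStream> input) {
  ARROW_ASSIGN_OR_RAISE(
      auto reader,
      arrow::csv::StreamingReader::Make(arrow::io::default_io_context(), input,
                                        arrow::csv::ReadOptions::Defaults(),
                                        arrow::csv::ParseOptions::Defaults(),
                                        arrow::csv::ConvertOptions::Defaults()));
  // Schema is fixed from this point on; no later block can widen a column.
  std::shared_ptr<arrow::Schema> schema = reader->schema();
  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));  // fails on a misfit value
    if (batch == nullptr) break;                    // end of stream
  }
  return arrow::Status::OK();
}
```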


One of the reasons the default is to not try 32-bit values is to avoid the potential parse errors on subsequent blocks, so this should really only be used if the caller knows all numeric columns can be represented in 32 bits or can handle the parse error.

asfimport commented 3 years ago

Nate Clark / @n3world: I was already working on this and discovered something which might be a problem with my idea. It looks like any numeric value can be parsed as a float, even if the string carries more precision than a float can hold. Because of this, the float parse will succeed even when it would be better to parse the value as a double.

Is this by design or accident?

asfimport commented 3 years ago

Nate Clark / @n3world: Is there any interest in adding this for at least ints? If not, I can just close this ticket.

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: Sorry for not answering earlier:

Is this by design or accident?

I would say by accident, but I'm not sure what you mean by "the precision of the string is too much for a float". Strictly speaking, some very short decimal numbers are not exactly representable in binary floating-point, for example "0.3". Should we reject them?

Is there any interest in adding this for at least ints?

Potentially. @nealrichardson @jonkeane what do you think?

asfimport commented 3 years ago

Nate Clark / @n3world:

Is this by design or accident?

I would say by accident, but I'm not sure what you mean by "the precision of the string is too much for a float". Strictly speaking, some very short decimal numbers are not exactly representable in binary floating-point, for example "0.3". Should we reject them?

I was thinking of something like 3.78946546156984798497501e10, which can be better represented as a double than as a float. But as you point out, there are values which cannot be fully represented in either, so there might not be a good way to detect when double should be used instead of float.
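A quick standard-library check of that exact value (plain C++, independent of Arrow's parser) shows both parses "succeed" and differ only in retained precision:

```cpp
#include <cstdio>
#include <string>

// Plain standard-library check (not Arrow's parser): both conversions
// "succeed" for this string; only the retained precision differs.
int main() {
  const std::string s = "3.78946546156984798497501e10";
  float f = std::stof(s);   // succeeds, rounded to the nearest float
  double d = std::stod(s);  // succeeds, keeps roughly twice the digits
  std::printf("float : %.10g\n", f);  // prints something like 3.7894656e+10
  std::printf("double: %.17g\n", d);  // prints roughly 37894654615.698479
  return 0;
}
```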

asfimport commented 2 years ago

Eduardo Ponce / @edponce: I think that having CSV infer the largest type is more robust/safe, and explicit column types should be used for other conversions. If inference is set to go from smallest to largest, then where do these decisions end? Do we infer first as signed or unsigned integers? Int8 vs. int32, etc.? Half-float vs. float vs. double? We could certainly decide to simply try signed int32 and float as the smallest numeric types, but it still feels a bit opinionated.

asfimport commented 2 years ago

Nate Clark / @n3world: I agree that the largest type could be considered safest, especially for floating point. In theory it could start at int8 and work up from there if there is any interest in that, but signed vs. unsigned is probably not as beneficial. For floating point, detection is more difficult: it is already an imprecise format, so parsers will force values to fit the size, and detecting double vs. float would have to be done outside the parser. I did put out the linked MR for int32 detection so that you can at least see that implemented.

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: I agree with Eduardo that it feels a bit opinionated. If we want to allow users to influence integer inference, we should probably expose a more general setting than a simple boolean option to try 32-bit inference.

@nealrichardson @ianmcook what do you think?

asfimport commented 2 years ago

Nate Clark / @n3world: @pitrou what would you envision as a more generic setting for influencing type inference? Something like an enum indicating: only infer 64-bit, attempt 32-bit int, force 32-bit float?
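Something along those lines might look like the following; purely hypothetical, none of these names exist in Arrow:

```cpp
// Purely hypothetical sketch -- none of these names exist in Arrow. One shape
// a more general, future-proof setting could take, per the discussion above:
enum class NumberInference {
  kInt64Only,     // current behavior: integers infer as int64, reals as double
  kTryInt32,      // attempt int32 first, fall back to int64 on overflow
  kForceFloat32,  // parse real columns as float32 rather than double
};

struct ConvertOptionsSketch {
  // Hypothetical knob replacing the originally proposed boolean.
  NumberInference number_inference = NumberInference::kInt64Only;
};
```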

asfimport commented 2 years ago

Nate Clark / @n3world: @pitrou is it worth keeping this ticket open to discuss this further or should this be closed because there is no interest in implementing this behavior?

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: Well, it's probably worth keeping open for now. But the solution should revolve around a more future-proof setting than the proposed boolean setting.

asfimport commented 1 year ago

Todd Farmer / @toddfarmer: This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per project policy. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.