
[Python][CSV] CSV reader performs unsafe type conversion #32171

Open asfimport opened 2 years ago

asfimport commented 2 years ago

Hi, I've noticed that although pa.scalar and pa.array behave correctly when given the largest possible (uint64) value (e.g. they correctly fail when asked to cast it to float), the CSV reader happily converts strings representing uint64 values to float (see the example below). Is this intended? Would it be possible to have a safe-conversion-only option?

The problem is that at the moment the only safe way to read a CSV whose types are not known in advance is to read it without any conversion (strings only) and perform the type inference oneself.

It would be ok if UInt64 types couldn't be inferred, as long as the corresponding columns aren't destructively coerced to float. If they were left as string columns, one could implement a custom conversion while still benefiting from the correct and automatic conversion of the remaining columns.
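
For reference, a rough sketch of that all-strings workaround (it peeks at the header line to learn the column names, assuming an unquoted header; the sample data matches the repro further below):


import io

import pyarrow as pa
import pyarrow.csv

data = b"int64,uint64\n0,0\n4294967295,18446744073709551615"

# Peek at the header line to learn the column names.
header = data.split(b"\n", 1)[0].decode("utf-8").split(",")

# Pinning every column to string disables inference entirely, so no
# value can be destructively coerced.
convert_options = pa.csv.ConvertOptions(
    column_types={name: pa.string() for name in header}
)
tbl = pa.csv.read_csv(io.BytesIO(data), convert_options=convert_options)
print(tbl.schema)  # int64: string, uint64: string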

 

The following correctly rejects the float type for uint64 values:


import pyarrow as pa

uint64_max = 18_446_744_073_709_551_615

# Constructing uint64 scalars and arrays from the max value works fine.
type_ = pa.uint64()
uint64_scalar = pa.scalar(uint64_max, type=type_)
uint64_array = pa.array([uint64_max], type=type_)

# Converting to float64 is rejected: these values are not exactly
# representable as IEEE 754 doubles.
try:
    f = pa.scalar(uint64_max, type=pa.float64())
except Exception as exc:
    print(exc)

try:
    f = pa.scalar(uint64_max // 2, type=pa.float64())
except Exception as exc:
    print(exc)

>> PyLong is too large to fit int64
>> Integer value 9223372036854775807 is outside of the range exactly representable by a IEEE 754 double precision value

The CSV reader, on the other hand, doesn't infer UInt64 types (which is fine, as documented at https://arrow.apache.org/docs/cpp/csv.html#data-types), but it does coerce values to float that shouldn't be coercible according to the examples above:


import io

import pyarrow as pa
import pyarrow.csv  # needed for pa.csv

csv = "int64,uint64\n0,0\n4294967295,18446744073709551615"
tbl = pa.csv.read_csv(io.BytesIO(csv.encode("utf-8")))

print(tbl.schema)
# uint64_scalar is the pa.uint64() scalar constructed above
print(tbl.column("uint64")[1] == uint64_scalar)
print(tbl.column("uint64")[1].cast(pa.uint64()))

int64: int64
uint64: double

False
0

Reporter: Thomas Buhrmann / @buhrmann


Note: This issue was originally created as ARROW-16843. Please see the migration documentation for further details.

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: Note that's not really the same situation: in the first case, you are converting between number types, where you presumably expect the conversion to be exact. In the second case, you are converting from string to number, where it's not unreasonable to accept some possible precision loss.

It's also likely that raising an error would break a non-tiny number of real-world CSV reading use cases.
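
(As a plain-Python illustration of that precision loss, independent of Arrow:)


s = "18446744073709551615"      # uint64 max as a string
print(float(s))                 # 1.8446744073709552e+19
print(int(float(s)) == int(s))  # False: string -> float -> int is lossy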

asfimport commented 2 years ago

Thomas Buhrmann / @buhrmann: You're right, this also performs destructive conversion:


pa.scalar("18446744073709551615").cast(pa.float64()) 

 >> <pyarrow.DoubleScalar: 1.8446744073709552e+19>

This is why I think it would be good to have an option to skip conversions that are potentially destructive (in the sense that one cannot cast back to string or to another type without loss of information), even if the default behaviour remains destructive. E.g. it is quite common to have ID columns in the uint64 range, which at the moment cannot be read with the CSV reader without disabling all type inference.

Another possibility would be to pass a list of inferrable types (so one could exclude float64), in addition to the explicit column_types parameter.

asfimport commented 2 years ago

Antoine Pitrou / @pitrou:

Another possibility would be to pass a list of inferrable types (so one could exclude float64), in addition to the explicit column_types parameter.

Yes, I think this would be the better solution and would address more use cases.

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: An actual list of inferrable types wouldn't exactly work, by the way, because datatypes are not the granularity at which inference operates (for instance, the inference sequence tries several timestamp variants rather than a single timestamp datatype).

asfimport commented 2 years ago

Thomas Buhrmann / @buhrmann: Or the type inference itself could be exposed in some way, so one could simply read all columns as strings and then apply the underlying inference on a column-by-column basis, combined with additional custom logic. I'm currently building such an inference layer, e.g. one that also infers list types from string columns and timestamps with non-ISO formats, downcasts ints to the smallest possible type, etc. (the uint64 case is the only "problem" I've had so far, fwiw).
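
For illustration, a minimal sketch of such a layer on top of the all-strings read shown earlier (infer_column is a made-up helper for this example, not a pyarrow API):


import io

import pyarrow as pa
import pyarrow.csv


def infer_column(col, candidates):
    # Try casting a string column to each candidate type in order;
    # keep the original strings if no safe cast succeeds.
    # (Illustrative helper, not part of pyarrow.)
    for typ in candidates:
        try:
            return col.cast(typ, safe=True)
        except (pa.ArrowInvalid, pa.ArrowNotImplementedError):
            continue
    return col


data = b"int64,uint64\n0,0\n4294967295,18446744073709551615"
header = data.split(b"\n", 1)[0].decode("utf-8").split(",")

# Read everything as string first (no destructive inference)...
opts = pa.csv.ConvertOptions(column_types={name: pa.string() for name in header})
tbl = pa.csv.read_csv(io.BytesIO(data), convert_options=opts)

# ...then run custom, column-by-column inference.
candidates = [pa.int64(), pa.uint64(), pa.float64()]
tbl = pa.table({name: infer_column(tbl.column(name), candidates)
                for name in tbl.column_names})
print(tbl.schema)  # int64: int64, uint64: uint64 -- no float coercion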

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: Yes, making the inference abstract and overridable would be another possible way. Needs someone to look at the current inference internals and devise an API to make it overridable.

asfimport commented 2 years ago

Thomas Buhrmann / @buhrmann: If the information about the order in which types are tried here is correct, wouldn't it make sense to simply insert the UInt64 type either directly after Int64 or before Float64? I'm not sure about the underlying implementation, but it sounds fairly trivial, and it would be a nice gain in "correctness" (e.g. being able to read Twitter datasets without mangling tweet or author IDs).

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: There are two reasons I think that is a bad idea:

asfimport commented 2 years ago

Thomas Buhrmann / @buhrmann: Yes, fair enough, datasets of multiple CSVs may be awkward. As for how common it is to have 64-bit numbers, I guess the most common use case would be IDs. But those are often best treated as strings anyway, i.e. they're not usually used for calculations. So as long as there's a way (via configuration) for them not to be converted to floats, that would probably be sufficient. (In fact, even in the use case I mentioned earlier, Twitter IDs, the official API recommends treating them as strings: https://developer.twitter.com/en/docs/twitter-ids.)

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: You can always disable inference for a particular column by setting a type in ConvertOptions::column_types.
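
For example, with the repro data from above (explicitly typing the column should let the reader parse the full uint64 range, though that is worth verifying on your pyarrow version):


import io

import pyarrow as pa
import pyarrow.csv

csv = "int64,uint64\n0,0\n4294967295,18446744073709551615"
convert_options = pa.csv.ConvertOptions(
    # An explicit type disables inference for this column.
    column_types={"uint64": pa.uint64()}
)
tbl = pa.csv.read_csv(io.BytesIO(csv.encode("utf-8")),
                      convert_options=convert_options)
print(tbl.schema)               # uint64: uint64
print(tbl.column("uint64")[1])  # 18446744073709551615, no precision loss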

asfimport commented 2 years ago

Antoine Pitrou / @pitrou: cc @wjones127 for possibly researching an API that allows customizing CSV type inference.