apple / turicreate

Turi Create simplifies the development of custom machine learning models.
BSD 3-Clause "New" or "Revised" License

Reading CSV file with int column using scientific notation #885

Open TobyRoseman opened 5 years ago

TobyRoseman commented 5 years ago

If you read an int column where some of the values use scientific notation, the column is interpreted as a float column. If you give an integer type hint, you get incorrect results (the parser ignores the exponent).

Using the following as test.csv:

id,name
204472098,foo
2.2E+11,bar

In [1]: import turicreate as tc

In [2]: tc.SFrame.read_csv('/tmp/test.csv')
Finished parsing file /tmp/test.csv
Parsing completed. Parsed 2 lines in 0.029181 secs.
------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[float,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
Finished parsing file /tmp/test.csv
Parsing completed. Parsed 2 lines in 0.006446 secs.
Out[2]: 
Columns:
    id  float
    name    str

Rows: 2

Data:
+----------------+------+
|       id       | name |
+----------------+------+
|  204472098.0   | foo  |
| 220000000000.0 | bar  |
+----------------+------+
[2 rows x 2 columns]

In [3]: tc.SFrame.read_csv('/tmp/test.csv', column_type_hints=[int,str])
Finished parsing file /tmp/test.csv
Parsing completed. Parsed 2 lines in 0.005768 secs.
Out[3]: 
Columns:
    id  int
    name    str

Rows: 2

Data:
+-----------+------+
|     id    | name |
+-----------+------+
| 204472098 | foo  |
|     2     | bar  |
+-----------+------+
[2 rows x 2 columns]
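For reference, the value the int hint would be expected to produce for the second row can be checked in plain Python. This is only a sketch of the intended semantics, not Turi Create's parser; the helper name `parse_int_field` is hypothetical.

```python
from decimal import Decimal


def parse_int_field(text):
    """Parse a CSV field as an int, accepting scientific notation such as '2.2E+11'."""
    value = Decimal(text.strip())
    # Reject values that have a fractional part once the exponent is applied.
    if value != value.to_integral_value():
        raise ValueError(f"not an integer value: {text!r}")
    return int(value)


# The two id values from test.csv.
print(parse_int_field("204472098"))
print(parse_int_field("2.2E+11"))
```

Using `Decimal` rather than `float` keeps the conversion exact for integers of any size.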

znation commented 5 years ago

Workaround: read as float, and convert to int after reading (with astype).
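A minimal sketch of that workaround, assuming the two-column layout of test.csv above. The function name `read_csv_int_column` is hypothetical; `SFrame.read_csv` with `column_type_hints` and `SArray.astype` are the calls referred to in this thread.

```python
def read_csv_int_column(path, column="id"):
    """Read the CSV with the id column as float, then convert that column to int."""
    # Imported inside the function so this sketch can be loaded
    # on a machine where turicreate is not installed.
    import turicreate as tc

    # Parse as float so the exponent in values like 2.2E+11 is honoured.
    # The hint list assumes the (id, name) column order of test.csv.
    sf = tc.SFrame.read_csv(path, column_type_hints=[float, str])

    # Convert the float column to int after reading.
    sf[column] = sf[column].astype(int)
    return sf
```

Usage would be `read_csv_int_column('/tmp/test.csv')`. Note that a float64 represents integers exactly only up to 2**53, so ids larger than that may lose precision with this approach.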