dolthub / doltpy

A Python API for Dolt
Apache License 2.0
55 stars 13 forks source link

Convert NULL values to NaN when reading into pandas DataFrames #180

Closed alkamid closed 2 years ago

alkamid commented 2 years ago

This is a feature request (however, the current behaviour can lead to bugs as in the example above).

One consequence of SQL->CSV->DataFrame conversion is that NULL values are converted into empty strings:

dolt_database> SELECT * FROM mytable;
+----+---------+---------+
| id | name    | comment |
+----+---------+---------+
| 1  | Alice   | A       |
| 2  | Bob     |         |
| 3  | Charlie | NULL    |
+----+---------+---------+
In [16]: df = read_pandas_sql(dolt, "SELECT * from mytable")

In [17]: df
Out[17]:
  id     name comment
0  1    Alice       A
1  2      Bob
2  3  Charlie

In [18]: df.at[1, "comment"]
Out[18]: ''

In [19]: df.at[2, "comment"]
Out[19]: ''

This makes it impossible to differentiate between empty cell contents and NULL values.

With this feature implemented, the expected result would be:

In [22]: df
Out[22]:
  id     name comment
0  1    Alice       A
1  2      Bob
2  3  Charlie     NaN
max-hoffman commented 2 years ago

See https://github.com/dolthub/doltpy/issues/179#issuecomment-1142679492

alkamid commented 2 years ago

Let's continue the conversation in #179 as the two issues will potentially share the solution.