dolthub / doltpy

A Python API for Dolt
Apache License 2.0

Use SQL data types when reading into pandas #179

Open · alkamid opened this issue 2 years ago

alkamid commented 2 years ago

This is a feature request. Currently, read_pandas and read_pandas_sql use a SQL -> CSV -> pandas conversion behind the scenes. It would be great if the resulting DataFrame could inherit some information from the SQL table schema, such as column data types.

Minimal example:

dolt_database> CREATE TABLE mytable (
            -> id int NOT NULL,
            -> is_human bool(1),
            -> PRIMARY KEY (id)
            -> );
dolt_database> INSERT INTO mytable
            -> VALUES (1, 1);
Query OK, 1 row affected
dolt_database> INSERT INTO mytable VALUES (2, 0);
Query OK, 1 row affected
dolt_database> SELECT * FROM mytable;
+----+----------+
| id | is_human |
+----+----------+
| 1  | 1        |
| 2  | 0        |
+----+----------+
In [5]: read_pandas_sql(dolt, "SELECT * from mytable")
Out[5]:
  id is_human
0  1        1
1  2        0

In [7]: df.dtypes
Out[7]:
id          object
is_human    object
dtype: object

If this feature is implemented, the result would be:

In [11]: df.dtypes
Out[11]:
id          int64
is_human     bool
dtype: object

I appreciate that this is a time-consuming change. I believe, though, that many use cases would benefit. In my case, I often use a "read table via dolt, update dataframe in pandas, write updated table via dolt" workflow. Having more intelligent data type resolution would help me avoid mixed dtypes within a column. It would also reduce queries like:

df[df["my_bool_column"] == "1"]

Instead, I'd use

df[df["my_bool_column"]]
max-hoffman commented 2 years ago

Hi @alkamid, we have added a greater variety of export formats since originally shipping Doltpy, including parquet. I believe the small fix below should add some types and Nones back to your DataFrames. Strings still appear to be cast as objects, and bools are parsed as int64s at the moment:

tmp3> create table mytable (id int primary key, name varchar(20), comment varchar(20), is_human bool(1), dub double);
tmp3> insert into mytable values (1, 'Alice', 'A', 1, 1.0), (2, 'Bob', '', 0, 3.4), (3, 'Charlie', NULL, 0, 1.1);
Query OK, 3 rows affected
tmp3> select * from mytable;
+----+---------+---------+----------+-----+
| id | name    | comment | is_human | dub |
+----+---------+---------+----------+-----+
| 1  | Alice   | A       | 1        | 1   |
| 2  | Bob     |         | 0        | 3.4 |
| 3  | Charlie | NULL    | 0        | 1.1 |
+----+---------+---------+----------+-----+
> dolt table export mytable mytable.parquet
Successfully exported data.
In [1]: import pandas as pd
In [2]: df = pd.read_parquet("mytable.parquet")

In [3]: df
Out[3]:
   id     name comment  is_human  dub
0   1    Alice       A         1  1.0
1   2      Bob                 0  3.4
2   3  Charlie    None         0  1.1

In [4]: df.dtypes
Out[4]:
id            int64
name         object
comment      object
is_human      int64
dub         float64
dtype: object

Is this level of detail useful for you? The long tail of Python/SQL type conversions is harder to cover in the short term.

alkamid commented 2 years ago

Thanks for the reply @max-hoffman. If I understood your suggestion correctly, a replacement for read_pandas using parquet instead of csv would be something like:

from pathlib import Path
from doltpy.cli.dolt import DoltT
import pandas as pd
from tempfile import TemporaryDirectory

def read_pandas_new(dolt: DoltT, table: str) -> pd.DataFrame:
    with TemporaryDirectory() as tmpdir:
        fpath = Path(tmpdir) / "temporary_dolt_export.parquet"
        dolt.table_export(table, filename=str(fpath))
        df = pd.read_parquet(fpath)
    return df

This indeed seems objectively better to me in terms of data correspondence between MySQL and the DataFrame. However, the tradeoff is the additional dependency (pyarrow or fastparquet). Do you think the CSV-based method should stay the default because it fits more use cases? If so, would you consider parameterising read_pandas so that it can use different intermediate representations of SQL tables?
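
To sketch what I mean by parameterising (hypothetical name and signature, reusing table_export from the snippet above):

from pathlib import Path
from tempfile import TemporaryDirectory

import pandas as pd
from doltpy.cli.dolt import DoltT

def read_pandas_fmt(dolt: DoltT, table: str, fmt: str = "csv") -> pd.DataFrame:
    # Hypothetical helper: pick the intermediate representation per call.
    with TemporaryDirectory() as tmpdir:
        fpath = Path(tmpdir) / f"temporary_dolt_export.{fmt}"
        dolt.table_export(table, filename=str(fpath))
        if fmt == "parquet":
            return pd.read_parquet(fpath)  # needs pyarrow or fastparquet
        return pd.read_csv(fpath)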

max-hoffman commented 2 years ago

Good points about the tradeoffs @alkamid, and thank you for the suggestions. In our case, I think we would lean towards including one of the two dependencies in doltpy, plus an option to choose the intermediate serialization format.

Our two broad buckets of use are 1) raw SQL, like @timsehn suggested in Discord, or 2) using dolt CLI calls as a shim into existing data science libraries. We are investing heavily in 1), because that will become increasingly useful over time. We have less control over 2) in the near term; doltpy will lean on 3rd-party dependencies (like pyarrow) while we encourage people to take the SQL route or use the dependency-less doltcli.
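
For concreteness, route 1) looks roughly like this against a running dolt sql-server, with pandas mapping the SQL types directly (connection details below are placeholders, and pymysql is just one possible driver):

import pandas as pd
from sqlalchemy import create_engine

# dolt sql-server speaks the MySQL wire protocol, so any MySQL client works.
engine = create_engine("mysql+pymysql://root@127.0.0.1:3306/dolt_database")
df = pd.read_sql("SELECT * FROM mytable", engine)
print(df.dtypes)  # numeric columns come back as numeric dtypes rather than object
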

fedderw commented 2 years ago

Although I don't have the technical knowledge to follow what @max-hoffman just said, I do second this issue.

For hospital-price-transparency-v3, CMS' certification numbering system (hospitals.cms_certification_num) uses leading zeros, which had to be restored with zfill() or something like that.

I found the name dolt.read_pandas() to be misleading relative to what I expected from the method.

I would have assumed that correct typing was built in, not that it was just a wrapper around pd.read_csv() without the args I needed.
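
The workaround was something along these lines, if I remember correctly (the file name is a placeholder and 6 is a guess at the padded width):

import pandas as pd

# Read the column as a string so the leading zeros survive...
df = pd.read_csv("hospitals.csv", dtype={"cms_certification_num": str})
# ...or re-pad after a plain read:
# df["cms_certification_num"] = df["cms_certification_num"].astype(str).str.zfill(6)
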

max-hoffman commented 2 years ago

@fedderw The next doltpy release will let you specify read_pandas(dolt, table, asof, fmt="parquet"). I ran a small test and it seems like the same fix should maintain zero padding for char types.
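
Rough usage once that release is out (the exact import path here is my assumption and may differ):

from doltpy.cli import Dolt
from doltpy.cli.read import read_pandas

dolt = Dolt("path/to/dolt_database")  # placeholder path
df = read_pandas(dolt, "mytable", fmt="parquet")
df.dtypes  # id: int64, dub: float64, etc., instead of everything as object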

fedderw commented 2 years ago

This works great!