dolthub / doltpy

A Python API for Dolt
Apache License 2.0

Parquet export IR #183

Closed max-hoffman closed 2 years ago

max-hoffman commented 2 years ago

Exporting data through a CSV intermediary loses specificity and type information. This is particularly noticeable for read_pandas, where every column of the resulting DataFrame has dtype object and NULLs are indistinguishable from zero values.
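
To illustrate the loss described above, here is a minimal sketch using only the standard library. The rows are hypothetical stand-ins for a Dolt table, not doltpy's actual export path. After a CSV round trip every value comes back as a string, and a NULL cannot be told apart from an empty string.

```python
import csv
import io

# Hypothetical rows standing in for a Dolt table: (id, count, note).
rows = [
    (1, 0, ""),       # a real zero and a real empty string
    (2, None, None),  # SQL NULLs, represented here as None
]


def csv_round_trip(rows):
    """Write rows to an in-memory CSV and read them back."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)  # None is written as an empty field
    buf.seek(0)
    return [tuple(r) for r in csv.reader(buf)]


restored = csv_round_trip(rows)

# Every value is now a str: the integer types are gone.
print(restored)

# The empty-string note and the NULL note are now identical.
print(restored[0][2] == restored[1][2])
```

Any type or NULL information has to be re-inferred by the reader, which is why a typed format is a better intermediary.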

I used a small hack to export data from Dolt into a DataFrame using parquet instead of CSV. This requires the pyarrow dependency.

I left TODOs for improvements on the Dolt side that would make this code cleaner, and opened Dolt issues for the associated features.

There is one known bug with NULL datetime values, for which I opened a Dolt issue.

max-hoffman commented 2 years ago

re: https://github.com/dolthub/doltpy/issues/179

codecov-commenter commented 2 years ago

Codecov Report

Merging #183 (83a5f9e) into main (f3c83cc) will increase coverage by 1.10%. The diff coverage is 95.65%.

@@            Coverage Diff             @@
##             main     #183      +/-   ##
==========================================
+ Coverage   42.88%   43.98%   +1.10%     
==========================================
  Files          23       23              
  Lines         977      998      +21     
==========================================
+ Hits          419      439      +20     
- Misses        558      559       +1     

Impacted Files | Coverage Δ
doltpy/cli/read.py | 97.05% <95.65%> (-2.95%) :arrow_down:
