altair-viz / vega_datasets

A Python package for online & offline access to vega datasets
MIT License

`zipcodes()` returns a dataframe with incorrect dtype. #16

Closed: yy closed this issue 6 years ago

yy commented 6 years ago
from vega_datasets import data
zipcodes = data.zipcodes()
print(zipcodes.zip_code.dtype)

Expected: dtype('O') or rather CategoricalDtype(categories=['00501', '00544', ....

Actual: dtype('int64')

Some ZIP codes start with "0", and data.zipcodes() parses the column as integers, which strips the leading zeros. The following workaround works, but I think it would be better to return the correct dtypes by default.

zipcodes = data.zipcodes(dtype={'zip_code': 'category'})

I also found that data.unemployment() does not parse its data correctly by default. The separator has to be specified explicitly: data.unemployment(sep='\t').
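For anyone who wants to see both problems without downloading the datasets, here is a minimal sketch using plain pandas. The rows below are made-up stand-ins held in memory (an assumption for illustration, not the real file contents); only the column name zip_code and the sep='\t' argument come from the report above.

```python
import io

import pandas as pd

# Made-up sample rows standing in for the real zipcodes file (assumption).
csv_text = "zip_code,city\n00501,Holtsville\n00544,Holtsville\n10001,New York\n"

# Default type inference: the column is read as integers, so "00501" becomes 501.
inferred = pd.read_csv(io.StringIO(csv_text))

# Explicit dtypes: the values stay textual and keep their leading zeros.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"zip_code": str})
as_category = pd.read_csv(io.StringIO(csv_text), dtype={"zip_code": "category"})

# Made-up tab-separated rows standing in for the unemployment file (assumption).
tsv_text = "id\trate\n1001\t0.097\n1003\t0.091\n"

# With the default comma separator, each line collapses into a single column.
wrong = pd.read_csv(io.StringIO(tsv_text))

# With sep="\t", the two columns are split as intended.
right = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(inferred["zip_code"].tolist())
print(as_text["zip_code"].tolist())
print(wrong.shape, right.shape)
```

Passing the dtype at read time matters because the zeros are lost during parsing; converting the integer column to strings afterwards would not bring them back.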

jakevdp commented 6 years ago

Thanks – these can be fixed by creating classes for each of these datasets in https://github.com/altair-viz/vega_datasets/blob/master/vega_datasets/core.py with appropriately modified values of _pd_read_kwds.

If you're interested in submitting a pull request to fix these, I'd be happy to help you get started.
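As a rough illustration of the approach suggested above, the sketch below shows how per-dataset classes could carry their own default read options. It is not the code in core.py: the Dataset base class, its _raw attribute, and the sample rows are hypothetical stand-ins, and only the attribute name _pd_read_kwds is taken from the comment above.

```python
import io

import pandas as pd


class Dataset:
    """Hypothetical stand-in for a dataset loader base class."""

    # Raw file contents; a real loader would read from disk or a URL (assumption).
    _raw = ""
    # Default keyword arguments forwarded to pandas.read_csv.
    _pd_read_kwds = {}

    def __call__(self, **kwargs):
        # Start from the per-dataset defaults, then let the caller override them.
        kwds = dict(self._pd_read_kwds)
        kwds.update(kwargs)
        return pd.read_csv(io.StringIO(self._raw), **kwds)


class Zipcodes(Dataset):
    # Made-up rows; read zip_code as text so leading zeros survive.
    _raw = "zip_code,city\n00501,Holtsville\n10001,New York\n"
    _pd_read_kwds = {"dtype": {"zip_code": "object"}}


class Unemployment(Dataset):
    # Made-up rows; the data is tab-separated, so set the separator by default.
    _raw = "id\trate\n1001\t0.097\n1003\t0.091\n"
    _pd_read_kwds = {"sep": "\t"}


zipcodes = Zipcodes()()
unemployment = Unemployment()()
print(zipcodes["zip_code"].tolist())
print(list(unemployment.columns))
```

Merging the defaults before the caller's keyword arguments keeps explicit overrides working, so a call such as Zipcodes()(dtype={"zip_code": "category"}) would still behave like the workaround in the original report.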

jakevdp commented 6 years ago

Fixed in #17 and #18