blaze / odo

Data Migration for the Blaze Project
http://odo.readthedocs.org/
BSD 3-Clause "New" or "Revised" License

sas sas7bdat, stata .dta formats <-> HDF5 #129

Open benjello opened 9 years ago

benjello commented 9 years ago

Is there any plan to extend the odo data migrator to the SAS and Stata data formats? For the time being, one has to go through a pandas.DataFrame simply to convert these files to HDF5.

mrocklin commented 9 years ago

There was some work for sas using the sas7bdat library here

As always we would love contributions here. If you know a nice way to interact with sas and stata files through Python we would love to have you add those to odo.

talumbau commented 9 years ago

One difficulty is that sas7bdat is a closed format, so we rely on the to_data_frame capability in the sas7bdat package.

benjello commented 9 years ago

@mrocklin @talumbau: for my needs I use to_data_frame from the sas7bdat package and read_stata/to_stata from pandas. I would definitely use a more flexible tool that can deal with very large tables (bigger than available core memory).

mrocklin commented 9 years ago

The sas7bdat library referred to above does provide limited support for bigger-than-memory access through a Python iterator, which might be a bit slower than pulling out dataframes explicitly. This is already in odo, which should use this library if you have sas7bdat installed.

Note that sas7bdat is new and incomplete. Odo performance is limited by sas7bdat's coverage.
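
To illustrate the iterator-based approach described above, here is a minimal sketch of consuming a row iterator in fixed-size batches so that only one batch is held in memory at a time. It uses only the standard library. The helper name `batched_rows` and the stand-in data are hypothetical; a real sas7bdat reader object would take the place of `fake_rows`, and how that reader is opened is not shown in this thread.

```python
from itertools import islice


def batched_rows(rows, batch_size):
    """Yield lists of at most `batch_size` rows from any iterable of rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        # islice pulls only the next `batch_size` rows from the iterator,
        # so the full dataset is never materialized at once.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for a row iterator over a SAS file (hypothetical data).
fake_rows = ([i, i * 10] for i in range(7))
batches = list(batched_rows(fake_rows, batch_size=3))
```

Each batch could then be appended to an on-disk store before the next one is read, which is the general pattern for bigger-than-memory migration.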

benjello commented 9 years ago

Thank you @mrocklin . I will give it a try ASAP.

bashtage commented 9 years ago

@benjello The most recent release of pandas supports reading stata files using an iterator, so very large files can be sequentially imported.
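
As a rough sketch of the sequential import mentioned above, the snippet below reads a Stata file in chunks via the `chunksize` argument of `pandas.read_stata`. The helper name `iter_stata_chunks`, the file name, and the tiny example table are assumptions made for illustration; the throwaway `.dta` file is written first only so the example is self-contained.

```python
import os
import tempfile

import pandas as pd


def iter_stata_chunks(path, chunksize):
    """Yield DataFrames of at most `chunksize` rows from a Stata .dta file."""
    # Passing chunksize makes read_stata return an iterable reader
    # instead of loading the whole file into one DataFrame.
    with pd.read_stata(path, chunksize=chunksize) as reader:
        for chunk in reader:
            yield chunk


# Write a small throwaway .dta file (hypothetical data) to read back.
tmpdir = tempfile.mkdtemp()
dta_path = os.path.join(tmpdir, "example.dta")
example = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [0.5, 1.5, 2.5, 3.5, 4.5]})
example.to_stata(dta_path, write_index=False)

# Only one chunk is held in memory at a time inside the loop.
chunk_sizes = [len(chunk) for chunk in iter_stata_chunks(dta_path, chunksize=2)]
total_rows = sum(chunk_sizes)
```

In a real migration, each chunk would be appended to the target store (for example an HDF5 table) inside the loop rather than just counted.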

cpcloud commented 9 years ago

This would be a nice thing to add to odo, and not incredibly difficult if anyone is interested in contributing. AFAIK, there isn't a strong motivation from our side to implement this so it would need an interested individual to implement it. I'm happy to help guide anyone through the process of adding this.

makmanalp commented 9 years ago

FYI, pandas reads stata with a patched version of pyDTA: http://pandas.pydata.org/pandas-docs/stable/io.html#stata-format

https://github.com/pydata/pandas/blob/master/pandas/io/stata.py