dask / dask-tutorial

Dask tutorial
https://tutorial.dask.org
BSD 3-Clause "New" or "Revised" License

noob at dask, how to read a 3GB stata file? #61

Closed sagar-m closed 6 years ago

sagar-m commented 6 years ago

Hi, I am trying to read a 3 GB Stata file to analyze in Python. I just completed the Dask tutorials on DataCamp.

This code works:

data = pd.read_stata('/Users/sherrymukim/Documents/nfhs/IAHR71DT/IAHR71FL.DTA',chunksize=100000)

But the following takes forever:

for chunk in data:
    print(chunk.shape)

My MacBook has just 2 GB of RAM, and I will be switching to a laptop with more RAM in one month.

How do I even preview the file to know what columns are there?

Please help!!! Thank you.

I have been stuck on this for a week. :-(

mrocklin commented 6 years ago

cc @makmanalp

sagar-m commented 6 years ago

It seems dask.bag and dask.dataframe do not work with Stata files.

mrocklin commented 6 years ago

See https://twitter.com/makmanalp/status/969002735512244224

sagar-m commented 6 years ago

Thank you!!!

makmanalp commented 6 years ago

@mrocklin oops, sorry I missed this, thanks for intervening! @sagar-m be warned that this is currently a hack that relies on internal variables from the guts of pandas, so expect it to break if pandas changes its StataReader class.
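
For the original question about previewing the columns, a minimal sketch that stays on the public pandas API (not the internals-based hack mentioned above) might look like the following. It is an illustration only and is not the code from the linked tweet. The helper names `preview_stata` and `count_rows_in_chunks` are made up for this example, and the default `chunksize` of 10,000 is just a guess at something gentler than 100,000 rows for a machine with 2 GB of RAM.

```python
import pandas as pd


def preview_stata(path, n_rows=5):
    """Return (column names, variable labels, first n_rows rows).

    Hypothetical helper: reads only the first few rows of the file
    instead of loading all of it.
    """
    # iterator=True returns a StataReader rather than a full DataFrame.
    with pd.read_stata(path, iterator=True) as reader:
        head = reader.read(n_rows)         # only the first n_rows rows
        labels = reader.variable_labels()  # dict: variable name -> label
    return list(head.columns), labels, head


def count_rows_in_chunks(path, columns=None, chunksize=10_000):
    """Walk the file chunk by chunk and count rows.

    Hypothetical helper: replace the body of the loop with whatever
    per-chunk work is needed. Passing `columns` keeps each chunk's
    DataFrame limited to the variables of interest.
    """
    total = 0
    with pd.read_stata(path, columns=columns, chunksize=chunksize) as reader:
        for chunk in reader:
            total += len(chunk)
    return total
```

Calling `preview_stata` on the `.DTA` path from the question should give the column list without reading all 3 GB. Once the interesting columns are known, passing them as `columns=[...]` with a smaller `chunksize` may make the chunked loop more manageable, though how much it helps will depend on the file and the pandas version.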