dask / dask-tutorial

Dask tutorial
https://tutorial.dask.org
BSD 3-Clause "New" or "Revised" License

01_dask_delayed running time question #136

Closed wonderfreda closed 4 years ago

wonderfreda commented 4 years ago

While working on this notebook's exercise on the NYC flights' average delay time, I tried using dd.read_csv() to read in all the .csv files in one shot, and did the following calculation:

import os
import dask.dataframe as dd

# Glob pattern matching every CSV file in the nycflights data directory
allfn = os.path.join('data', 'nycflights', '*.csv')

# Parse columns 0-2 (year, month, day) together as a single 'Date' column
df = dd.read_csv(allfn, parse_dates={'Date': [0, 1, 2]},
                 dtype={'TailNum': str,
                        'CRSElapsedTime': float,
                        'Cancelled': bool})

df.groupby('Origin').DepDelay.mean().compute()

but it seems that this takes even longer than the sequential version of the code: 6.51 s on my machine, vs. 3.94 s for the sequential code and 1.68 s for the notebook's solution.

I understand that this is not directly related to the delayed function discussed here, but I would appreciate help understanding the longer running time when importing all the CSVs together, as I was under the impression that one advantage of Dask DataFrame is being able to import many files in one shot to save time. Thank you!
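One way to see that the glob read really is split into parallel work is to count the partitions Dask creates; this is a minimal sketch, assuming the same data layout as above (each matched file becomes at least one partition, and each partition is parsed by its own task):

import os
import dask.dataframe as dd

allfn = os.path.join('data', 'nycflights', '*.csv')
df = dd.read_csv(allfn, parse_dates={'Date': [0, 1, 2]},
                 dtype={'TailNum': str,
                        'CRSElapsedTime': float,
                        'Cancelled': bool})

# One or more partitions per matched file; each is parsed as a
# separate task in the graph, so the read itself is parallel.
print(df.npartitions)

Whether those parallel tasks actually run faster than a sequential read then depends on the scheduler and on how much of the parsing can happen outside the GIL, which is what the replies below address.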

mrocklin commented 4 years ago

It could be many things; it depends a lot on your hardware. I recommend trying the multiprocessing scheduler to see if that speeds things up. Processing text is often bound by the GIL.

See https://docs.dask.org/en/latest/scheduling.html
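For reference, a minimal sketch of switching schedulers for this computation (using the public compute()/dask.config APIs; df is the dataframe built above):

import dask

# Option 1: pick the scheduler for a single call.
df.groupby('Origin').DepDelay.mean().compute(scheduler='processes')

# Option 2: set the default scheduler globally.
dask.config.set(scheduler='processes')
df.groupby('Origin').DepDelay.mean().compute()

# 'threads' is the default for dask.dataframe; 'processes' sidesteps
# the GIL at the cost of serializing data between worker processes.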

TomAugspurger commented 4 years ago

In particular, I think that parsing multiple columns into a single datetime with parse_dates={"Date": [0, 1, 2]} requires pandas to hold the GIL.
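One hedged workaround, not from this thread, is to skip the multi-column parse_dates and assemble the datetime afterwards; this sketch assumes the nycflights files name their first three columns Year, Month, and DayofMonth:

import os
import pandas as pd
import dask.dataframe as dd

allfn = os.path.join('data', 'nycflights', '*.csv')

# Read without parse_dates so more of the parsing can happen in
# pandas' C code, outside the GIL.
df = dd.read_csv(allfn,
                 dtype={'TailNum': str,
                        'CRSElapsedTime': float,
                        'Cancelled': bool})

# Assemble the Date column per partition; pd.to_datetime accepts a
# frame with year/month/day columns (column names are assumptions).
df['Date'] = df.map_partitions(
    lambda part: pd.to_datetime(
        part[['Year', 'Month', 'DayofMonth']]
            .rename(columns={'DayofMonth': 'Day'})),
    meta=('Date', 'datetime64[ns]'))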

Closing this, but let us know if you have other questions.