Closed wonderfreda closed 4 years ago
It could be many things. It depends a lot on your hardware. I recommend trying the multiprocessing scheduler to see if that speeds things up. Processing text is often bound by the GIL.
See https://docs.dask.org/en/latest/scheduling.html
On Tue, Nov 12, 2019 at 7:25 AM wonderfreda notifications@github.com wrote:
While working on the exercise in this notebook on NYC flight average delay time, I tried using dd.read_csv() to read in all the .csv files in one shot, and did the following calculation:
```python
import os
import dask
import dask.dataframe as dd

allfn = os.path.join('data', 'nycflights', '*.csv')
df = dd.read_csv(allfn,
                 parse_dates={'Date': [0, 1, 2]},
                 dtype={'TailNum': str,
                        'CRSElapsedTime': float,
                        'Cancelled': bool})

df.groupby('Origin').DepDelay.mean().compute()
```
but it seems that this takes even longer than the sequential version of the code: 6.51 s on my machine, vs 3.94 s for the sequential code and 1.68 s for the solution.
I understand that this is not directly related to the delayed function discussed here, but I would appreciate help understanding the longer running time when importing all the CSVs together, as I was under the impression that one advantage of Dask dataframes is being able to import many files in one shot to save time. Thank you!
In particular, I think parsing multiple columns into a single datetime with parse_dates={"Date": [0, 1, 2]} requires pandas to hold the GIL.
Closing this, but let us know if you have other questions.