NVIDIA / NeMo-Curator

Scalable toolkit for data curation
Apache License 2.0
327 stars 32 forks source link

Stricter check for query planning. #107

Open ayushdg opened 2 weeks ago

ayushdg commented 2 weeks ago

Description

Importing newer versions of dask (tested with 2024.5) set dataframe.query-planning to None instead of True or False, but the None case defaults to using query planning. This PR extends the check to None to raise relevant errors w/ newer versions of dask.

Usage

N/A

Checklist

ayushdg commented 2 weeks ago

cc: @rjzamora If you could take a look

ayushdg commented 2 weeks ago

Is the goal here to raise an error in any case that the user does not explicitly opt out of query-planning? This means all curator users will see this error unless they explicitly set ther config to false ahead of time.

The intended goal is to raise this error whenever Query-Planning is enabled and dask is going down that route. In newer versions of dask, I've noticed that this value is being set to None on importing dask.dataframe, but still going down the query planning route. I now realize that it could be None in the case where dask.dataframe is not imported at all. I'll think about this a bit more, but curious if you see better ways of checking whether query planning is truly enabled.

rjzamora commented 2 weeks ago

I now realize that it could be None in the case where dask.dataframe is not imported at all. I'll think about this a bit more, but curious if you see better ways of checking whether query planning is truly enabled.

Right - I believe you have stumbled upon the primary reason we rapids-24.04 was pinned to such an "old" version of dask: There is currently no reliable way (that I know of) to check if the user has already imported dask.dataframe before dask_cudf is imported :/

That said, it may be possible for us to add a canonical "switch" in rapids-dask-dependency.