NVIDIA-Merlin / core

Core Utilities for NVIDIA Merlin
Apache License 2.0
19 stars 14 forks source link

Add row_group_size argument to Dataset.to_parquet #218

Closed rjzamora closed 1 year ago

rjzamora commented 1 year ago

Adds simple row_group_size argument to Dataset.to_parquet, allowing users to set the maximum number of rows desired in a single Parquet row-group.

rjzamora commented 1 year ago

cc @rnyak

rnyak commented 1 year ago

@rjzamora looks like some tests are failing.

rjzamora commented 1 year ago

looks like some tests are failing.

Right, I guess CI is using a pretty old version of cudf - The necessary feature was exposed in cudf in 22.08 (https://github.com/rapidsai/cudf/pull/10980). I guess we could just raise an error when the user tries to use this option with an older cudf release.

karlhigley commented 1 year ago

I think it's reasonable to require a new enough version of cudf instead of making code changes to accommodate older versions. I'm working on getting the CI image back up to date—with any luck, hopefully in the next week or so. (It's a slow process since it involves waiting on a lot of container builds.)

rjzamora commented 1 year ago

I think it's reasonable to require a new enough version of cudf instead of making code changes to accommodate older versions. I'm working on getting the CI image back up to date—with any luck, hopefully in the next week or so. (It's a slow process since it involves waiting on a lot of container builds.)

That makes sense to me - Thanks @karlhigley !

rnyak commented 1 year ago

@rjzamora and @karlhigley when do you think this PR can be revisited based on new cudf version comment? thanks.

karlhigley commented 1 year ago

The CI image has been updated to match our containers, so hopefully the tests will pass now 🤞🏻