NVIDIA / NeMo-Curator

Scalable toolkit for data curation
Apache License 2.0
329 stars, 32 forks

Fix metadata inference with pandas and dask #35

Closed: ryantwolf closed this 2 months ago

ryantwolf commented 2 months ago

Prevents Dask from passing pd.NA to the filters when it runs type inference on the scoring and filtering functions. Also fixes some issues in task decontamination when exploding columns that hold pandas 2.0 strings.

With task decontamination, we convert the document text column (dtype=string) to a list of split documents (dtype=object). When explode is called on this column of split documents, the column keeps its object dtype even though it now contains only strings. We need to recast the column back to string for newer versions of pandas/dask, where string and object are different dtypes.
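The dtype behavior described above can be reproduced in plain pandas. This is a small sketch with made-up data and a whitespace split standing in for the real document splitting; the column name `text` is an assumption for illustration.

```python
import pandas as pd

# Start with a string-dtype text column (sample data is made up).
docs = pd.DataFrame(
    {"text": pd.Series(["first second third", "fourth fifth"], dtype="string")}
)

# Splitting produces Python lists, so the column is now object dtype.
docs["text"] = docs["text"].str.split()

# explode flattens the lists into one row per piece, but the dtype stays object.
exploded = docs.explode("text", ignore_index=True)

# Recast explicitly so downstream code sees a string column again.
recast = exploded.astype({"text": "string"})
```

After the recast, `recast["text"].dtype` is the pandas string dtype again, while `exploded["text"].dtype` is still object.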

ayushdg commented 2 months ago

cc: @rjzamora if you want to take a look