databrickslabs / dbldatagen

Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for tests, POCs, and other uses in Databricks environments, including in Delta Live Tables pipelines.
https://databrickslabs.github.io/dbldatagen
Other
351 stars 60 forks

Using text generator resulting in error #299

Open anoopnarang opened 1 month ago

anoopnarang commented 1 month ago

Expected Behavior

Generating a column with a text generator should work without error.

Current Behavior

The job fails with the following error:

  File "./dependencies.zip/dbldatagen/text_generators.py", line 881, in pandasGenerateText
    results = self.generateText(rows, rows.size)
  File "./dependencies.zip/dbldatagen/text_generators.py", line 768, in generateText
    para_stats = np.clip(para_stats_raw, self._minValues, self._maxValues, out=stats_array)
  File "/usr/local/lib64/python3.9/site-packages/numpy/_core/fromnumeric.py", line 2247, in clip
    return _wrapfunc(a, 'clip', a_min, a_max, out=out, **kwargs)
  File "/usr/local/lib64/python3.9/site-packages/numpy/_core/fromnumeric.py", line 66, in _wrapfunc
    return _wrapit(obj, method, *args, **kwds)
  File "/usr/local/lib64/python3.9/site-packages/numpy/_core/fromnumeric.py", line 46, in _wrapit
    result = getattr(arr, method)(*args, **kwds)
  File "/usr/local/lib64/python3.9/site-packages/numpy/_core/_methods.py", line 108, in _clip
    return um.clip(a, min, max, out=out, **kwargs)
numpy._core._exceptions._UFuncOutputCastingError: Cannot cast ufunc 'clip' output from dtype('float64') to dtype('uint8') with casting rule 'same_kind'
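
The final line of the traceback says that `clip` produced `float64` values and was asked to write them into a `uint8` output array, which the default `same_kind` casting rule does not allow. The sketch below illustrates that rule with plain NumPy, outside of `dbldatagen`. The array names and clip bounds are hypothetical stand-ins, not the library's internals, and the two workarounds are possibilities to consider rather than a confirmed fix for this issue.

```python
import numpy as np

# Hypothetical stand-ins for the generator's arrays: rounded float64 counts
# and a preallocated uint8 output buffer.
raw = np.round(np.array([0.4, 2.6, 7.9, 300.0]))
out = np.empty(raw.shape, dtype=np.uint8)

# float64 -> uint8 is not a "same_kind" cast, which is why writing the
# float64 result of clip into a uint8 `out` array is rejected by default.
same_kind_ok = np.can_cast(np.float64, np.uint8, casting="same_kind")
print(same_kind_ok)  # False

# Possible workaround A: request the lossy cast explicitly.
clipped_a = np.clip(raw, 1, 6, out=out, casting="unsafe")

# Possible workaround B: clip in float64, then convert explicitly.
clipped_b = np.clip(raw, 1, 6).astype(np.uint8)

print(clipped_a.tolist(), clipped_b.tolist())  # [1, 3, 6, 6] [1, 3, 6, 6]
```

Both variants only make sense when the clip bounds fit inside the target integer type, as they do here.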

Steps to Reproduce (for bugs)

Install dbldatagen using `pip install dbldatagen`.

Generate a custom dataset with a text generator column:

 .withColumn("essay", text=dg.ILText(paragraphs=(1, 4), sentences=(2, 6)), random=True)

Context

Creating a regular dataset with a text column throws this error. Other types of columns work fine. I think AWS EMR Serverless uses a newer version of NumPy by default, which is not compatible with dbldatagen.

Your Environment
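
The environment details were left blank in the report. A minimal sketch for collecting them on the cluster could look like the following. The helper name `installed_version` is hypothetical, and the sketch only reports what is installed without asserting which versions are compatible.

```python
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Collect the versions most relevant to this error.
report = {"python": sys.version.split()[0]}
for name in ("numpy", "pandas", "pyspark", "dbldatagen"):
    report[name] = installed_version(name)

for name, version in report.items():
    print(f"{name}: {version}")
```

Running this inside the same job or notebook that raises the error shows the versions actually picked up at runtime, which may differ from those on a local machine.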

ronanstokes-db commented 1 month ago

Is this on a Databricks runtime environment? If so, the versions of NumPy and pandas used are determined by the Databricks runtime.

Which version of the Databricks runtime was being used?