[BUG] PowerLawDistro output does not match power law distribution

Describe the bug
PowerLawDistro (in nvtabular.tools.data_gen) produces data that does not match the expected power law distribution.

Steps/Code to reproduce bug
Testing with the scipy distribution as a reference:

>>> import scipy.stats
>>> import nvtabular.tools.data_gen as datagen
>>> scipy.stats.kstest(scipy.stats.powerlaw.rvs(0.1, size=131072), 'powerlaw', args=[0.1])
KstestResult(statistic=0.002371134854473267, pvalue=0.4525845317987432)

The returned values vary from run to run, but the KS statistic is generally small and the p-value is above 0.01, so the test does not reject the hypothesis that data generated with the scipy distribution follows scipy's power-law CDF.

On the other hand, testing with the nvtabular implementation gives:

>>> scipy.stats.kstest(datagen.PowerLawDistro(0.1).create_col(131072).to_arrow().to_pylist(), 'powerlaw', args=[0.1])
KstestResult(statistic=0.674772961650085, pvalue=0.0)

Here the KS statistic is large and the p-value is zero or close to zero, so the test strongly rejects the hypothesis that the nvtabular output follows the scipy distribution with the same alpha.
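For readers less familiar with the test, the statistic reported by kstest is the largest vertical gap between the empirical CDF of the samples and the reference CDF. Below is a minimal, dependency-free sketch of that calculation, assuming scipy's power-law CDF F(x) = x**a on [0, 1]. The helper names (powerlaw_cdf, ks_statistic) are illustrative only and are not part of scipy or nvtabular.

```python
import random


def powerlaw_cdf(x, a):
    """CDF of a scipy-style power law on [0, 1]: F(x) = x**a."""
    return x ** a


def ks_statistic(samples, a):
    """One-sample KS statistic: largest gap between the empirical CDF and F."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = powerlaw_cdf(x, a)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


rng = random.Random(0)  # fixed seed so the sketch is repeatable
a = 0.1
# Inverse-transform sampling: if U ~ Uniform(0, 1), then U**(1/a) has CDF x**a.
samples = [rng.random() ** (1.0 / a) for _ in range(20000)]

d_match = ks_statistic(samples, a)       # reference uses the same shape
d_mismatch = ks_statistic(samples, 0.9)  # reference uses the wrong shape
print(d_match, d_mismatch)
```

Under these assumptions d_match should be on the order of 0.01 or below, while d_mismatch should land near 0.675, which is the largest possible gap between x**0.1 and x**0.9 and is close to the statistic reported above.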

At a glance, it appears that the nvtabular distribution with alpha = 0.1 is actually equivalent to the scipy distribution with alpha = 0.9:

>>> scipy.stats.kstest(datagen.PowerLawDistro(0.1).create_col(131072).to_arrow().to_pylist(), 'powerlaw', args=[0.9])
KstestResult(statistic=0.00205615918560631, pvalue=0.6365853093965469)
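One possible explanation, offered as an assumption and not as a reading of the nvtabular source: a generator that raises uniform samples to the power 1 / (1 - alpha) instead of 1 / alpha would produce scipy's power law with shape 1 - alpha. The sketch below illustrates the difference using only the standard library and the known mean a / (a + 1) of a scipy-style power law with shape a. The function sample_by_inverse_cdf is a hypothetical helper, not an nvtabular or scipy API.

```python
import random


def sample_by_inverse_cdf(exponent, n, rng):
    """Raise Uniform(0, 1) draws to `exponent`; the result has CDF x**(1/exponent)."""
    return [rng.random() ** exponent for _ in range(n)]


def mean(values):
    return sum(values) / len(values)


alpha = 0.1
n = 100_000
rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Exponent 1/alpha gives CDF x**alpha, i.e. scipy's powerlaw with shape alpha.
expected = sample_by_inverse_cdf(1.0 / alpha, n, rng)

# ASSUMPTION: exponent 1/(1 - alpha) gives CDF x**(1 - alpha),
# i.e. scipy's powerlaw with shape 1 - alpha (0.9 here).
hypothetical = sample_by_inverse_cdf(1.0 / (1.0 - alpha), n, rng)

# A scipy-style power law with shape a has mean a / (a + 1).
mean_shape_alpha = alpha / (alpha + 1.0)                   # about 0.0909
mean_shape_one_minus_alpha = (1.0 - alpha) / (2.0 - alpha)  # about 0.4737

print(mean(expected), mean_shape_alpha)
print(mean(hypothetical), mean_shape_one_minus_alpha)
```

If the two sample means track the two theoretical means, the observed match with shape 0.9 would be consistent with this kind of exponent mix-up, though only the nvtabular source can confirm the actual cause.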

Expected behavior
I would expect the nvtabular and scipy distributions to be statistically similar for the same alpha value, along the lines of:

>>> scipy.stats.kstest(datagen.PowerLawDistro(0.1).create_col(131072).to_arrow().to_pylist(), 'powerlaw', args=[0.1])
KstestResult(statistic=0.00205615918560631, pvalue=0.6365853093965469)

Environment details
Docker container, using the image nvcr.io/nvidia/merlin/merlin-tensorflow-training:21.09.

NVIDIA-Merlin / NVTabular, issue #1233