NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
Apache License 2.0

[BUG] PowerLawDistro output does not match power law distribution #1233

Open vysarge opened 2 years ago

vysarge commented 2 years ago

Describe the bug

PowerLawDistro (in nvtabular.tools.data_gen) produces data that does not match the expected power-law distribution.

Steps/Code to reproduce bug

Testing with the scipy distribution as a reference:

>>> import scipy.stats
>>> import nvtabular.tools.data_gen as datagen
>>> scipy.stats.kstest(scipy.stats.powerlaw.rvs(0.1, size=131072), 'powerlaw', args=[0.1])
KstestResult(statistic=0.002371134854473267, pvalue=0.4525845317987432)

The returned values vary between runs, but the KS statistic is generally small and the p-value is above 0.01, which is consistent with data generated by the scipy distribution following scipy's power-law CDF.

On the other hand, testing with the nvtabular implementation gives:

>>> scipy.stats.kstest(datagen.PowerLawDistro(0.1).create_col(131072).to_arrow().to_pylist(), 'powerlaw', args=[0.1])
KstestResult(statistic=0.674772961650085, pvalue=0.0)

The KS statistic is large and the p-value is zero or close to zero, indicating with high certainty that the scipy and nvtabular distributions differ.

At a glance it appears the nvtabular distribution with alpha = 0.1 is actually equivalent to the scipy distribution with alpha = 0.9:

>>> scipy.stats.kstest(datagen.PowerLawDistro(0.1).create_col(131072).to_arrow().to_pylist(), 'powerlaw', args=[0.9])
KstestResult(statistic=0.00205615918560631, pvalue=0.6365853093965469)
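The alpha versus 1 - alpha relationship can be checked without a GPU or nvtabular. The sketch below is only an illustration under stated assumptions: it assumes the generator draws values by inverse-transform sampling with exponent 1 - alpha (a guess at the cause, not the confirmed NVTabular code), and it relies on scipy's powerlaw CDF being x**a on [0, 1]. The helpers sample_inverse_cdf and ks_statistic are hypothetical names written for this example, and the KS statistic is computed by hand so that only the standard library is needed.

```python
import random


def sample_inverse_cdf(exponent, n, rng):
    """Draw n samples on (0, 1] whose CDF is F(x) = x**exponent."""
    # Inverse-transform sampling: if U ~ Uniform(0, 1],
    # then U**(1/exponent) has CDF x**exponent.
    return [(1.0 - rng.random()) ** (1.0 / exponent) for _ in range(n)]


def ks_statistic(samples, cdf):
    """One-sample Kolmogorov-Smirnov statistic: sup |ECDF(x) - cdf(x)|."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The ECDF jumps from i/n to (i+1)/n at x; check both sides of the jump.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


alpha = 0.1
rng = random.Random(0)

# ASSUMPTION: stand-in for the suspected behavior (exponent 1 - alpha, not alpha).
suspected = sample_inverse_cdf(1.0 - alpha, 131072, rng)

# Compare against the power-law CDF x**a for a = alpha and a = 1 - alpha.
d_same = ks_statistic(suspected, lambda x: x ** alpha)
d_flipped = ks_statistic(suspected, lambda x: x ** (1.0 - alpha))

print("KS statistic vs a = alpha:    ", d_same)
print("KS statistic vs a = 1 - alpha:", d_flipped)
```

If the assumption holds, the first statistic should come out large, roughly 0.67 as in the failing test above, and the second should be small. That pattern would match a density proportional to x**(-alpha), whose CDF is x**(1 - alpha), being compared against scipy's convention of a density proportional to x**(a - 1).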

Expected behavior

I would expect the nvtabular and scipy distributions to be statistically similar for the same alpha value, along the lines of:

>>> scipy.stats.kstest(datagen.PowerLawDistro(0.1).create_col(131072).to_arrow().to_pylist(), 'powerlaw', args=[0.1])
KstestResult(statistic=0.00205615918560631, pvalue=0.6365853093965469)

Environment details

Docker container, using the image nvcr.io/nvidia/merlin/merlin-tensorflow-training:21.09.

EvenOldridge commented 2 years ago

@albert17 Can you take a look? Per @vysarge, I think we've set this to 1-alpha instead of alpha.