NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
Apache License 2.0
[BUG] PowerLawDistro output does not match power law distribution #1233
Describe the bug
PowerLawDistro (in nvtabular.tools.data_gen) produces data that does not match the expected power-law distribution.
Steps/Code to reproduce bug
Testing with the scipy distribution as a reference:
Returned values can vary, but generally the KS statistic is small and p-value > 0.01, indicating that data generated with the scipy distribution matches scipy's power-law CDF.
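The reference check described above can be sketched roughly as follows. This is a minimal sketch, assuming `scipy.stats.powerlaw` with shape parameter 0.1 and a one-sample Kolmogorov-Smirnov test; the sample size and seed are illustrative choices, not values taken from the report.

```python
from scipy import stats

alpha = 0.1         # shape parameter (scipy names it `a`); value assumed for illustration
n_samples = 10_000  # illustrative sample size, not from the report

# Draw samples from scipy's power-law distribution on [0, 1].
reference = stats.powerlaw.rvs(alpha, size=n_samples, random_state=0)

# One-sample KS test of the samples against scipy's power-law CDF with the same shape.
result = stats.kstest(reference, "powerlaw", args=(alpha,))
print(f"KS statistic: {result.statistic:.4f}, p-value: {result.pvalue:.4f}")
```

Because the samples and the reference CDF come from the same distribution, the statistic should stay small for any seed.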
On the other hand, testing with the nvtabular implementation gives:
The KS statistic is large and the p-value is zero or close to zero, indicating with high certainty that the scipy and nvtabular distributions differ.
At a glance it appears the nvtabular distribution with alpha = 0.1 is actually equivalent to the scipy distribution with alpha = 0.9:
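One way to see how such a swap could arise is sketched below. The sampler here is a hypothetical NumPy stand-in, not NVTabular code: it assumes inverse-transform sampling with exponent 1 / (1 - alpha) on [0, 1], which gives the CDF x^(1 - alpha). Whether PowerLawDistro actually does this is an assumption, not a confirmed diagnosis.

```python
import numpy as np
from scipy import stats


def inverse_transform_power_sample(alpha, size, seed=0):
    """Hypothetical stand-in sampler (not NVTabular code).

    Maps uniform draws u to u ** (1 / (1 - alpha)), so P(X <= x) = x ** (1 - alpha).
    """
    rng = np.random.default_rng(seed)
    u = rng.uniform(0.0, 1.0, size=size)
    return u ** (1.0 / (1.0 - alpha))


samples = inverse_transform_power_sample(alpha=0.1, size=10_000)

# Compare against scipy's power law with the same alpha and with 1 - alpha.
same_alpha = stats.kstest(samples, "powerlaw", args=(0.1,))
complement = stats.kstest(samples, "powerlaw", args=(0.9,))

print(f"vs a=0.1: statistic={same_alpha.statistic:.4f}, p={same_alpha.pvalue:.4g}")
print(f"vs a=0.9: statistic={complement.statistic:.4f}, p={complement.pvalue:.4g}")
```

Under this assumption the samples agree with scipy's a = 0.9 and clearly disagree with a = 0.1, which is the pattern the report describes.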
Expected behavior
I would expect nvtabular and scipy distributions to be statistically similar with the same alpha value, along the lines of:
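A hedged sketch of what such an acceptance check could look like, using a two-sample KS test. The `candidate` array is a placeholder drawn from scipy itself; in a real check it would be replaced by the output of the generator under test. The seeds and sample size are illustrative.

```python
from scipy import stats

alpha = 0.1
n_samples = 10_000  # illustrative sample size

reference = stats.powerlaw.rvs(alpha, size=n_samples, random_state=0)

# Placeholder for the generator under test; here it is simply a second,
# independent scipy draw with the same alpha, so the two samples should match.
candidate = stats.powerlaw.rvs(alpha, size=n_samples, random_state=1)
match = stats.ks_2samp(reference, candidate)

# For contrast: a sample drawn with 1 - alpha should be rejected.
wrong = stats.powerlaw.rvs(1 - alpha, size=n_samples, random_state=2)
mismatch = stats.ks_2samp(reference, wrong)

print(f"same alpha: statistic={match.statistic:.4f}, p={match.pvalue:.4g}")
print(f"1 - alpha:  statistic={mismatch.statistic:.4f}, p={mismatch.pvalue:.4g}")
```

The two-sample form avoids assuming anything about the candidate's analytic CDF; only the samples are compared.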
Environment details
Docker container, using the image nvcr.io/nvidia/merlin/merlin-tensorflow-training:21.09.