NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
Apache License 2.0

[BUG] TargetEncoding with multiple target columns causes targets to be switched #1839

Open gabrielspmoreira opened 1 year ago

gabrielspmoreira commented 1 year ago

Describe the bug

When the TargetEncoding op is used with multiple target columns, it may switch the contents of the target columns. The internal statistics (count, sum) saved with the NVT workflow for the target columns are switched as well.
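To make the "switched statistics" claim checkable, here is a minimal pure-Python sketch of the per-category count and sum for each target, keyed explicitly by target name, plus a smoothed encoding built from them. The helper names (`reference_target_stats`, `reference_target_encoding`) and the toy rows are hypothetical and not part of NVTabular. The smoothing formula `(sum + p_smooth * global_mean) / (count + p_smooth)` is my reading of the op's documentation, and k-fold is ignored here, so treat this only as a reference for spotting swapped columns, not as a reimplementation of the op.

```python
from collections import defaultdict


def reference_target_stats(rows, cat_col, target_cols):
    """Return (counts, sums): per-category row count and per-target sum.

    Hypothetical helper, not an NVTabular API. Sums are keyed by target
    name, so they cannot be mixed up between targets.
    """
    counts = defaultdict(int)
    sums = {t: defaultdict(float) for t in target_cols}
    for row in rows:
        key = row[cat_col]
        counts[key] += 1
        for t in target_cols:
            sums[t][key] += row[t]
    return counts, sums


def reference_target_encoding(rows, cat_col, target_cols, p_smooth=10):
    """Smoothed target encoding per target and category (no k-fold).

    Assumed formula: (sum + p_smooth * global_mean) / (count + p_smooth).
    """
    counts, sums = reference_target_stats(rows, cat_col, target_cols)
    encoded = {}
    for t in target_cols:
        global_mean = sum(r[t] for r in rows) / len(rows)
        encoded[t] = {
            key: (sums[t][key] + p_smooth * global_mean) / (counts[key] + p_smooth)
            for key in counts
        }
    return encoded


# Toy data (made up for illustration): the two targets differ per category,
# so a swap would be visible in both the sums and the encoded values.
rows = [
    {"f_2": "a", "is_clicked": 1, "is_installed": 0},
    {"f_2": "a", "is_clicked": 1, "is_installed": 0},
    {"f_2": "b", "is_clicked": 0, "is_installed": 1},
    {"f_2": "b", "is_clicked": 1, "is_installed": 0},
]
targets = ["is_clicked", "is_installed"]
counts, sums = reference_target_stats(rows, "f_2", targets)
encoded = reference_target_encoding(rows, "f_2", targets, p_smooth=10)
```

If the sums stored by the workflow for `is_clicked` match the reference sums for `is_installed` (and vice versa), the statistics have been switched.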

Steps/Code to reproduce bug

    def generate_nvt_workflow_features(self):
        ...
        outputs = reduce(lambda x, y: x + y, list(feats.values()))

        ###################### ADD THIS ######################
        target_encoding = (
            "f_2,f_3,f_4,f_5,f_6,f_7,f_8,f_9,f_10,f_11,f_12,f_13,"
            "f_14,f_15,f_16,f_17,"
            "f_18,f_19,f_20,f_21,f_22,f_23,f_24,f_25,f_26,f_27,f_28"
            ",f_29,f_30,f_31,f_32".split(",")
            >> nvt.ops.TargetEncoding(
                ["is_clicked", "is_installed"],
                kfold=5,
                p_smooth=10,
                out_dtype="float32",
            )
        )

        outputs = outputs + target_encoding
        ######################

        workflow = nvt.Workflow(outputs, client=self.dask_cluster_client)

Expected behavior

TargetEncoding should not switch the values of the target columns, nor the values of the target-encoded features.
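The expectation can be expressed as a small check. The sketch below assumes the target columns have been pulled out as plain Python lists before and after the workflow transform, and that row order is preserved. `find_swapped_columns` is a hypothetical helper and the values are made up for illustration.

```python
def find_swapped_columns(before, after, columns):
    """Return pairs (a, b) whose contents were exchanged between
    `before` and `after`. Both arguments map column name -> list of values.
    Hypothetical helper, not an NVTabular API.
    """
    swapped = []
    for i, a in enumerate(columns):
        for b in columns[i + 1:]:
            unchanged = after[a] == before[a] and after[b] == before[b]
            crossed = after[a] == before[b] and after[b] == before[a]
            # Identical columns satisfy both conditions; do not flag those.
            if crossed and not unchanged:
                swapped.append((a, b))
    return swapped


targets = ["is_clicked", "is_installed"]
before = {"is_clicked": [1, 1, 0, 1], "is_installed": [0, 0, 1, 0]}

# Expected behavior: targets pass through unchanged, so nothing is reported.
ok_result = find_swapped_columns(before, before, targets)

# Behavior described in this issue: the two target columns come back exchanged.
switched = {"is_clicked": [0, 0, 1, 0], "is_installed": [1, 1, 0, 1]}
bad_result = find_swapped_columns(before, switched, targets)
```

The same comparison can be applied to the target-encoded output columns by checking them against independently computed per-target encodings.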

Environment details (please complete the following information):