dmlc / tl2cgen

TL2cgen (TreeLite 2 C GENerator) is a model compiler for decision tree models
https://tl2cgen.readthedocs.io/en/latest/
Apache License 2.0

Zero is treated as missing when loading data from a Pandas dataframe #11

Open adamreeve opened 2 years ago

adamreeve commented 2 years ago

I'm using Treelite 2.1.0 via the rapidsai/rapidsai-core-nightly:21.10-cuda11.2-base-ubuntu20.04-py3.8 Docker image. In the code below, I'd expect predictions of [0, 0, 1], since 0 is less than 1, but I get [1, 0, 1] when creating a DMatrix from a DataFrame, because the 0 appears to be treated as a missing value. Using data from a plain numpy ndarray works as expected.

import numpy as np
import treelite
import treelite_runtime
import pandas as pd

builder = treelite.ModelBuilder(num_feature=1, average_tree_output=False)

tree = treelite.ModelBuilder.Tree()
tree[0].set_numerical_test_node(
        0, opname='<', threshold=1.0, default_left=False,
        left_child_key=1, right_child_key=2)
tree[1].set_leaf_node(0.0)
tree[2].set_leaf_node(1.0)
tree[0].set_root()
builder.append(tree)
model = builder.commit()

model.export_lib(toolchain='gcc', libpath='./testmodel.so', verbose=True)

predictor = treelite_runtime.Predictor('./testmodel.so')
test_data = np.array([
    [0.0],
    [0.5],
    [2.0],
], dtype=np.float32)

# Predict with numpy data
dmat = treelite_runtime.DMatrix(test_data)
preds = predictor.predict(dmat)
print(preds)
# Prints: [0. 0. 1.]

# Predict with a Pandas DataFrame
df = pd.DataFrame({'x0': test_data.reshape(-1)})
dmat = treelite_runtime.DMatrix(df)
preds = predictor.predict(dmat)
print(preds)
# Prints: [1. 0. 1.]

I can reproduce the Pandas behaviour with numpy input by setting missing=0.0, but this parameter seems to have no effect when the input is a Pandas DataFrame. Setting missing=np.nan doesn't help either, which is somewhat expected, as that is supposed to be the default already.
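To show why a missing sentinel of 0.0 would produce exactly the observed output, here is a rough NumPy-only simulation of the single-split tree above. It is only a sketch of the routing logic (missing values follow the default direction, which is right because default_left=False), not Treelite's actual implementation, and predict_stump is a made-up name used purely for illustration.

```python
import numpy as np

def predict_stump(values, missing=np.nan, threshold=1.0, default_left=False):
    """Mimic the one-split tree from the report (NOT Treelite's real code)."""
    values = np.asarray(values, dtype=np.float32).reshape(-1)
    # A value counts as missing if it is NaN or equals the chosen sentinel.
    if np.isnan(missing):
        is_missing = np.isnan(values)
    else:
        is_missing = np.isnan(values) | (values == np.float32(missing))
    # Missing values take the default direction; others are tested with '<'.
    goes_left = np.where(is_missing, default_left, values < threshold)
    # Left leaf outputs 0.0, right leaf outputs 1.0.
    return np.where(goes_left, 0.0, 1.0)

test_values = [0.0, 0.5, 2.0]
nan_missing = predict_stump(test_values)                # NaN marks missing
zero_missing = predict_stump(test_values, missing=0.0)  # 0.0 marks missing
print(nan_missing)   # [0. 0. 1.]
print(zero_missing)  # [1. 0. 1.]
```

The second result matches what I get from the DataFrame path, which is consistent with 0 being treated as missing there.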

mhq199657 commented 2 years ago

I am also facing this bug. Predictions from a DMatrix created from a pandas DataFrame are wrong, whereas a DMatrix created from df.to_numpy() works fine.
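For anyone else hitting this, a minimal sketch of the workaround: convert the DataFrame to a dense float32 array first, then build the DMatrix from that array as in the numpy case of the original report. dataframe_to_dense is a hypothetical helper name, and the float32 dtype is an assumption chosen to match the test_data in the report.

```python
import numpy as np
import pandas as pd

def dataframe_to_dense(df):
    """Hypothetical helper: DataFrame -> C-contiguous float32 ndarray."""
    return np.ascontiguousarray(df.to_numpy(dtype=np.float32))

df = pd.DataFrame({'x0': [0.0, 0.5, 2.0]})
dense = dataframe_to_dense(df)
# Zeros survive the conversion as ordinary values rather than missing entries.
# The array can then be passed on as in the report, e.g.:
#   dmat = treelite_runtime.DMatrix(dense)
```

This only sidesteps the DataFrame code path; it doesn't explain why zeros are dropped there.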