BayesWitnesses / m2cgen

Transform ML models into native code (Java, C, Python, Go, JavaScript, Visual Basic, C#, R, PowerShell, PHP, Dart, Haskell, Ruby, F#, Rust) with zero dependencies
MIT License

Accuracies totally different #524

Open · harishprabhala opened this issue 2 years ago

harishprabhala commented 2 years ago

Hi. I am converting a tree model to C using m2cgen. Although the inference latencies are much lower, the accuracies are way off. Here's how I am converting the model and reading the .so file:

import ctypes
from numpy.ctypeslib import ndpointer

import m2cgen as m2c
from xgboost import XGBRFRegressor

num_est = 100

model = XGBRFRegressor(n_estimators=num_est, max_depth=8)
model.fit(X_train, y_train)

# Export the trained model as C source code
code = m2c.export_to_c(model)
len(code)

with open('model.c', 'w') as f:
    f.write(code)

# Compile the generated C code into a shared library
!gcc -Ofast -shared -o lgb_score.so -fPIC model.c
!ls -l lgb_score.so

# Load the shared library and declare the signature of the generated score()
lib = ctypes.CDLL('./lgb_score.so')
score = lib.score
score.restype = ctypes.c_double
score.argtypes = [ndpointer(ctypes.c_double)]

Why is this happening and how can I fix it?
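
One way to narrow this down is to cross-check m2cgen's pure-Python export against the native model first: if the Python export already disagrees with model.predict, the problem lies in the code generation itself rather than in the gcc/ctypes plumbing. A minimal sketch (not from the thread; it assumes model and a float64 test matrix data are in scope):

import m2cgen as m2c

# Generate the model as pure Python and execute it in a scratch namespace;
# the generated module defines a score(input) function
py_code = m2c.export_to_python(model)
ns = {}
exec(py_code, ns)

row = data[0]
print('xgboost:  ', model.predict(row.reshape(1, -1))[0])
print('m2cgen py:', ns['score'](list(row)))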

StrikerRUS commented 2 years ago

Hey @harishprabhala !

Are you able to provide an MRE (minimal reproducible example) for your issue?

harishprabhala commented 2 years ago

import zipfile
import urllib.request as urllib

# Download the YearPredictionMSD dataset and read the single file inside the archive
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00203/YearPredictionMSD.txt.zip'

filehandle, _ = urllib.urlretrieve(url)
zip_file_object = zipfile.ZipFile(filehandle, 'r')
filename = zip_file_object.namelist()[0]
bytes_data = zip_file_object.open(filename).read()

import pandas as pd
from io import BytesIO
from sklearn.model_selection import train_test_split

import numpy as np

year = pd.read_csv(BytesIO(bytes_data), header = None)

#train_size = 463715  # Note: this will extend the training time if we do the full dataset
train_size = 200000
X = year.iloc[:, 1:]
y = year.iloc[:, 0]
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False, train_size=train_size, test_size=51630, random_state=4)

# Store the test data as numpy by pulling the values out of the pandas dataframe
data = np.array(X_test.values)

from xgboost import XGBRFRegressor

num_est = 50

model = XGBRFRegressor(n_estimators=num_est, max_depth=8)
model.fit(X_train, y_train)

import ctypes
import m2cgen as m2c
from numpy.ctypeslib import ndpointer

# Export the trained model as C source code
code = m2c.export_to_c(model)
len(code)

with open('model.c', 'w') as f:
    f.write(code)

# Compile the generated C code into a shared library
!gcc -Ofast -shared -o xgb_score.so -fPIC model.c
!ls -l xgb_score.so

lib = ctypes.CDLL('./xgb_score.so')
score = lib.score
# Define the types of the output and arguments of this function.
score.restype = ctypes.c_double
score.argtypes = [ndpointer(ctypes.c_double)]
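# Note: ndpointer(ctypes.c_double) validates only the dtype of what is passed in.
# A stricter declaration also rejects non-contiguous views (optional hardening;
# rows of a C-contiguous array are already contiguous, so this is not the bug here):
# score.argtypes = [ndpointer(ctypes.c_double, flags='C_CONTIGUOUS')]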

# Predictions from the native xgboost model on the test data
native_predictions = pd.Series(model.predict(data))
native_predictions.tail(20)

# Predictions from the compiled C model on the same rows
compiled_predictions = pd.Series([score(row) for row in data])
compiled_predictions.tail(20)

In the last two commands, you can see that the predictions are completely different.
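
For reference, a quick way to quantify the gap rather than eyeballing the tails (not part of the original report; it assumes model, score, and data from the snippet above are still in scope):

import numpy as np

native = model.predict(data)
compiled = np.array([score(row) for row in data])

diff = np.abs(native - compiled)
print('max abs diff :', diff.max())
print('mean abs diff:', diff.mean())
print('within 1e-6  :', np.allclose(native, compiled, atol=1e-6))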

harishprabhala commented 2 years ago

Hey @StrikerRUS did you get a chance to reproduce the issue?