autodeployai / pmml4s

PMML scoring library for Scala
https://www.pmml4s.org/
Apache License 2.0

Model Scoring Differences Between Python/jpmml and pmml4s #31

Closed zackyenchik closed 2 weeks ago

zackyenchik commented 3 weeks ago

Hello! I have a model trained with scikit-learn and exported to PMML with sklearn2pmml. For some reason, the model scores differently between Python/jpmml and pmml4s:

Python               PMML
0.48976458967678604  0.48976458967678604
0.7660308225499471   0.7660308225499471
0.38325820040056524  0.38325820040056524
0.38607212482501463  0.38607212482501463
0.49769546260665454  0.49769546260665454

pmml4s
0.26909560427731416
0.24756982049868986
0.24974076556763675
0.254523400551614
0.18821498178901241

Some code snippets that may be helpful:

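from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer
from sklearn_pandas import DataFrameMapper
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

df_mapper = DataFrameMapper([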
    (['bot_reference'], None),
    (['favorites_count'], None),
    (['followers_count'], None),
    (['friends_count'], None),
    (['has_description'], None),
    (['has_location'], None),
    (['last_status_hashtags'], None),
    (['last_status_mentions'], None),
    (['name_length'], None),
    (['source'], LabelBinarizer()),
    (['status_isretweet'], None),
    (['status_possibly_sensitive'], None),
    #(['statuses_count'], None),
    (['string_entropy'], None),
    #(['tweets_per_day'], None),
    (['verified'], None)
])

# Include the rf model in a pipeline with preprocessing
pipeline = Pipeline([
    ('data_map', df_mapper),
    ('random', RandomForestClassifier(n_estimators=10, max_depth=5, n_jobs=-1))
])
pmml_pipeline = PMMLPipeline([
    ("mapper", pipeline.named_steps['data_map']),
    ("classifier", pipeline.named_steps['random'])
])

sklearn2pmml(pmml_pipeline, "scramble.pmml", with_repr=True)

from jpmml_evaluator import make_evaluator

# pmml_file_path: path to the exported PMML file (defined elsewhere)
model_pmml = make_evaluator(pmml_file_path) \
    .verify()

probabilities_python = pipeline.predict_proba(X_test)
probabilities_pmml = model_pmml.evaluateAll(X_test)
# Print the class-1 probability from scikit-learn next to the one from jpmml
for i in range(len(X_test)):
    print(probabilities_python[i][1], probabilities_pmml['probability(1)'][i])

Python library versions:
dill==0.3.8
joblib==1.4.2
jpmml_evaluator==0.10.2
JPype1==1.5.0
numpy==1.26.4
packaging==24.1
pandas==2.2.2
py4j==0.10.9.7
pyjnius==1.6.1
python-dateutil==2.9.0.post0
pytz==2024.1
scikit-learn==1.5.0
scipy==1.14.0
setuptools==70.1.1
six==1.16.0
sklearn-pandas==2.2.0
sklearn2pmml==0.109.0
threadpoolctl==3.5.0
tzdata==2024.1

Java classpath:
opencsv-3.10
pmml4s_3-1.0.1
scala3-library_3-3.4.0
scala-library-2.13.12
spray-json_3-1.3.6

Java program used to test pmml4s:

import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.pmml4s.model.Model;

import com.opencsv.CSVReader;

public class PmmlPredictor {

    private Model pmmlModel;
    private List<Double> predictions;

    public PmmlPredictor() {
        predictions = new ArrayList<>();
    }

    public boolean loadModel(String modelFilename) {
        try {
            pmmlModel = Model.fromFile(modelFilename);
            return true;
        } catch (Exception e) {
            e.printStackTrace();
            return false;
        }
    }

    public void run(String inputFilename) {
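        // try-with-resources closes the reader even if reading fails
        try (CSVReader reader = new CSVReader(new FileReader(inputFilename))) {
            String[] record;
            // The first row holds the column names
            String[] headers = reader.readNext();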
            while ((record = reader.readNext()) != null) {
                double prediction = predict(headers, record);
                predictions.add(prediction);
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }  

        System.out.println(predictions);
    }

    public double predict(String[] headers, String[] record) {
        Map<String, Object> arguments = new LinkedHashMap<>();
        // Assumes the leading CSV columns line up with the model's input fields
        String[] inputFields = pmmlModel.inputNames();
        for (int i = 0; i < inputFields.length; i++) {
            // Values are passed through as raw CSV strings, with no type conversion
            arguments.put(headers[i], record[i]);
        }
        Map<String, ?> result = pmmlModel.predict(arguments);
        return (double) result.get("probability(1)");
    }

    public static void main(String[] args) {
        String pmmlModel = "scramble.pmml";
        String input = "test.csv";
        PmmlPredictor predictor = new PmmlPredictor();
        if (predictor.loadModel(pmmlModel)) {
            predictor.run(input);
        }
    }
}

I'd be happy to share the PMML model as well but it doesn't look like I can attach it here. Let me know if there's anything else you need from me to sort this out! Thank you in advance!

scorebot commented 3 weeks ago

@zackyenchik Can you please share your model and the five rows of input data with me? You can send them to scorebot#outlook.com. Thanks.

scorebot commented 3 weeks ago

@zackyenchik Thanks for your model and dataset. I can reproduce the issue. It is caused by the input dataset not matching the model exactly. For example, the boolean fields in the dataset contain values like "True" or "False", but the model expects 1.0 or 0.0, because those fields are defined as double in the PMML model:

<DataField name="bot_reference" optype="continuous" dataType="double"/>

The scoring library can't convert those values, so they were all treated as missing. That is why incorrect results were returned.

You will get the same results as Python/jpmml if you convert the "True"/"False" values to 1.0/0.0. We will also enhance the data conversion utility in PMML4S to handle this case automatically.
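
For anyone hitting the same mismatch, the conversion described above could be done on the Java side before calling predict. The following is only a minimal sketch: InputCoercion, coerce, and toArguments are hypothetical names introduced here and are not part of pmml4s. It assumes that no genuinely categorical field uses the literal strings "True" or "False", since any such value would also be converted.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of pmml4s): normalizes raw CSV strings
// before they are handed to Model.predict.
final class InputCoercion {

    private InputCoercion() {}

    // Maps "True"/"False" (any letter case) to 1.0/0.0.
    // Every other value is returned unchanged.
    static Object coerce(String raw) {
        if (raw == null) {
            return null;
        }
        String s = raw.trim();
        if (s.equalsIgnoreCase("true")) {
            return 1.0;
        }
        if (s.equalsIgnoreCase("false")) {
            return 0.0;
        }
        return raw;
    }

    // Builds the argument map from one CSV record, coercing each value.
    // Columns beyond the shorter of the two arrays are ignored.
    static Map<String, Object> toArguments(String[] headers, String[] record) {
        Map<String, Object> arguments = new LinkedHashMap<>();
        int n = Math.min(headers.length, record.length);
        for (int i = 0; i < n; i++) {
            arguments.put(headers[i], coerce(record[i]));
        }
        return arguments;
    }
}
```

In the PmmlPredictor program from the original post, the predict method could then pass InputCoercion.toArguments(headers, record) to pmmlModel.predict instead of the map of raw strings. Converting the boolean columns to 1.0/0.0 when writing test.csv from Python would achieve the same thing.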

zackyenchik commented 3 weeks ago

Ah, that makes sense. Thank you for the quick follow-up!