autodeployai / pmml4s

PMML scoring library for Scala
https://www.pmml4s.org/
Apache License 2.0

SVM Model Score Difference in Python and Java #18

Closed fnc11 closed 2 years ago

fnc11 commented 2 years ago

I have trained an SVM model for an activity recognition task with two classes [static, dynamic]. The original device data has the columns [acc_x, acc_y, acc_z, activity]. I took windows of 100 data points, i.e. 2 seconds of data (device frequency = 50 Hz), so each window has 100 acc_x, 100 acc_y, 100 acc_z and 100 activity values. I extracted the mean and std from these sequences, so the feature list is [mean_x, mean_y, mean_z, std_x, std_y, std_z] and the label is mode(100 activities).
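
The feature extraction described above could look roughly like the following. This is a minimal sketch, not the code used in this issue: the function name `extract_features` is hypothetical, and it assumes non-overlapping windows, a dropped trailing partial window, and the population standard deviation, none of which the post confirms.

```python
from statistics import mean, mode, pstdev

WINDOW = 100  # 2 s of samples at 50 Hz, as described in the post


def extract_features(samples, window=WINDOW):
    """Turn (acc_x, acc_y, acc_z, activity) rows into per-window features.

    Assumptions (not confirmed by the post): windows do not overlap, a
    trailing partial window is dropped, and std is the population std.
    """
    features, labels = [], []
    for start in range(0, len(samples) - window + 1, window):
        chunk = samples[start:start + window]
        xs, ys, zs, acts = zip(*chunk)
        # Feature order: [mean_x, mean_y, mean_z, std_x, std_y, std_z]
        features.append([mean(xs), mean(ys), mean(zs),
                         pstdev(xs), pstdev(ys), pstdev(zs)])
        # The window label is the most frequent activity in the window
        labels.append(mode(acts))
    return features, labels
```

If the training code used the sample standard deviation or overlapping windows instead, the same preprocessing has to be applied to whatever is fed to the PMML model, otherwise the predictions will differ for reasons unrelated to the scoring library.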

Below is the whole procedure to reproduce the issue.

  1. Train an SVM model.
  2. Save it as a pmml file using sklearn2pmml.
  3. Load the model again to check that it was saved properly.
  4. Use this file in Java for predictions.
  5. Compare the predictions from Java with those from Python.

The issue is that the pmml model gives different predictions in Java than the model that was saved.

I am attaching train_seqs, train_labels and test_seqs, test_labels as CSV files. train_ft_seqs.csv test_ft_seqs.csv

#Python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn2pmml import PMMLPipeline, sklearn2pmml

#training an SVM model
clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
clf.fit(tfm_train_seqs, train_labels)
clf.score(tfm_test_seqs, test_labels)
>> 0.9754689754689755

#converting SVM model to pmml model
pipeline = PMMLPipeline([ ('svm_classifier', clf) ])
pipeline.score(tfm_test_seqs, test_labels)
>> 0.9754689754689755
bf_predicted_labels = pipeline.predict(tfm_test_seqs)
sklearn2pmml(pipeline, 'svm_SD.pmml', with_repr = True)

# loading the saved model and checking the difference between predictions, as there is no direct method to calculate the score.
from pypmml import Model
svm_model = Model.fromFile('svm_SD.pmml')
af_predicted_labels = svm_model.predict(tfm_test_seqs)
af_predicted_labels = [ predicted_label[0] for predicted_label in af_predicted_labels]

conflict_ids = list()
for i, bf_label, af_label, actual_label in zip(list(range(len(bf_predicted_labels))), bf_predicted_labels, af_predicted_labels, test_labels):
    if bf_label != af_label:
        conflict_ids.append(i)
print(len(conflict_ids))
>> 0

#Java
import org.pmml4s.model.Model;

Model model = Model.fromFile(Main.class.getClassLoader().getResource("svm_SD.pmml").getFile());
List<Integer> predictedLabels = getBatchPredictions(ftSeqs);

public static List<Integer> getBatchPredictions(List<Double[]> ftSeqs) {
    List<Integer> predictedLabels = new ArrayList<>();
    for (Double[] ftSeq : ftSeqs) {
        Object[] result = model.predict(ftSeq);
        int predictedLabel = ((Long) result[0]).intValue();
        predictedLabels.add(predictedLabel);
    }
    return predictedLabels;
}
// Score in this case was 97.57896424563091, calculated using actual labels and predicted labels.

## Issue
conflict_ids = list()
for i, bf_label, java_label, actual_label in zip(list(range(len(bf_predicted_labels))), bf_predicted_labels, java_predicted_labels, test_labels):
    if bf_label != java_label:
        conflict_ids.append(i)
print(len(conflict_ids))
print(conflict_ids)
>> 18
>> [27, 1486, 1526, 1612, 1935, 2023, 2420, 2725, 3044, 3352, 3379, 4009, 4202, 4918, 4922, 5210, 5233, 5234]

scorebot commented 2 years ago

@fnc11 I can generate the model svm_SD.pmml, but I cannot reproduce the Java issue with the latest code. Here I used the Scala API, which should behave the same as the Java one:

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import scala.io.Source
import org.pmml4s.model.Model

val model = Model.fromFile("svm_SD.pmml")
val src = Source.fromFile("x_test.csv")
// skip the header row and split each record into its feature values
val iter = src.getLines().drop(1).map(_.split(",")).toList
src.close()

// keep only the first output value of each prediction
val result = iter.map(x => {
  model.predict(x)(0)
})

Files.write(Paths.get("java_predicted_labels.csv"), ("prediction" :: result.map(_.toString)).mkString("\n").getBytes(StandardCharsets.UTF_8))

Then I loaded the predictions from the file java_predicted_labels.csv in Python to compare:

import pandas as pd

java_predicted_labels = pd.read_csv('java_predicted_labels.csv')
java_predicted_labels = java_predicted_labels.iloc[:, 0].tolist()
conflict_ids = list()
for i, bf_label, java_label, actual_label in zip(list(range(len(bf_predicted_labels))), bf_predicted_labels, java_predicted_labels, y_test):
    if bf_label != java_label:
        conflict_ids.append(i)
print(len(conflict_ids))
print(conflict_ids)
>> 0
>> []
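
The comparison loop used in this thread can be wrapped in a small helper. The following is only a sketch: `find_conflicts` is a hypothetical name, not part of pmml4s, pypmml or scikit-learn. Unlike the `zip` loop, it raises an error when the two lists differ in length, because `zip` silently stops at the shorter input and would hide missing predictions.

```python
def find_conflicts(expected, actual):
    """Return the indices at which two prediction sequences disagree."""
    if len(expected) != len(actual):
        # zip() would silently truncate, so fail loudly instead
        raise ValueError(
            f"length mismatch: {len(expected)} expected vs {len(actual)} actual"
        )
    return [i for i, (e, a) in enumerate(zip(expected, actual)) if e != a]
```

Calling `find_conflicts(bf_predicted_labels, java_predicted_labels)` would give the same list as `conflict_ids` above, provided both sides hold labels of the same type (for example, both integers rather than integers on one side and strings on the other).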

It could be caused by an old version of PMML4S. Sorry, I have only just pushed the latest version, 0.9.13, to the Maven repository. Could you try it?

scorebot commented 2 years ago

@fnc11 Does the latest 0.9.13 work for you?

fnc11 commented 2 years ago

Dear @scorebot,

I have updated the version number in the dependencies, but the model score did not change, so I don't think that fixed it. I am attaching more code from my Java implementation; can you try with this?

Double score = getModelScoreWithFeatures("src/main/resources/test_ft_seqs.csv");
System.out.println(score);

Double getModelScoreWithFeatures(String fileName) {
    List<FeatureSequence> ftSeqs = readCSVSeqs(fileName);
    System.out.println(ftSeqs.get(0));
    List<Integer> groundTruthLabels = new ArrayList<>();
    List<Double[]> fts = new ArrayList<>();
    for(FeatureSequence featureSequence: ftSeqs){
        fts.add(featureSequence.features);
        groundTruthLabels.add(featureSequence.label);
    }

    List<Integer> predictedLabels = getBatchPredictions(fts);
    savePredictions(predictedLabels, "src/main/resources/java_predicted_labels.csv");
    int correct = 0;
    int allSeqs = fts.size();
    for(int i=0;i<allSeqs;i++){
        if(Objects.equals(groundTruthLabels.get(i), predictedLabels.get(i))){
            correct++;
        }
    }

    return 100*((double)correct/allSeqs);
}

List<FeatureSequence> readCSVSeqs(String fileName) {
    List<FeatureSequence> ftSeqs = new ArrayList<>();
    CSVParser parser = new CSVParserBuilder()
            .withSeparator(',')
            .withFieldAsNull(CSVReaderNullFieldIndicator.EMPTY_QUOTES)
            .withIgnoreLeadingWhiteSpace(true)
            .build();
    try {
        CSVReader csvReader = new CSVReaderBuilder(new FileReader(fileName))
                .withSkipLines(1)
                .withCSVParser(parser)
                .build();
        // read all records at once
        List<String[]> records = csvReader.readAll();
        // iterate through list of records
        for (String[] record : records) {
            if (record.length > 0) {
                Double[] dFeatures = new Double[6];
                //                System.out.println(record[1]);
                String[] features = record[0].replace('[', ' ').replace(']', ' ').split("\\s+");
                List<String> validFeatures = new ArrayList<>();
                for(String feature: features){
                    if (!feature.equals("")){
                        validFeatures.add(feature);
                    }
                }
                for (int i = 0; i < 6; i++) {
                    dFeatures[i] = Double.parseDouble(validFeatures.get(i));
                }
                FeatureSequence ftSeq = new FeatureSequence(dFeatures, Integer.parseInt(record[1]));
                ftSeqs.add(ftSeq);
            }
        }
    } catch (IOException | CsvException e) {
        e.printStackTrace();
    }
    return ftSeqs;
}

List<Integer> getBatchPredictions(List<Double[]> ftSeqs) {
    List<Integer> predictedLabels = new ArrayList<>();
    for (Double[] ftSeq : ftSeqs) {
        Object[] result = model.predict(ftSeq);
        int predictedLabel = ((Long) result[0]).intValue();
        predictedLabels.add(predictedLabel);
    }
    return predictedLabels;
}

void savePredictions(List<Integer> predictedLabels, String fileName) {
    // try-with-resources closes the writer even if writing fails
    try (CSVWriter writer = new CSVWriter(new FileWriter(fileName))) {
        for (String[] line : convertToStringArray(predictedLabels)) {
            writer.writeNext(line);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

List<String[]> convertToStringArray(List<Integer> predictedLabels) {
    List<String[]> convertedLabels = new ArrayList<>();
    for (Integer label : predictedLabels) {
        // each prediction becomes a single-column CSV row
        convertedLabels.add(new String[] { label.toString() });
    }
    return convertedLabels;
}

public class FeatureSequence {
    Double[] features;
    Integer label;

    public FeatureSequence(Double[] features, int label) {
        this.features = features;
        this.label = label;
    }

    @Override
    public String toString() {
        return "FeatureSequence{" +
                "features=" + Arrays.toString(features) +
                ", label=" + label +
                '}';
    }
}

Here are the dependencies:

<dependency>
    <groupId>org.pmml4s</groupId>
    <artifactId>pmml4s_2.13</artifactId>
    <version>0.9.13</version>
</dependency>
<dependency>
    <groupId>com.opencsv</groupId>
    <artifactId>opencsv</artifactId>
    <version>5.5.2</version>
</dependency>

scorebot commented 2 years ago

I created a new Java project with the Maven dependencies above and tried your code; the result is correct:

(two images attached)

You need to clean and rebuild your project. BTW, which version of PMML4S was used before 0.9.13?

scorebot commented 2 years ago

@fnc11 Please let me know if you still have a problem.

fnc11 commented 2 years ago

Sorry for the late reply; I got busy with other work.

The issue is still there. I cleaned the project and ran it again, but the score value was still wrong. I am attaching the whole project as a zip file; maybe you can spot the issue. ActivityPredictionSVM.zip

scorebot commented 2 years ago

Oh, it's caused by the exported PMML model: my model is different from yours. I have attached mine so you can try it. svm_SD.pmml.txt
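
One way to see what produced each of two differing PMML files is to read the header. PMML documents carry a `version` attribute on the root element and may carry an `Application` element (with `name` and `version` attributes) inside `Header`. The sketch below uses only the Python standard library; `pmml_producer` is a hypothetical helper name and the sample document is made up for illustration, it is not the model from this issue.

```python
import io
import xml.etree.ElementTree as ET


def pmml_producer(source):
    """Return (pmml_version, application_name, application_version).

    `source` is a file path or a binary file object. Values that are
    missing from the document come back as None.
    """
    root = ET.parse(source).getroot()
    app_name = app_version = None
    for element in root.iter():
        # Tags are namespace-qualified, e.g. "{http://www.dmg.org/PMML-4_4}Application"
        if element.tag.rsplit("}", 1)[-1] == "Application":
            app_name = element.get("name")
            app_version = element.get("version")
            break
    return root.get("version"), app_name, app_version


# Made-up sample document, for illustration only
SAMPLE = b"""<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <Header>
    <Application name="ExampleExporter" version="1.0"/>
  </Header>
</PMML>"""

print(pmml_producer(io.BytesIO(SAMPLE)))
```

Running `pmml_producer("svm_SD.pmml")` on both copies of the model would show whether they were written by different converter versions.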

This is the sklearn2pmml I use:

pip show sklearn2pmml
Name: sklearn2pmml
Version: 0.77.0
Summary: Python library for converting Scikit-Learn pipelines to PMML
Home-page: https://github.com/jpmml/sklearn2pmml
Author: Villu Ruusmann
Author-email: villu.ruusmann@gmail.com
License: GNU Affero General Public License (AGPL) version 3.0
Location: /Users/scorebot/anaconda3/lib/python3.7/site-packages
Requires: scikit-learn, sklearn-pandas, joblib
Required-by:

You are probably using an old version. Please try updating it, then export the model again. My scikit-learn is 0.23.2.
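
To compare environments quickly, the installed versions can also be printed from Python instead of running `pip show` for each package. This is a minimal sketch using `importlib.metadata` (Python 3.8+); `installed_version` is a hypothetical helper name, and the package list is taken from the libraries mentioned in this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names as published on PyPI
for name in ("sklearn2pmml", "scikit-learn", "pypmml"):
    print(name, installed_version(name) or "not installed")
```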

scorebot commented 2 years ago

@fnc11 Did the new PMML model resolve your issue?

scorebot commented 2 years ago

Closing this. If you have other problems, please feel free to open a new issue.