elastic / ml-cpp

Machine learning C++ code

[ML] Check inconsistent computation of number of top feature importance values #994

Closed valeriy42 closed 4 years ago

valeriy42 commented 4 years ago

The number of top feature importance values reported may be incorrect.

If num_top_feature_importance_values is set to 4, only 3 values are returned; if it is set to 10, 6 are returned; and if it is set to 7, 4 are returned.
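
One way to check this on the destination index is to count the feature importance entries per document and compare the count with the configured limit. The sketch below is only illustrative: it assumes the entries are stored as a list under `<results_field>.feature_importance`, which may differ between versions. The helper names and the sample document, including its feature names and values, are made up for the example.

```python
from typing import Any, Iterable, List, Mapping, Tuple


def count_feature_importance(doc: Mapping[str, Any], results_field: str = "ml") -> int:
    """Count the feature importance entries in one destination document.

    Assumption: entries live in a list at doc[results_field]["feature_importance"].
    Adjust the lookup if your version stores them differently.
    """
    results = doc.get(results_field, {})
    return len(results.get("feature_importance", []))


def docs_below_limit(
    docs: Iterable[Mapping[str, Any]], limit: int, results_field: str = "ml"
) -> List[Tuple[int, int]]:
    """Return (position, count) for every document with fewer than `limit` entries."""
    below = []
    for position, doc in enumerate(docs):
        count = count_feature_importance(doc, results_field)
        if count < limit:
            below.append((position, count))
    return below


# Hypothetical document; the feature names and values are invented.
sample_doc = {
    "ml": {
        "class_prediction": "no-recurrence-events",
        "feature_importance": [
            {"feature_name": "feature_a", "importance": 0.42},
            {"feature_name": "feature_b", "importance": -0.17},
            {"feature_name": "feature_c", "importance": 0.05},
        ],
    }
}

print(count_feature_importance(sample_doc))
print(docs_below_limit([sample_doc], limit=4))
```

A document that shows up in `docs_below_limit` is not necessarily a bug. It only flags where the reported count is lower than the requested `num_top_feature_importance_values`.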

Job config

{
  "id": "dfa_breast-cancer-recurrence_1580826608_000_0",
  "source": {
    "index": [
      "breast-cancer-recurrence-classification"
    ],
    "query": {
      "match_all": {}
    }
  },
  "dest": {
    "index": "dest_breast_cancer_80_1580844608808",
    "results_field": "ml"
  },
  "analysis": {
    "classification": {
      "dependent_variable": "class",
      "num_top_feature_importance_values": 7,
      "num_top_classes": 2,
      "prediction_field_name": "class_prediction",
      "training_percent": 80,
      "randomize_seed": 80
    }
  },
  "model_memory_limit": "1gb",
  "create_time": 1580844612096,
  "version": "8.0.0",
  "allow_lazy_start": false
}

valeriy42 commented 4 years ago

I checked on stack build 78 and could not reproduce the problem.

valeriy42 commented 4 years ago

I checked with QA; the code behaves as expected.