elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[ML] Add test to reproduce rare feature importance issue #88536

Closed pheyos closed 2 years ago

pheyos commented 2 years ago

In the Kibana functional tests, we have two data frame analytics jobs that sometimes (rarely) don't produce any feature importance values although they usually do. In order to stay focused on UI behavior, we'll adjust the Kibana tests not to fail in that case, but it would be good to cover the same scenario in Elasticsearch integration tests, so that if it fails again it's easier to debug.

Here is what the Kibana tests are doing (leaving out the Kibana-specific data view creation and space sync operations):

Job configuration 1

```
{
  id: 'ihp_fi_binary',
  description: 'binary classification job',
  source: {
    index: ['ft_ihp_outlier'],
    query: { match_all: {} },
  },
  dest: {
    index: 'user-ihp_fi_binary',
    results_field: 'ml_central_air',
  },
  analyzed_fields: {
    includes: [
      'CentralAir',
      'GarageArea',
      'GarageCars',
      'YearBuilt',
      'Electrical',
      'Neighborhood',
      'Heating',
      '1stFlrSF',
    ],
  },
  analysis: {
    classification: {
      dependent_variable: 'CentralAir',
      num_top_feature_importance_values: 5,
      training_percent: 35,
      prediction_field_name: 'CentralAir_prediction',
      num_top_classes: -1,
      max_trees: 10,
    },
  },
  model_memory_limit: '60mb',
  allow_lazy_start: false,
}
```
Job configuration 2

```
{
  id: 'ihp_fi_multi',
  description: 'multi class classification job',
  source: {
    index: ['ft_ihp_outlier'],
    query: { match_all: {} },
  },
  dest: {
    index: 'user-ihp_fi_multi',
    results_field: 'ml_heating_qc',
  },
  analyzed_fields: {
    includes: [
      'CentralAir',
      'GarageArea',
      'GarageCars',
      'Electrical',
      'Neighborhood',
      'Heating',
      '1stFlrSF',
      'HeatingQC',
    ],
  },
  analysis: {
    classification: {
      dependent_variable: 'HeatingQC',
      num_top_feature_importance_values: 5,
      training_percent: 35,
      prediction_field_name: 'heatingqc',
      num_top_classes: -1,
      max_trees: 10,
    },
  },
  model_memory_limit: '60mb',
  allow_lazy_start: false,
}
```
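For reference, a minimal sketch (not part of the original thread) of how the first job could be created and started directly against the Elasticsearch API, assuming the source index `ft_ihp_outlier` already contains the data; all field names and parameters are copied from the config above:

```
PUT _ml/data_frame/analytics/ihp_fi_binary
{
  "description": "binary classification job",
  "source": {
    "index": ["ft_ihp_outlier"],
    "query": { "match_all": {} }
  },
  "dest": {
    "index": "user-ihp_fi_binary",
    "results_field": "ml_central_air"
  },
  "analyzed_fields": {
    "includes": ["CentralAir", "GarageArea", "GarageCars", "YearBuilt",
                 "Electrical", "Neighborhood", "Heating", "1stFlrSF"]
  },
  "analysis": {
    "classification": {
      "dependent_variable": "CentralAir",
      "num_top_feature_importance_values": 5,
      "training_percent": 35,
      "prediction_field_name": "CentralAir_prediction",
      "num_top_classes": -1,
      "max_trees": 10
    }
  },
  "model_memory_limit": "60mb",
  "allow_lazy_start": false
}

POST _ml/data_frame/analytics/ihp_fi_binary/_start
```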
elasticmachine commented 2 years ago

Pinging @elastic/ml-core (Team:ML)

droberts195 commented 2 years ago

Thanks @pheyos. The data file is 1.3MB, which is too big to include in an Elasticsearch integration test. However, it's only 1460 documents, so we should be able to include it if we store the data in the test code in a terse format like CSV. We should also be able to get rid of all the fields that aren't mentioned in the job configs to make the data even smaller.
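As an illustration only (not from the thread): if the rows are kept as CSV in the test code, each parsed row could be indexed with a bulk request that retains just the fields referenced by the job configs. The two documents below use made-up placeholder values:

```
POST ft_ihp_outlier/_bulk
{"index":{}}
{"CentralAir":"Y","GarageArea":548,"GarageCars":2,"YearBuilt":2003,"Electrical":"SBrkr","Neighborhood":"CollgCr","Heating":"GasA","1stFlrSF":856,"HeatingQC":"Ex"}
{"index":{}}
{"CentralAir":"N","GarageArea":0,"GarageCars":0,"YearBuilt":1920,"Electrical":"FuseA","Neighborhood":"OldTown","Heating":"GasA","1stFlrSF":720,"HeatingQC":"TA"}
```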

droberts195 commented 2 years ago

These screenshots show the issue that sometimes occurred in the Kibana test:

[screenshot 1: expected feature importance results] [screenshot 2: "the data is uniform" message, no feature importance]

The first screenshot shows what's supposed to happen and the second shows "the data is uniform" and no feature importance.

If we add an Elasticsearch test case that does exactly the same test then we can switch on debug level logging for the C++ code during that test and hopefully see why it sometimes doesn't produce any feature importance information (and if necessary add more debug level logging to the C++ code).
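Log levels can be raised at runtime through the cluster settings API, so a sketch along these lines could bracket the test; note that the choice of logger name here is an assumption, and the verbosity of the C++ native process may need its own switch:

```
PUT _cluster/settings
{
  "persistent": {
    "logger.org.elasticsearch.xpack.ml": "DEBUG"
  }
}
```

Resetting the setting to `null` afterwards restores the default level.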

valeriy42 commented 2 years ago

We are using about 500 samples out of 1460. It seems that the class CentralAir="Y" is dominating, so if we get very unlucky with our sample, we would always predict CentralAir="Y" and get no feature importance. It would be helpful if the integration test captured, or allowed reproducing, the class distribution in the training sample.

dimitris-athanasiou commented 2 years ago

@valeriy42 We should be preserving the distribution of the classes as we do stratified sampling.

valeriy42 commented 2 years ago

Indeed, but if there are too few minority-class samples in the training set and they carry too little information, the model can decide to predict only the majority class.
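One way to capture the class distribution of the training sample, as suggested above, would be a terms aggregation over the destination index restricted to training documents. This sketch assumes the first job's destination index and results field, that the `is_training` flag is written under the results field as usual, and that `CentralAir` has a `keyword` sub-field (adjust the field name if it is mapped as `keyword` directly):

```
GET user-ihp_fi_binary/_search
{
  "size": 0,
  "query": {
    "term": { "ml_central_air.is_training": true }
  },
  "aggs": {
    "training_class_counts": {
      "terms": { "field": "CentralAir.keyword" }
    }
  }
}
```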

droberts195 commented 2 years ago

It's not good that a classification job run on the Kibana sample data sometimes produces nice results and sometimes doesn't. A user's first impression of our analytics could be one of the runs where it says the data is uniform. What could we change so that classification always produces nice results on the Kibana sample data? We could change some defaults, make our sampling cleverer, adapt our defaults to the size of the input data, or something else. But there must be some improvement we can make.

droberts195 commented 2 years ago

Closed by #89307