pheyos closed this issue 2 years ago
Pinging @elastic/ml-core (Team:ML)
Thanks @pheyos. The data file is 1.3MB, which is too big to include in an Elasticsearch integration test. However, it's only 1460 documents, so we should be able to include it if we store the data in the test code in a terse format like CSV. We should also be able to get rid of all the fields that aren't mentioned in the job configs to make the data even smaller.
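As a rough illustration of the trimming idea, the sketch below keeps only the fields named in the `analyzed_fields` of the two job configurations in this issue and emits CSV. The helper names (`escapeCsv`, `toCsv`) and the sample rows are hypothetical; the rows are made-up values, not taken from the real data file.

```typescript
// Hypothetical helpers: keep only the listed fields and emit CSV text.
type Doc = Record<string, string | number | null | undefined>;

function escapeCsv(value: string | number | null | undefined): string {
  if (value === null || value === undefined) return '';
  const text = String(value);
  // Quote values that contain a comma, quote, or line break; double embedded quotes.
  return /[",\n\r]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

function toCsv(docs: Doc[], fields: string[]): string {
  const header = fields.map(escapeCsv).join(',');
  const rows = docs.map((doc) => fields.map((f) => escapeCsv(doc[f])).join(','));
  return [header, ...rows].join('\n');
}

// Union of the analyzed_fields of the two job configurations in this issue.
const fields = [
  'CentralAir', 'GarageArea', 'GarageCars', 'YearBuilt', 'Electrical',
  'Neighborhood', 'Heating', '1stFlrSF', 'HeatingQC',
];

// Made-up sample documents (assumption); LotArea stands in for an unused field.
const sampleDocs: Doc[] = [
  { CentralAir: 'Y', GarageArea: 500, GarageCars: 2, YearBuilt: 2000, Electrical: 'A',
    Neighborhood: 'North', Heating: 'Gas', '1stFlrSF': 900, HeatingQC: 'Good', LotArea: 8000 },
  { CentralAir: 'N', GarageArea: 0, GarageCars: 0, YearBuilt: 1920, Electrical: 'B',
    Neighborhood: 'South', Heating: 'Wall', '1stFlrSF': 700, HeatingQC: 'Fair', LotArea: 6000 },
];

const csv = toCsv(sampleDocs, fields);
```

Fields that are not listed, such as `LotArea` here, are dropped from the output, which is where most of the size reduction would come from.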
The screenshots that show the issue that sometimes occurred in the Kibana test are these:
The top image shows what's supposed to happen and the bottom image shows "the data is uniform" and no feature importance.
If we add an Elasticsearch test case that does exactly the same test then we can switch on debug level logging for the C++ code during that test and hopefully see why it sometimes doesn't produce any feature importance information (and if necessary add more debug level logging to the C++ code).
We are using ca. 500 samples out of 1460. It seems that class CentralAir="Y" is dominating, so if we get very unlucky with our sample, we would always predict CentralAir="Y" and get no feature importance. It would be helpful if the integration test captured, or allowed us to reproduce, the class distribution in the training sample.
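A minimal sketch of what capturing the class distribution could look like, assuming the test can get at the dependent-variable values of the training rows. `classCounts` and `describeDistribution` are hypothetical names, not existing test utilities.

```typescript
// Count how often each class label occurs.
function classCounts(labels: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const label of labels) {
    counts[label] = (counts[label] ?? 0) + 1;
  }
  return counts;
}

// Render the distribution as a single log-friendly line, sorted by label.
function describeDistribution(labels: string[]): string {
  const counts = classCounts(labels);
  const total = labels.length;
  return Object.keys(counts)
    .sort((a, b) => (a < b ? -1 : a > b ? 1 : 0))
    .map((label) => {
      const n = counts[label] ?? 0;
      return `${label}=${n} (${((100 * n) / total).toFixed(1)}%)`;
    })
    .join(', ');
}

// training_percent: 35 applied to 1460 documents gives the "ca. 500" above.
const trainingRows = Math.round(1460 * 0.35); // 511
```

Logging the output of `describeDistribution` for the training rows on each run would show whether a failing run had unusually few minority-class rows.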
@valeriy42 We should be preserving the distribution of the classes as we do stratified sampling.
Indeed, but if there are too few minor samples in the training set and they contain too little information, the model can decide to predict the majority class only.
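To make the point concrete, here is a simplified sketch of stratified sampling with a fixed seed. It is not the sampling code used by the ML C++ process; the function names, the seed, and the 100/20 class split are all assumptions chosen for illustration.

```typescript
// Small deterministic generator (a linear congruential generator) so a run
// can be reproduced from its seed. Illustration only.
function lcg(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

// Take the same fraction of rows from every class.
function stratifiedSample<T>(
  rows: T[],
  labelOf: (row: T) => string,
  fraction: number,
  seed: number
): T[] {
  const random = lcg(seed);
  const byClass: Record<string, T[]> = {};
  for (const row of rows) {
    const label = labelOf(row);
    const bucket = byClass[label] ?? [];
    bucket.push(row);
    byClass[label] = bucket;
  }
  const sample: T[] = [];
  for (const label of Object.keys(byClass).sort()) {
    const bucket = byClass[label] ?? [];
    const take = Math.round(bucket.length * fraction);
    // Shuffle by sorting on random keys, then keep the first `take` rows.
    const shuffled = bucket
      .map((row) => ({ row, key: random() }))
      .sort((x, y) => x.key - y.key);
    for (const entry of shuffled.slice(0, take)) sample.push(entry.row);
  }
  return sample;
}

// Hypothetical imbalanced data set: 100 majority rows and 20 minority rows.
interface Row { id: number; label: string; }
const rows: Row[] = [];
for (let i = 0; i < 100; i++) rows.push({ id: i, label: 'Y' });
for (let i = 0; i < 20; i++) rows.push({ id: 100 + i, label: 'N' });

const training = stratifiedSample(rows, (row) => row.label, 0.35, 42);
const minorityInTraining = training.filter((row) => row.label === 'N').length; // 7
```

The class proportions are preserved (35 "Y" and 7 "N" here), but which minority rows are picked still depends on the seed, and a handful of rows may carry too little signal for the model to split on.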
It's not good that when a classification job is run on the Kibana sample data sometimes it produces nice results and sometimes it doesn't. A user's first impressions of our analytics could be one of the situations where it says the data is uniform. What could we change to make classification always produce nice results on the Kibana sample data? It could be that we change some defaults or make our sampling cleverer or make our defaults more dynamic to the size of the input data or something else. But there must be some improvement we can make.
Closed by #89307
In the Kibana functional tests, we have two data frame analytics jobs that sometimes (rarely) don't produce any feature importance, although they usually do. To stay focused on UI behavior, we'll adjust the Kibana tests to not fail in that case, but it would be good to cover the same scenario in Elasticsearch integration tests, so that if it fails again it's easier to debug.
Here is what the Kibana tests are doing (leaving out the Kibana-specific data view creation and space sync operation):
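A sketch of the kind of check such a test could make on the destination index documents. It assumes that each result document carries a `feature_importance` array under the job's `results_field`; the helper names are hypothetical.

```typescript
type ResultDoc = Record<string, unknown>;

// True if the document has a non-empty feature_importance array under the
// given results field (assumed layout of the destination index documents).
function hasFeatureImportance(doc: ResultDoc, resultsField: string): boolean {
  const results = doc[resultsField];
  if (typeof results !== 'object' || results === null) return false;
  const importance = (results as Record<string, unknown>)['feature_importance'];
  return Array.isArray(importance) && importance.length > 0;
}

// Fraction of documents that carry any feature importance at all.
function fractionWithFeatureImportance(docs: ResultDoc[], resultsField: string): number {
  if (docs.length === 0) return 0;
  const hits = docs.filter((doc) => hasFeatureImportance(doc, resultsField)).length;
  return hits / docs.length;
}
```

A fraction of 0 across all result documents would correspond to the failure case described here.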
`ihp_outlier` data archive (loads data into the `ft_ihp_outlier` index using the `esArchiver` tool, which basically just creates the index with the provided mappings file and then loads the data file (newline-separated JSON blobs, not NDJSON))

Job configuration 1
```
{
  id: 'ihp_fi_binary',
  description: 'binary classification job',
  source: {
    index: ['ft_ihp_outlier'],
    query: { match_all: {} },
  },
  dest: {
    index: 'user-ihp_fi_binary',
    results_field: 'ml_central_air',
  },
  analyzed_fields: {
    includes: [
      'CentralAir',
      'GarageArea',
      'GarageCars',
      'YearBuilt',
      'Electrical',
      'Neighborhood',
      'Heating',
      '1stFlrSF',
    ],
  },
  analysis: {
    classification: {
      dependent_variable: 'CentralAir',
      num_top_feature_importance_values: 5,
      training_percent: 35,
      prediction_field_name: 'CentralAir_prediction',
      num_top_classes: -1,
      max_trees: 10,
    },
  },
  model_memory_limit: '60mb',
  allow_lazy_start: false,
}
```

Job configuration 2
```
{
  id: 'ihp_fi_multi',
  description: 'multi class classification job',
  source: {
    index: ['ft_ihp_outlier'],
    query: { match_all: {} },
  },
  dest: {
    index: 'user-ihp_fi_multi',
    results_field: 'ml_heating_qc',
  },
  analyzed_fields: {
    includes: [
      'CentralAir',
      'GarageArea',
      'GarageCars',
      'Electrical',
      'Neighborhood',
      'Heating',
      '1stFlrSF',
      'HeatingQC',
    ],
  },
  analysis: {
    classification: {
      dependent_variable: 'HeatingQC',
      num_top_feature_importance_values: 5,
      training_percent: 35,
      prediction_field_name: 'heatingqc',
      num_top_classes: -1,
      max_trees: 10,
    },
  },
  model_memory_limit: '60mb',
  allow_lazy_start: false,
}
```
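For reuse outside Kibana, note that the Elasticsearch create data frame analytics job API takes the job ID in the request path rather than in the body. The sketch below splits a config of the shape shown above into path and body; `toPutRequest` and the `JobConfig` type are hypothetical names, and the example config is abbreviated.

```typescript
interface JobConfig {
  id: string;
  [key: string]: unknown;
}

interface PutRequest {
  method: string;
  path: string;
  body: Record<string, unknown>;
}

// Move the job ID from the config object into the request path.
function toPutRequest(config: JobConfig): PutRequest {
  const { id, ...body } = config;
  return {
    method: 'PUT',
    path: `/_ml/data_frame/analytics/${encodeURIComponent(id)}`,
    body,
  };
}

// Abbreviated version of job configuration 1, for illustration only.
const request = toPutRequest({
  id: 'ihp_fi_binary',
  source: { index: ['ft_ihp_outlier'], query: { match_all: {} } },
  dest: { index: 'user-ihp_fi_binary', results_field: 'ml_central_air' },
  analysis: { classification: { dependent_variable: 'CentralAir', training_percent: 35 } },
});
```

The resulting `request.path` is `/_ml/data_frame/analytics/ihp_fi_binary`, and `request.body` contains everything except `id`.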