This is a follow-up to #18349 (Client side job validation for v6.3), where the general validation framework and initial checks were implemented. This issue tracks follow-up work for job validation.
Framework
[ ] Add a `skip` option to the API endpoint to skip certain checks (see the payload sketch after this list)
[x] Job validation should be more helpful in explaining why checks passed #19068
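A minimal sketch of what a request payload with such a skip option could look like; the `skipChecks` field name and the check ids are made up for illustration and are not part of the current API:

```ts
// Hypothetical payload shape for the validation endpoint; `skipChecks` and the
// check ids are illustrative only, not the actual API.
interface ValidateJobPayload {
  job: object;            // the job configuration to validate
  skipChecks?: string[];  // ids of checks the caller wants skipped
}

const payload: ValidateJobPayload = {
  job: {
    job_id: 'example-job',
    analysis_config: { bucket_span: '15m', detectors: [] },
  },
  skipChecks: ['cardinality', 'bucket_span_estimation'],
};
```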
Checks
[ ] Revisit the cardinality evaluation code and improve the logic that determines the cardinality of `by`/`partition`/`over` fields
[ ] Check the cardinality of fields, including a model memory estimation (see the sketch after this list)
[ ] Check if the bucket span differs significantly from the estimated bucket span
[ ] Check the sparseness of the data
  - if sparse, suggest a sparse-aware function (overlaps with the bucket span estimator, which would suggest a longer bucket span)
[x] If using scripted fields, don't report them as not being aggregatable #21205
[ ] If using scripted fields, warn that it is not possible to display the anomaly charts
[ ] Check for a `summary_count_field`
  - if the metric is a non-zero integer and a `sum` function is used, then perhaps this is actually a `summary_count_field`
[ ] Check for a mix of detectors
  - if the job contains both rare and metric detectors, warn that you might get better results by splitting it into two jobs (tbc, analysis pending)
  - if the job has many different `over` fields, warn that you might get better results by splitting into multiple jobs
[ ] Check if the selected timespan contains any data and/or if there's additional data outside the selected timespan
[ ] Check if index names are suitable for ML analysis (e.g. prefix wildcards)
[ ] Check if summary count field is numeric, see #19114
[ ] Check if both `categorization_filters` and a `categorization_analyzer` are configured. If so, the message could be "Categorization filters are not permitted with a categorization analyzer. Instead add a char_filter within the categorization_analyzer." (see the sketch after this list)
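For the cardinality items above, a rough sketch of how the check could gather per-field cardinality via an aggregation; the `search` callback and the threshold are assumptions, not the existing implementation:

```ts
// Sketch only: evaluate the cardinality of the detectors' split fields.
// `search` stands in for a wrapper around the Elasticsearch search API and the
// threshold is a placeholder, not a real model memory limit.
type SearchFn = (params: { index: string; body: object }) => Promise<any>;

export async function getHighCardinalityFields(
  search: SearchFn,
  index: string,
  splitFields: string[], // by/partition/over fields taken from the detectors
  threshold = 10000      // hypothetical warning threshold
): Promise<string[]> {
  if (splitFields.length === 0) {
    return [];
  }

  // One cardinality aggregation per split field, no hits needed.
  const aggs: Record<string, object> = {};
  splitFields.forEach((field) => {
    aggs[field] = { cardinality: { field } };
  });

  const resp = await search({ index, body: { size: 0, aggs } });

  // Flag fields whose approximate cardinality exceeds the threshold, since high
  // cardinality drives up the number of models and the memory they need.
  return splitFields.filter(
    (field) => (resp.aggregations?.[field]?.value ?? 0) > threshold
  );
}
```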
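For the last item, a minimal sketch of the proposed categorization check; the config and message shapes are assumed for illustration:

```ts
// Sketch only: warn when both categorization_filters and a
// categorization_analyzer are configured. Field names follow the ML job
// schema; the message shape is an assumption.
interface AnalysisConfig {
  categorization_filters?: string[];
  categorization_analyzer?: object;
}

interface ValidationMessage {
  id: string;
  status: 'error' | 'warning' | 'success';
  text: string;
}

export function checkCategorizationConfig(config: AnalysisConfig): ValidationMessage[] {
  const messages: ValidationMessage[] = [];
  if (
    config.categorization_filters &&
    config.categorization_filters.length > 0 &&
    config.categorization_analyzer !== undefined
  ) {
    messages.push({
      id: 'categorization_filters_with_analyzer',
      status: 'error',
      text:
        'Categorization filters are not permitted with a categorization analyzer. ' +
        'Instead add a char_filter within the categorization_analyzer.',
    });
  }
  return messages;
}
```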
Additional Checks (take over from #18074)
[ ] Estimate resource usage (see the sketch below)
  - would be good to do, although it will only be an estimate based on the data seen
  - if there is high cardinality, a low bucket_span, many detectors/influencers, and depending on the function used, we can warn that we expect the job to be a resource intensive one
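A rough sketch of the kind of heuristic this could use; the thresholds below are placeholders rather than a real model memory estimate:

```ts
// Sketch only: flag configurations that are likely to be resource intensive.
// The thresholds are placeholder values, not derived from real measurements.
interface ResourceEstimateInput {
  maxSplitFieldCardinality: number; // highest cardinality among by/partition/over fields
  bucketSpanSeconds: number;
  detectorCount: number;
  influencerCount: number;
}

export function isLikelyResourceIntensive(input: ResourceEstimateInput): boolean {
  const { maxSplitFieldCardinality, bucketSpanSeconds, detectorCount, influencerCount } = input;

  // Roughly one model per split field value and per detector; influencers add
  // further bookkeeping. Short bucket spans mean more buckets to process.
  const approxModelCount =
    Math.max(1, maxSplitFieldCardinality) * Math.max(1, detectorCount) + influencerCount;
  const bucketsPerDay = (24 * 60 * 60) / bucketSpanSeconds;

  return approxModelCount > 10000 || (approxModelCount > 1000 && bucketsPerDay > 288);
}
```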
Finally, we could provide an example of the sort of results to expect. This is already somewhat covered by the simple job wizards but is lacking from the advanced job configuration. We can provide both pictorial and language descriptions of the analysis.
e.g. language descriptions (pseudo config); see the sketch after these examples:

- Models the sum(bytes) for each Host
- Detects unusual behavior for a Host compared to its own past behavior
- Gives greater significance if many Hosts are unusual together

or

- Models the sum(bytes) for the population of Hosts
- Detects unusual behavior for a Host compared to the past behavior of the population
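A small sketch of how such descriptions could be generated from a detector configuration; the detector fields follow the ML job schema, but the function itself and its phrasing are illustrative:

```ts
// Sketch only: turn a detector configuration into the kind of language
// description shown above. The exact phrasing is illustrative.
interface Detector {
  function: string;
  field_name?: string;
  by_field_name?: string;
  over_field_name?: string;
}

export function describeDetector(d: Detector): string[] {
  const metric = d.field_name ? `${d.function}(${d.field_name})` : d.function;

  if (d.over_field_name) {
    return [
      `Models the ${metric} for the population of ${d.over_field_name} values`,
      `Detects unusual behavior for a ${d.over_field_name} compared to the past behavior of the population`,
    ];
  }
  if (d.by_field_name) {
    return [
      `Models the ${metric} for each ${d.by_field_name}`,
      `Detects unusual behavior for a ${d.by_field_name} compared to its own past behavior`,
    ];
  }
  return [`Models the ${metric} over all documents`];
}

// Example:
// describeDetector({ function: 'sum', field_name: 'bytes', by_field_name: 'host' })
// -> ['Models the sum(bytes) for each host',
//     'Detects unusual behavior for a host compared to its own past behavior']
```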