Closed elasticmachine closed 6 years ago
Original comment by @makwarth:
FYI @droberts195 @sophiec20 @stevedodson
Original comment by @droberts195:
Is there an API to delete historic ML results? Do we need it? (Probably not)
We have a `results_retention_days` setting in the job config - search for it in https://www.elastic.co/guide/en/elasticsearch/reference/current/ml-job-resource.html
So there's no API to request results be immediately deleted (other than deleting the entire job), but you can control how long after creation the automatic cleanup process will delete them.
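For reference, a sketch of where that setting sits in a job config (the job id, detector, and 30-day value below are illustrative, not from this thread):

```json
PUT _xpack/ml/anomaly_detectors/example-job
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      { "function": "high_mean", "field_name": "transaction.duration.us" }
    ]
  },
  "data_description": { "time_field": "@timestamp" },
  "results_retention_days": 30
}
```

With this in place, results older than 30 days (relative to the latest bucket result) are removed by the nightly maintenance process rather than on demand.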
Original comment by @makwarth:
Got it, thanks @droberts195
Original comment by @makwarth:
@jgowdyelastic Hey, just checking in regarding the added feature of attaching a query to the ML endpoint? (To filter by APM service in the APM index)
Original comment by @jgowdyelastic:
@makwarth PR LINK REDACTED was merged this morning. So it is now possible to override the query object when calling our setup endpoint.
Setup module items:

```json
POST
{
  "prefix": "new_",
  "indexPatternName": "filebeat-*",
  "query": {
    "bool": {
      "filter": [
        { "term": { "fileset.module": "nginx" } },
        { "term": { "fileset.name": "access" } }
      ]
    }
  }
}
```
The prefix specifies optional characters to be prepended to all of the job IDs.
Original comment by @makwarth:
@jgowdyelastic Oh, terrific! Thanks /cc @sqren
Original comment by @makwarth:
Update after call with @blaklaybul re the ML/APM job:
cc @roncohen @formgeist
Original comment by @formgeist:
I've updated my design card with some initial screens and a clickable prototype in InVision. This card is referenced, so perhaps we can close this issue and keep the discussion going in the design card until we start implementation? LINK REDACTED
Original comment by @stevedodson:
@blaklaybul - based on the datasets we've analysed so far, we should validate whether a 1m bucketspan is effective, and understand the limitations of this approach generically.
For example, for the APM data we demoed at Elastic{ON}18, I get these results with a 1m and a 15m bucketspan:
Zooming into the anomaly at 1m on the right-hand side, I get: [screenshot] at 1m and [screenshot] at 15m.
Some comments:
(this also leads into thoughts around baselining response times from different endpoints, in addition to this, to help mitigate the general variance)
Ideally we can experiment with the optimal configuration on a corpus of diverse real data. Until this is available, it may be better to implement a job based on a longer bucketspan - potentially with interim results?
Original comment by @tveasey:
The best approach we have available at the moment for dealing with the effect of event rate variation on the 95th percentile is to use the mean aggregation and pass the summary count field along with the value. This isn't quite right, but we don't have a better option available without having a native percentile function in the backend. Hopefully, appropriate partitioning of the data (into classes of similar requests) should increase the homogeneity of requests and mitigate this sort of problem.
Longer bucket lengths should also help with this problem: they are more likely to contain a representative sample of the population. Also, beware that the 95th percentile will be more susceptible to this than some other statistics, since the outlying requests only need to constitute 5% of the bucket values to move it. You may want to use longer bucket lengths for this statistic than for, say, the median.
Going forward we could also consider creating a new rule type which excludes low count buckets altogether (this will need to be scale invariant, i.e. expressed as a function of mean bucket rate) if this proves to be important for APM. Any feedback here would be useful.
Original comment by @stevedodson:
i.e.

```json
"analysis_config": {
  "bucket_span": <bucket_span>,
  "summary_count_field_name": "doc_count",
  "detectors": [
    {
      "detector_description": "high 95th percentile of transaction duration",
      "function": "high_mean",
      "field_name": "transaction.duration.us"
    }
  ]
}
...
"aggregations": {
  "buckets": {
    "date_histogram": {
      "field": EMAIL REDACTED,
      "interval": <bucket_span>
    },
    "aggregations": {
      "transaction.duration.us": {
        "percentiles": {
          "field": "transaction.duration.us",
          "percents": [95]
        }
      }
    }
  }
}
```
Closing this in favour of https://github.com/elastic/kibana/issues/18569 which summarizes the above discussion.
Original comment by @makwarth:
Updated March 23, 2018 after call with: @jgowdyelastic @blaklaybul @sqren @formgeist
Status
The ML team has created four ML jobs for APM:
- `apm-high_response_time`
- `apm-unusual_errors`
- `apm-unusual_request_rate`
- `apm-unusual_users`
The ML team has also created a known configuration for APM, so it's easy to get started with any of the above jobs. In themselves these jobs are super useful, but this issue focuses on integrating the ML `apm-high_response_time` job in the APM UI. Later we could have UI integrations for all four jobs. On our call, we also discussed the possibility of a fifth job to detect anomalous transactions with regard to their span count, or a shift in which span types account for the most time.
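That fifth job is still only an idea, but as a rough sketch, a detector along these lines could flag transactions with unusual span counts per service (the field names `transaction.span_count.started` and `service.name` are assumptions about the APM schema, not a confirmed mapping):

```json
"analysis_config": {
  "bucket_span": "15m",
  "detectors": [
    {
      "detector_description": "unusually high span count per service",
      "function": "high_mean",
      "field_name": "transaction.span_count.started",
      "partition_field_name": "service.name"
    }
  ]
}
```

Detecting a shift in which span types dominate the time spent would likely need a different approach, e.g. partitioning a duration metric by span type.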
First APM UI / ML integration feature: Dynamic Baselines
We want to integrate with ML to provide users with an (opt-in) dynamic baseline on service response time graphs in the APM UI. This will enable users to tell if the current performance is as expected or abnormal. To enable this feature, platinum users will simply click a button to enable the ML job for the active APM service. The stretch goal is to have this feature done by 6.4.
Mockup of step 1:
Mockup of step 2:
High-level todo
- Use the shared `ml-anomaly` index (instead of a dedicated index)
- Job naming: `<ml-job-name>-apm-<apm-service-name>`

Please comment / add on stuff I've forgotten.
TBD