elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
69.5k stars 24.6k forks source link

Should model snapshot timestamps be changed to model time? #56076

Open droberts195 opened 4 years ago

droberts195 commented 4 years ago

While looking at #52150 and #51061 I noticed that model snapshot timestamps are set using wall clock time, not model time. All other documents in the ML results indices use the timestamp field to store model time.

This doesn't matter very much for jobs that are genuinely running in real-time. However, it makes integration testing hard, and means retention of model snapshots will still not necessarily be intuitive for jobs that are stopped for a long period of time and then restarted.

I also wonder about implications for the new annotations functionality for model snapshots.

It would have been better if we'd made model_snapshot documents use timestamp for model time and another field, say create_time, for the wall clock time of creation. (Similar to how model_size_stats uses timestamp and log_time.) But given where we are we need to decide if making the change now will cause too much complexity in the BWC. And if we do change, how will we migrate from where we are now to the new format without breaking model snapshot retention?

elasticmachine commented 4 years ago

Pinging @elastic/ml-core (:ml)

droberts195 commented 4 years ago

I also wonder about implications for the new annotations functionality for model snapshots.

https://github.com/elastic/elasticsearch/blob/7c5c9122a72352ecf44512f166cd282db3cf7453/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/job/process/autodetect/output/AutodetectResultProcessor.java#L335-L336 suggests the annotation will be in the wrong place for jobs not running in real-time.

It would have been better if we'd made model_snapshot documents use timestamp for model time and another field, say create_time, for the wall clock time of creation.

We do actually have latest_result_time_stamp in model snapshot documents already. So we do have the relevant information, it's just that the fields are the wrong way around compared to all the other documents in our results index.

We should use latest_result_time_stamp as the timestamp for where the annotation appears in the results views, as that is the point the model would roll back to if it were reverted to the model snapshot corresponding to the annotation. Unless I am mistaken about how this works then this needs fixing for 7.8.

As for tidying up for the future, one possibility would be to introduce the new create_time field storing what's currently in timestamp, and for future model snapshots set timestamp to latest_result_time_stamp if that would be ahead of the value used for timestamp in the model snapshot that the C++ process restored, and create_time if not. That would mean timestamp would be even more complex for a while, but as the years passed it would gradually switch to the same meaning that we have in all our other results documents. (Or we could use log_time instead of create_time to use the same field as model_size_stats and avoid creating yet another field in the index.)

przemekwitek commented 4 years ago

I also wonder about implications for the new annotations functionality for model snapshots.

https://github.com/elastic/elasticsearch/blob/7c5c9122a72352ecf44512f166cd282db3cf7453/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/job/process/autodetect/output/AutodetectResultProcessor.java#L335-L336

suggests the annotation will be in the wrong place for jobs not running in real-time.

Thanks for pointing this out. I've just raised https://github.com/elastic/elasticsearch/pull/56093 to address this issue.

It would have been better if we'd made model_snapshot documents use timestamp for model time and another field, say create_time, for the wall clock time of creation.

The bug emerged because I thought timestamp is model time.

We do actually have latest_result_time_stamp in model snapshot documents already. So we do have the relevant information, it's just that the fields are the wrong way around compared to all the other documents in our results index.

We should use latest_result_time_stamp as the timestamp for where the annotation appears in the results views, as that is the point the model would roll back to if it were reverted to the model snapshot corresponding to the annotation. Unless I am mistaken about how this works then this needs fixing for 7.8.

+1 The fix is implemented, see the link above (marked as >non-issue as the feature has not yet been released).

As for tidying up for the future, one possibility would be to introduce the new create_time field storing what's currently in timestamp, and for future model snapshots set timestamp to latest_result_time_stamp if that would be ahead of the value used for timestamp in the model snapshot that the C++ process restored, and create_time if not. That would mean timestamp would be even more complex for a while, but as the years passed it would gradually switch to the same meaning that we have in all our other results documents. (Or we could use log_time instead of create_time to use the same field as model_size_stats and avoid creating yet another field in the index.)