Open droberts195 opened 4 years ago
Pinging @elastic/ml-core (:ml)
I also wonder about implications for the new annotations functionality for model snapshots.
https://github.com/elastic/elasticsearch/blob/7c5c9122a72352ecf44512f166cd282db3cf7453/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/job/process/autodetect/output/AutodetectResultProcessor.java#L335-L336 suggests the annotation will be in the wrong place for jobs not running in real-time.
It would have been better if we'd made model_snapshot documents use timestamp for model time and another field, say create_time, for the wall clock time of creation.
We do actually have latest_result_time_stamp
in model snapshot documents already. So we do have the relevant information, it's just that the fields are the wrong way around compared to all the other documents in our results index.
We should use latest_result_time_stamp
as the timestamp for where the annotation appears in the results views, as that is the point the model would roll back to if it were reverted to the model snapshot corresponding to the annotation. Unless I am mistaken about how this works then this needs fixing for 7.8.
As for tidying up for the future, one possibility would be to introduce the new create_time
field storing what's currently in timestamp
, and for future model snapshots set timestamp
to latest_result_time_stamp
if that would be ahead of the value used for timestamp
in the model snapshot that the C++ process restored, and create_time
if not. That would mean timestamp
would be even more complex for a while, but as the years passed it would gradually switch to the same meaning that we have in all our other results documents. (Or we could use log_time
instead of create_time
to use the same field as model_size_stats
and avoid creating yet another field in the index.)
I also wonder about implications for the new annotations functionality for model snapshots.
suggests the annotation will be in the wrong place for jobs not running in real-time.
Thanks for pointing this out. I've just raised https://github.com/elastic/elasticsearch/pull/56093 to address this issue.
It would have been better if we'd made model_snapshot documents use timestamp for model time and another field, say create_time, for the wall clock time of creation.
The bug emerged because I thought timestamp
is model time.
We do actually have
latest_result_time_stamp
in model snapshot documents already. So we do have the relevant information, it's just that the fields are the wrong way around compared to all the other documents in our results index.We should use
latest_result_time_stamp
as the timestamp for where the annotation appears in the results views, as that is the point the model would roll back to if it were reverted to the model snapshot corresponding to the annotation. Unless I am mistaken about how this works then this needs fixing for 7.8.
+1
The fix is implemented, see the link above (marked as >non-issue
as the feature has not yet been released).
As for tidying up for the future, one possibility would be to introduce the new
create_time
field storing what's currently intimestamp
, and for future model snapshots settimestamp
tolatest_result_time_stamp
if that would be ahead of the value used fortimestamp
in the model snapshot that the C++ process restored, andcreate_time
if not. That would meantimestamp
would be even more complex for a while, but as the years passed it would gradually switch to the same meaning that we have in all our other results documents. (Or we could uselog_time
instead ofcreate_time
to use the same field asmodel_size_stats
and avoid creating yet another field in the index.)
While looking at #52150 and #51061 I noticed that model snapshot timestamps are set using wall clock time, not model time. All other documents in the ML results indices use the
timestamp
field to store model time.This doesn't matter very much for jobs that are genuinely running in real-time. However, it makes integration testing hard, and means retention of model snapshots will still not necessarily be intuitive for jobs that are stopped for a long period of time and then restarted.
I also wonder about implications for the new annotations functionality for model snapshots.
It would have been better if we'd made
model_snapshot
documents usetimestamp
for model time and another field, saycreate_time
, for the wall clock time of creation. (Similar to howmodel_size_stats
usestimestamp
andlog_time
.) But given where we are we need to decide if making the change now will cause too much complexity in the BWC. And if we do change, how will we migrate from where we are now to the new format without breaking model snapshot retention?