elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Add back support for negative epoch values for time fields in Elasticsearch #40983

Closed: hanoch closed this issue 5 years ago

hanoch commented 5 years ago

Testing with Elasticsearch v7.0.0-rc2, we can no longer write negative date epoch long values using the epoch_millis date field format. This means that any date value before 1970-01-01 is a problem. We are seeing the following error message: ...failed to parse date field [-12321] with format [epoch_millis]...

With the following GA versions of Elasticsearch we were, and still are, able to write documents with negative date epoch long values using the epoch_millis date field format: v1.6.x, v1.7.x, v2.3.2, v5.5.0, v5.6.5, v6.4.2, v6.6.1, v6.6.2, v6.7.0, v6.7.1.
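As a minimal Java sketch (not part of the original report) of where such negative values come from, using an instant chosen to match the -12321 in the error message:

```java
import java.time.Instant;

public class NegativeEpochExample {
    public static void main(String[] args) {
        // An instant 12.321 seconds before 1970-01-01T00:00:00Z ...
        Instant beforeEpoch = Instant.parse("1969-12-31T23:59:47.679Z");
        // ... has a negative epoch-millisecond value, the kind of value v7.0.0-rc2 now rejects.
        System.out.println(beforeEpoch.toEpochMilli()); // -12321
    }
}
```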

We are aware of the workaround: switch to the ISO 8601 string representation of the date values and set the field to use that date format rather than epoch_millis. However, we are concerned about having to change our code bases and products from epoch long values to ISO 8601 string values because of:

  1. The performance hit of converting between Java/Scala Date/Instant objects and ISO 8601 strings. Keeping the Java/Scala implementation on epoch longs is more performant than having to parse (and generate) strings (see the sketch after this list).
  2. We will need to change our implementations and APIs everywhere we write, read, or query date field values (see also #39375).
  3. We will need to write new special logic to upgrade existing users' data that was created by an older version of Elasticsearch (e.g. v6.4.2, v5.5.0).
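For illustration of point 1, a hedged sketch of the extra conversion the workaround would force on every write and read (the helper names are made up; the -5718 value is taken from the error log below):

```java
import java.time.Instant;
import java.time.format.DateTimeFormatter;

// Hypothetical helpers: today the epoch long is sent to Elasticsearch as-is; with the
// ISO 8601 workaround every write formats a string and every read parses one back.
public class EpochVsIsoExample {

    static String toIsoForIndexing(long epochMillis) {
        return DateTimeFormatter.ISO_INSTANT.format(Instant.ofEpochMilli(epochMillis));
    }

    static long fromIsoAfterQuery(String isoValue) {
        return Instant.parse(isoValue).toEpochMilli();
    }

    public static void main(String[] args) {
        long intervalStart = -5718L;                  // pre-1970 value from the error log
        String iso = toIsoForIndexing(intervalStart); // "1969-12-31T23:59:54.282Z"
        System.out.println(iso + " -> " + fromIsoAfterQuery(iso)); // round-trips to -5718
    }
}
```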

Relates to:

Logs:

Error Message:

Insert operation failed. Error: "WrappedArray([UpsertFieldValue:null]:Map(type -> mapper_parsing_exception, reason -> failed to parse field [IntervalStart] of type [date] in document with id 'XTYw5WkBP1AXtllM8BvD', caused_by -> Map(type -> illegal_argument_exception, reason -> failed to parse date field [-12321] with format [epoch_millis], caused_by -> Map(type -> date_time_parse_exception, reason -> Failed to parse with all enclosed parsers))),[UpsertFieldValue:null]:Map(type -> mapper_parsing_exception, reason -> failed to parse field [IntervalStart] of type [date] in document with id 'XjYw5WkBP1AXtllM8BvD', caused_by -> Map(type -> illegal_argument_exception, reason -> failed to parse date field [-5718] with format [epoch_millis], caused_by -> Map(type -> date_time_parse_exception, reason -> Failed to parse with all enclosed parsers))))".

Stacktrace:

com.esri.rf.HttpException: Insert operation failed. Error: "WrappedArray([UpsertFieldValue:null]:Map(type -> mapper_parsing_exception, reason -> failed to parse field [IntervalStart] of type [date] in document with id 'XTYw5WkBP1AXtllM8BvD', caused_by -> Map(type -> illegal_argument_exception, reason -> failed to parse date field [-12321] with format [epoch_millis], caused_by -> Map(type -> date_time_parse_exception, reason -> Failed to parse with all enclosed parsers))),[UpsertFieldValue:null]:Map(type -> mapper_parsing_exception, reason -> failed to parse field [IntervalStart] of type [date] in document with id 'XjYw5WkBP1AXtllM8BvD', caused_by -> Map(type -> illegal_argument_exception, reason -> failed to parse date field [-5718] with format [epoch_millis], caused_by -> Map(type -> date_time_parse_exception, reason -> Failed to parse with all enclosed parsers))))". at com.esri.bds.dialect.elasticsearch.editor.MultiPointEditorDialectTester.addFeatures(EditorDIalectTester.scala:49)

Standard Output:

2019-04-04T06:53:03,470 | INFO | c.e.b.d.e.e.MultiPointEditorDialectTester:342 | before test 2019-04-04T06:53:03,703 | ERROR | c.e.a.b.e.HttpBDSClient$:2167 | Data source "Wind_Forecast" failed to create or update new documents. Failed to execute the bulk request. Error: {"took":15,"errors":true,"items":[{"index":{"_index":"esri-ds-data_5da50ee9-9c9b-4352-93a0-919cf645d577_es-6-4-2_20190403","_type":"_doc","_id":"WjYw5WkBP1AXtllM8BvD","_version":1,"result":"created","_shards":{"total":1,"successful":1,"failed":0},"_seq_no":141,"_primary_term":1,"status":201}},{"index":{"_index":"esri-ds-data_5da50ee9-9c9b-4352-93a0-919cf645d577_es-6-4-2_20190403","_type":"_doc","_id":"WzYw5WkBP1AXtllM8BvD","_version":1,"result":"created","_shards":{"total":1,"successful":1,"failed":0},"_seq_no":142,"_primary_term":1,"status":201}},{"index":{"_index":"esri-ds-data_5da50ee9-9c9b-4352-93a0-919cf645d577_es-6-4-2_20190403","_type":"_doc","_id":"XDYw5WkBP1AXtllM8BvD","_version":1,"result":"created","_shards":{"total":1,"successful":1,"failed":0},"_seq_no":143,"_primary_term":1,"status":201}},{"index":{"_index":"esri-ds-data_5da50ee9-9c9b-4352-93a0-919cf645d577_es-6-4-2_20190403","_type":"_doc","_id":"XTYw5WkBP1AXtllM8BvD","status":400,"error":{"type":"mapper_parsing_exception","reason":"failed to parse field [IntervalStart] of type [date] in document with id 'XTYw5WkBP1AXtllM8BvD'","caused_by":{"type":"illegal_argument_exception","reason":"failed to parse date field [-12321] with format [epoch_millis]","caused_by":{"type":"date_time_parse_exception","reason":"Failed to parse with all enclosed parsers"}}}}},{"index":{"_index":"esri-ds-data_5da50ee9-9c9b-4352-93a0-919cf645d577_es-6-4-2_20190403","_type":"_doc","_id":"XjYw5WkBP1AXtllM8BvD","status":400,"error":{"type":"mapper_parsing_exception","reason":"failed to parse field [IntervalStart] of type [date] in document with id 'XjYw5WkBP1AXtllM8BvD'","caused_by":{"type":"illegal_argument_exception","reason":"failed to parse date field [-5718] with format [epoch_millis]","caused_by":{"type":"date_time_parse_exception","reason":"Failed to parse with all enclosed parsers"}}}}}]} 2019-04-04T06:53:03,705 | INFO | c.e.b.d.e.e.MultiPointEditorDialectTester:371 | after test Standard Error REPRODUCE WITH: gradlew null -Dtests.seed=FEC706F81CEFF67D -Dtests.class=com.esri.bds.dialect.elasticsearch.editor.MultiPointEditorDialectTester -Dtests.method="deleteFeaturesByExtent" -Dtests.security.manager=false -Dtests.locale=sr-RS -Dtests.timezone=Asia/Yakutsk

randallwhitman commented 5 years ago

To add to what Hanoch posted, there is a lot of historical data that one might want to store. One tiny example could be:

Name              Date
Charles Babbage   1791-12-26
Ada Lovelace      1815-12-10

As an example at a much larger scale, an organization could very well be interested in extending a dataset like GDELT farther into the past.

jasontedor commented 5 years ago

To add to what Hanoch posted, there is a lot of historical data that one might want to store.

To clarify, you're not prevented from indexing historical data, only that we did not build support for negative timestamps into epoch_millis.

As an example at a much larger scale, an organization could very well be interested in extending a dataset like GDELT farther into the past.

I could be misreading the source material, but it doesn't appear to me that GDELT uses epoch_millis; rather, it uses a rudimentary YYYYMMDD format. Again, the fact that epoch_millis does not allow negative timestamps does not imply that dates before 1970 (the epoch) are not supported.
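As a hedged illustration of that distinction (not from the thread): a pre-1970 date expressed as a string parses fine, even though its epoch-millisecond equivalent is negative:

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class PreEpochDateExample {
    public static void main(String[] args) {
        // A GDELT-style yyyyMMdd value well before 1970 parses without trouble.
        LocalDate babbage = LocalDate.parse("17911226", DateTimeFormatter.ofPattern("yyyyMMdd"));
        System.out.println(babbage); // 1791-12-26

        // Its epoch-millis equivalent is negative, which is the one representation
        // the epoch_millis format no longer accepts.
        System.out.println(babbage.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli());
    }
}
```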

dsshane commented 5 years ago

@jasontedor For existing negative epoch_millis values, will the reindex API automatically do the conversion when migrating from 6.x to 7.x?

elasticmachine commented 5 years ago

Pinging @elastic/es-search

hanoch commented 5 years ago

To clarify, you're not prevented from indexing historical data, only that we did not build support for negative timestamps into epoch_millis.

To clarify: support for negative timestamps in the epoch_millis format was always there, and was removed with the v7.0.0 RC.

As I mentioned in the issue description: we are aware of the workaround to switch to the ISO 8601 string representation of the date values and set the field to use that date format rather than epoch_millis, but we are concerned about having to change our code bases and products from epoch long values to ISO 8601 string values because of:

  1. The performance hit of converting between Java/Scala Date/Instant objects and ISO 8601 strings. Keeping the Java/Scala implementation on epoch longs is more performant than having to parse (and generate) strings.
  2. We will need to change our implementations and APIs everywhere we write, read, or query date field values (see also #39375).
  3. We will need to write new special logic to upgrade existing users' data that was created by an older version of Elasticsearch (e.g. v6.4.2, v5.5.0).

randallwhitman commented 5 years ago

To clarify, you're not prevented from indexing historical data, only that we did not build support for negative timestamps into epoch_millis.

In addition to merely storing it, we are interested in the performance of date handling.

I could be misreading the source material, but it doesn't appear to me that GDELT uses epoch_millis.

What the GDELT project itself does is immaterial to using it as an illustration of something an organization may want to do, with performant date handling.

jasontedor commented 5 years ago

What the GDELT project itself does is immaterial to using it as an illustration of something an organization may want to do, with performant date handling.

It's material insofar as this issue is about negative timestamps and epoch_millis, and neither of the examples given is about negative timestamps based on epoch_millis. Without clarifying that, they carried the risk of confusing others into thinking that historical dates before the epoch are not supported, as the comment seemed to suggest. I am trying to make it clear that while support for negative timestamps has been deprecated in 6.7.0 and removed, that does not imply that dates before the epoch are not supported: they are very much still supported. On 7.0.0-rc2, both of the examples given, in the formats they are given in, are fully supported.

hanoch commented 5 years ago

If we are to switch our implementation for newly created indices at Elasticsearch v7.x to use the ISO 8601 string format for date field values, what is your recommendation on how to define the date field format in the index (template) mapping?

We are thinking of removing the date format from the index (template) mapping so that it defaults to the default "strict_date_optional_time||epoch_millis" value, but we are looking for some kind of confirmation that this approach is the correct and recommended one from both a performance and a functionality (support for ISO 8601 string date values as well as positive epoch_millis long date values) point of view.

Is the ISO 8601 string date format as performant as the epoch_millis long date format, from an indexing and query/search-time point of view?

rjernst commented 5 years ago

Should we not define the format for date fields to allow the format to default to the default format "strict_date_optional_time||epoch_millis"?

Based on your later comment, it seems you are planning to sometimes pass epoch millis, and sometimes pass ISO 8601 dates? If that is the case, and epoch millis would be used for most dates (all dates after 1970), you might consider still setting the format, but reversing the order so that epoch_millis is checked first.
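A hedged sketch of that suggestion, assuming the low-level Java REST client; the index and field names are illustrative:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class ReversedDateFormatExample {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // Keep both formats, but list epoch_millis first since most values are epoch longs.
            Request createIndex = new Request("PUT", "/wind-forecast");
            createIndex.setJsonEntity(
                "{\"mappings\":{\"properties\":{"
                    + "\"IntervalStart\":{\"type\":\"date\","
                    + "\"format\":\"epoch_millis||strict_date_optional_time\"}}}}");
            client.performRequest(createIndex);
        }
    }
}
```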

Should we set the date field format to another specific format to be able to use ISO 8601 string date values in the most performant index and search way?

We already recommend doing your own performance tests, as each use case is different. Switching the ordering of formats as suggested above could help with performance depending on the workload.

Regarding nano-second precision - will the default format allow us to use nano-second date values or is it a completely different field type ("date_nanos")?

Correct, the date_nanos field type is required to use nanosecond precision. You can, however, set date formats on the date_nanos field type just as with the date type.

hanoch commented 5 years ago

Some clarifications:

Clarification 1

Based on your later comment, it seems you are planning to sometimes pass epoch millis, and sometimes pass ISO 8601 dates? If that is the case, and epoch millis would be used for most dates (all dates after 1970), you might consider still setting the format, but reversing the order so that epoch_millis is checked first.

If we are to always only pass ISO 8601 dates (since we are not sure if the data might or might not have dates before 1970), what would be your recommendation:

  1. Should we leave the format for date fields undefined so that it defaults to the default "strict_date_optional_time||epoch_millis" format, or
  2. Set the format to "strict_date_optional_time" (is that the most performant format that supports the ISO 8601 string date values?), or
  3. Set the format to some other, better-performing recommended format that supports the ISO 8601 string date values?

Clarification 2

For indices where we know we don't need to support date values before 1970, which format is more performant: epoch_millis or the ISO 8601 string format (strict_date_optional_time?)?

rjernst commented 5 years ago

I can't give any concrete recommendation based on performance, as I have not done any performance tests of these date formats. It is hard to say whether either would be any faster than the other in practice. It would depend heavily on things like ingestion rate, document size, number of mapped fields, etc. This is why we recommend doing your own performance tests.

As for whether to set one or many formats, that completely depends on whether you will use multiple formats. As for the order, I would put the most commonly used format first.

elasticmachine commented 5 years ago

Pinging @elastic/es-core-infra

Jamalarm commented 5 years ago

This issue has also hit us. We usually store epoch_millis values well after 1970, but an app user of ours decided to change their device clock to 1920 for some reason. This change is a real problem for us, as we want to store and analyse these outliers without having to write a bunch of extra code around them.

  1. We now have to write logic on all of our date fields so that "if the incoming value is negative, store it as an ISO string rather than a long". This is messy and pollutes the field types of our POJOs (they can't just be long values anymore; they need to somehow store both options).
  2. Similarly, on the reading side, we have to do the opposite: "if this date field comes back as a string, convert it back to a long", in order to keep our API consistent for our consumers.

Given that the dates are stored as longs internally anyway, it seems really silly to have to do all this hacking on the outside for older dates. A workaround could be to simply always use strings, but that would add the overhead of constantly printing and parsing ISO datetime strings, which is not needed.
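For illustration, the two pieces of glue code described above might look roughly like this (a hypothetical sketch, not actual application code):

```java
import java.time.Instant;
import java.time.format.DateTimeFormatter;

// Hypothetical write/read shims around a date field that is sometimes pre-1970.
public class NegativeEpochShim {

    // Write side: negative epoch millis must be turned into an ISO string,
    // so the indexed value is no longer always a long.
    static Object toIndexValue(long epochMillis) {
        if (epochMillis < 0) {
            return DateTimeFormatter.ISO_INSTANT.format(Instant.ofEpochMilli(epochMillis));
        }
        return epochMillis;
    }

    // Read side: a value that comes back as a string has to be parsed back to a long
    // so the API stays consistent for consumers.
    static long fromIndexValue(Object stored) {
        if (stored instanceof String) {
            return Instant.parse((String) stored).toEpochMilli();
        }
        return ((Number) stored).longValue();
    }

    public static void main(String[] args) {
        Object indexed = toIndexValue(-1_577_923_200_000L); // a 1920 timestamp, as in the comment above
        System.out.println(indexed + " -> " + fromIndexValue(indexed));
    }
}
```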

Is there any chance we could get this regression addressed?

hanoch commented 5 years ago

What is the status of this issue in the 7.3.0 release?

jasontedor commented 5 years ago

@hanoch There is no change in the 7.3.0 release.

rjernst commented 5 years ago

@hanoch @Jamalarm We really appreciate the feedback on date usage; it is always welcome. Early feedback especially helps us prevent surprising changes that slip through without proper communication. In this case, we probably should have done a better job before 7.0 identifying this as a potential pain point for upgrade, and we are sorry for that.

I realized in reading back through this issue that I had never given a proper explanation of what caused the change in the first place. The underlying issue is a difference in how Joda and Java 8 represent instants. Joda internally represented these as milliseconds, while Java 8 represents epoch seconds and fractions separately, where the fraction is always positive. Due to how Java 8 date parsers work, these values are parsed independently, so we have no (easy) way of keeping track of negative fractional time (less than one second before the epoch), because the negative sign is lost from the 0 seconds. When originally working on the migration to Java 8 time, I had an idea for how to work around this, but it would require several hundred lines of code, and many more for tests as well.
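The representational difference described above can be seen directly on java.time.Instant (a small illustration using the -12321 value from the original report):

```java
import java.time.Instant;

public class InstantSplitExample {
    public static void main(String[] args) {
        // Joda kept this instant as a single signed millisecond value: -12321.
        Instant i = Instant.ofEpochMilli(-12321L);

        // java.time splits it into whole seconds plus an always-positive nanosecond fraction,
        // so the minus sign is carried only by the seconds component.
        System.out.println(i.getEpochSecond()); // -13
        System.out.println(i.getNano());        // 679000000
    }
}
```

A parser that reads the seconds and the fraction as independent fields therefore has nowhere to keep the minus sign once the seconds value is 0, which is the edge case described above.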

While we understand that such an edge case can be frustrating, we also think negative epoch values in general are uncommon. Since ISO date times (and other formats) still work to represent dates both before and after the epoch, we don't think the overhead of that additional code is worth the maintenance complexity. Thus, we do not plan to make any change to this behavior. Given that, I hope you don't mind that I close this issue.


hanoch commented 5 years ago

@rjernst So according to your explanation above, the only limitation of using epoch_millis values is for values that represent milliseconds within the last second before the epoch (the last second of 1969 UTC). These values are blocked from being indexed, but all other positive and negative epoch_millis values are allowed. Please confirm.

rjernst commented 5 years ago

Sorry for the confusion @hanoch. All negative values are disallowed, because of this edge case in negative handling.

jmaasing commented 4 years ago

I got hit by this. We have epoch millis in our source data, representing points in time before 1970. Nowhere in

https://www.elastic.co/guide/en/elasticsearch/reference/current/date.html

is it mentioned that negative epoch millis values throw errors :-( Could the documentation maybe be improved?

Ahnfelt commented 3 years ago

While we understand that such an edge case can be frustrating, we also think negative epoch values in general are uncommon.

I would just like to suggest that epoch values are likely the most common representation of UTC dates (Java System.currentTimeMillis(), JavaScript Date.now()), and dates before 1970 are also quite common.

It seems non-obvious that you have to choose a different representation for dates before 1970.