apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

Check interval range to avoid cases where year is inappropriately entered #16945

Open asdf2014 opened 3 weeks ago

asdf2014 commented 3 weeks ago

Description

Apache Druid should support checking the interval range so that an inappropriately entered year is caught before it is used.

Specifically, when dealing with time data, an incorrect year is sometimes entered because of a typo or another mistake, for example 20240 instead of 2024. Such a year can cause significant deviations in data processing and analysis results, affecting the accuracy and reliability of the data.

To avoid such situations, we plan to add an interval range check feature in Apache Druid. This feature will allow users to set a reasonable range for years, such as from the year 2000 to 2100. During data input and processing, the system will automatically check whether the year falls within this range. If a year outside this range is detected, the system will issue a warning or error message, prompting the user to make corrections.

The implementation of this new feature will include the following steps:

  1. Define a reasonable year range: Users can set a reasonable year range through configuration files or the interface.
  2. Data input check: During the data input phase, the system will check whether the year of each data entry falls within the set range.
  3. Data processing check: During the data processing phase, the system will also perform year checks to ensure that the years of all data being processed are within the reasonable range.
  4. Error handling and notification: If a year outside the range is detected, the system will log the error and issue a warning or error message to the user.
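
The four steps above could be sketched roughly as follows. This is a minimal illustration in Python, not Druid code: `YearRange`, `leading_year`, and `check_interval` are hypothetical names, the 2000 to 2100 bounds are the example range from the description, and the sketch assumes each interval is written as `start/end` with both endpoints being ISO 8601 timestamps that begin with `YYYY-`. The year is read with a regular expression rather than a date parser because Python's `datetime` cannot represent years above 9999, so a value like 20240 would fail to parse before it could be range-checked.

```python
import re
from dataclasses import dataclass
from typing import List

# All names here are hypothetical; none of this is an existing Druid API.
_LEADING_YEAR = re.compile(r"\s*\+?(\d+)-")


@dataclass(frozen=True)
class YearRange:
    """Step 1: the configured 'reasonable' range (bounds are assumed defaults)."""
    min_year: int = 2000
    max_year: int = 2100

    def contains(self, year: int) -> bool:
        return self.min_year <= year <= self.max_year


def leading_year(timestamp: str) -> int:
    """Extract the year from the start of an ISO 8601 timestamp such as 2024-01-01T00:00:00Z."""
    match = _LEADING_YEAR.match(timestamp)
    if match is None:
        raise ValueError(f"cannot read a year from {timestamp!r}")
    return int(match.group(1))


def check_interval(interval: str, allowed: YearRange) -> List[str]:
    """Steps 2 to 4: return a list of problems; an empty list means the interval passed."""
    endpoints = interval.split("/")
    if len(endpoints) != 2:
        return [f"malformed interval {interval!r}: expected 'start/end'"]

    problems = []
    for label, endpoint in zip(("start", "end"), endpoints):
        try:
            year = leading_year(endpoint)
        except ValueError as error:
            problems.append(f"{label}: {error}")
            continue
        if not allowed.contains(year):
            problems.append(
                f"{label}: year {year} is outside "
                f"[{allowed.min_year}, {allowed.max_year}]"
            )
    return problems
```

Returning a list of problems instead of raising leaves the caller free to decide whether an out-of-range year is only logged as a warning or rejected as an error, which the description leaves open.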

By introducing this interval range check feature, we can effectively avoid data issues caused by incorrect year entries, enhancing the accuracy and reliability of data processing. This will provide users with higher quality data analysis services, ensuring that their decisions are based on accurate and error-free data.

kfaraz commented 1 week ago

@asdf2014 , we already support validation of intervals:

Do you want to just filter out such records (which is already supported as listed above) or also raise an alert when an out-of-range record is encountered?

asdf2014 commented 6 days ago

Hi @kfaraz, Apache Druid certainly supports checking the dates in the data. This proposal is about checking at the task payload level: on the business side we have had intervals filled out incorrectly, which led to tasks reading a large amount of data from HDFS. That is a different level of checking from the one you mentioned :sweat_smile:
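
A payload-level check of the kind described here could look something like the sketch below, run before a task is submitted. It is an assumption-laden illustration rather than a proposed patch: the function names and the 2000 to 2100 bounds are hypothetical, and it assumes the intervals sit under `spec.dataSchema.granularitySpec.intervals` in the task JSON, written as `start/end` with both endpoints beginning with `YYYY-`. Intervals in any other form (for example `start/period`) are conservatively reported as suspicious.

```python
import re
from typing import List

# Assumed bounds, taken from the example range in the proposal.
MIN_YEAR, MAX_YEAR = 2000, 2100


def payload_intervals(payload: dict) -> List[str]:
    """Collect interval strings from the assumed location in a task payload."""
    spec = payload.get("spec") or {}
    data_schema = spec.get("dataSchema") or {}
    granularity_spec = data_schema.get("granularitySpec") or {}
    return list(granularity_spec.get("intervals") or [])


def endpoint_year_ok(endpoint: str) -> bool:
    """True if the endpoint starts with a year inside [MIN_YEAR, MAX_YEAR]."""
    match = re.match(r"\s*\+?(\d+)-", endpoint)
    return match is not None and MIN_YEAR <= int(match.group(1)) <= MAX_YEAR


def suspicious_intervals(payload: dict) -> List[str]:
    """Return every interval whose start or end year looks mistyped."""
    suspicious = []
    for interval in payload_intervals(payload):
        endpoints = interval.split("/")
        if len(endpoints) != 2 or not all(endpoint_year_ok(e) for e in endpoints):
            suspicious.append(interval)
    return suspicious
```

Because the check only inspects the payload, it can reject a task with an interval such as `20240-01-01/20240-02-01` before any data is read from deep storage.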