Open asdf2014 opened 3 weeks ago
@asdf2014 , we already support validation of intervals: `lateMessageRejectionPeriod` and `earlyMessageRejectionPeriod`.
Docs: https://druid.apache.org/docs/latest/ingestion/supervisor#io-configuration

You can also filter on the `__time` column in the WHERE clause.
Docs: https://druid.apache.org/docs/latest/querying/filters/#filtering-on-the-timestamp-column

Do you want to just filter out such records (which is already supported as listed above), or also raise an alert when an out-of-range record is encountered?
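For reference, the rejection periods mentioned above are set in the supervisor's `ioConfig`. A minimal sketch for a Kafka supervisor (the topic name and period values here are illustrative, not recommendations):

```json
{
  "type": "kafka",
  "ioConfig": {
    "topic": "events",
    "earlyMessageRejectionPeriod": "PT1H",
    "lateMessageRejectionPeriod": "PT1H"
  }
}
```

With these set, records whose timestamps fall too far before or after the task's time window are dropped at ingestion time.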
Hi @kfaraz , Apache Druid certainly supports validating data timestamps. This proposal is about checking at the task's payload level, because we have encountered errors in the intervals filled in on the business side, which led to reading a large amount of data from HDFS. It is not the same level of checking as what you mentioned :sweat_smile:
Description
In Apache Druid, we propose a new feature that validates the interval range of a task payload, to catch cases where the year is entered incorrectly.
Specifically, when dealing with time data, there are instances where incorrect years are entered due to typos or other reasons. For example, entering the year as 20240 instead of 2024. These incorrect years can lead to significant deviations in data processing and analysis results, affecting the accuracy and reliability of the data.
To avoid such situations, we plan to add an interval range check feature in Apache Druid. This feature will allow users to set a reasonable range for years, such as from the year 2000 to 2100. During data input and processing, the system will automatically check whether the year falls within this range. If a year outside this range is detected, the system will issue a warning or error message, prompting the user to make corrections.
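The check described above could be sketched roughly as follows. This is a minimal illustration, not existing Druid code: the class name, the hard-coded bounds, and the method are all hypothetical, and in a real implementation the bounds would be configurable.

```java
import java.time.Instant;

public class IntervalRangeCheck
{
  // Hypothetical bounds for the allowed interval range (would be configurable).
  static final Instant MIN = Instant.parse("2000-01-01T00:00:00Z");
  static final Instant MAX = Instant.parse("2100-01-01T00:00:00Z");

  /** Returns true when the given interval falls entirely inside the allowed range. */
  public static boolean isWithinRange(Instant start, Instant end)
  {
    return !start.isBefore(MIN) && !end.isAfter(MAX);
  }

  public static void main(String[] args)
  {
    // A normal interval passes the check.
    System.out.println(isWithinRange(
        Instant.parse("2024-01-01T00:00:00Z"),
        Instant.parse("2024-02-01T00:00:00Z")));

    // A typo such as year 20240 instead of 2024 fails the check,
    // so the task could be rejected before any data is read.
    System.out.println(isWithinRange(
        Instant.parse("+20240-01-01T00:00:00Z"),
        Instant.parse("+20240-02-01T00:00:00Z")));
  }
}
```

Such a check would run when the task payload is submitted, so a typo in the interval is rejected with a clear error message instead of triggering a scan over years of data.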
The implementation of this new feature will include the following steps:
By introducing this interval range check, we can avoid data issues caused by incorrect year entries and improve the accuracy and reliability of data processing, ensuring that users' decisions are based on correct data.