apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0
5.2k stars 1.21k forks source link

modify TimeValidationTransformer to mark rows as invalid in case primary time column is out of range #11907

Closed gortiz closed 7 months ago

gortiz commented 7 months ago

As said in the title, this PR modifies TimeValidationTransformer to mark rows as invalid in case primary time column is out of range.

TimeValidationTransformer has always verified that the received row contains a value for the primary time column that is between 1971 (inclusive) and 2071 (exclusive). In case the value is outside this range, TimeValidationTransformer does:

  1. set the field to null, which will later be set by NullValueTransformer to the millis since epoch at ingestion time
  2. if tableConfig.ingestionConfig.continueOnError is false, aborts the execution
  3. otherwise log a message in debug level

The log in point 3 is not useful. In case it is disabled, no log is present. In case it is enabled, a log is printed for each invalid row. In cases where the error is present in most rows (like for example when the row contains seconds from epoch but schema is defined as millis from epoch) this log is very spammy.

We already supported a way to mark a row as incorrect which is used by most transformers but not TimeValidationTransformer. This PR modifies TimeValidationTransformer to do so.

Important to know

This PR changes the semantics of the ingestion. Before this PR, if the primary time column contains an invalid value, the field was marked as null and the value of NOW() at ingestion time was stored as default value.

After this PR, if the primary time column contains invalid value, the value is replaced with the default value. If it is real-time consumed, ServerMeter.INCOMPLETE_REALTIME_ROWS_CONSUMED will be increased.

cc @Jackie-Jiang @swaminathanmanish @snleee

codecov-commenter commented 7 months ago

Codecov Report

Merging #11907 (e11947f) into master (55f29fe) will increase coverage by 0.00%. The diff coverage is 100.00%.

@@            Coverage Diff            @@
##             master   #11907   +/-   ##
=========================================
  Coverage     61.42%   61.43%           
+ Complexity     1147     1146    -1     
=========================================
  Files          2375     2375           
  Lines        128530   128531    +1     
  Branches      19853    19853           
=========================================
+ Hits          78944    78957   +13     
+ Misses        43884    43867   -17     
- Partials       5702     5707    +5     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (ø)
integration <0.01% <0.00%> (ø)
integration1 <0.01% <0.00%> (ø)
integration2 0.00% <0.00%> (ø)
java-11 61.40% <100.00%> (+0.04%) :arrow_up:
java-21 34.73% <0.00%> (-26.56%) :arrow_down:
skip-bytebuffers-false 61.42% <100.00%> (+0.01%) :arrow_up:
skip-bytebuffers-true 34.71% <0.00%> (-26.55%) :arrow_down:
temurin 61.43% <100.00%> (+<0.01%) :arrow_up:
unittests 61.42% <100.00%> (+<0.01%) :arrow_up:
unittests1 46.65% <0.00%> (+0.01%) :arrow_up:
unittests2 27.58% <100.00%> (-0.01%) :arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...l/recordtransformer/TimeValidationTransformer.java 77.77% <100.00%> (+0.50%) :arrow_up:

... and 16 files with indirect coverage changes

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more