apache / drill

Apache Drill is a distributed MPP query layer for self describing data
https://drill.apache.org/
Apache License 2.0

Parquet TIMESTAMP_MICROS columns in WHERE clauses #2788

Closed handmadecode closed 1 year ago

handmadecode commented 1 year ago

Describe the bug We are using Drill with parquet files whose timestamp columns are in microseconds. When those columns are displayed, Drill converts the microsecond values to milliseconds. However, when a timestamp column is used in a WHERE clause, it looks like the original microsecond value, rather than the adjusted millisecond value, is used when filtering records.

To Reproduce Assume a parquet file in a directory "Test" with a column timestampCol of type org.apache.parquet.schema.OriginalType.TIMESTAMP_MICROS.

Assume there are two records with the values 1673981999806149 and 1674759597743552, respectively, in that column (i.e. the UTC dates 2023-01-17T18:59:59.806149 and 2023-01-26T18:59:57.743552).

  1. Execute the query SELECT timestampCol FROM dfs.Test; The result includes both records, as expected.

  2. Execute the query SELECT timestampCol FROM dfs.Test WHERE timestampCol < TO_TIMESTAMP('2023-02-01 00:00:00', 'yyyy-MM-dd HH:mm:ss'). This produces an empty result, although both records have a value less than the argument.

  3. Execute SELECT timestampCol FROM dfs.Test WHERE timestampCol > TO_TIMESTAMP('2023-02-01 00:00:00', 'yyyy-MM-dd HH:mm:ss'). The result includes both records, although neither has a value greater than the argument.

Expected behavior The query in 2) above should produce a result with both records, and the query in 3) should produce an empty result.

Error detail, log output or screenshots No errors in the logs

Drill version 1.21.0 and master (as of commit 9c401c6).

Additional context Even timestamps far in the future match both records, e.g.: SELECT timestampCol FROM dfs.Test WHERE timestampCol > TO_TIMESTAMP('2502-04-04 00:00:00', 'yyyy-MM-dd HH:mm:ss')

Manually converting the timestamp column to milliseconds produces the expected result: SELECT timestampCol FROM dfs.Test WHERE TO_TIMESTAMP(CONVERT_FROM(CONVERT_TO(timestampCol, 'TIMESTAMP_EPOCH'), 'BIGINT')/1000) < TO_TIMESTAMP('2023-02-01 00:00:00', 'yyyy-MM-dd HH:mm:ss') returns both records.
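The expected flooring from microseconds to milliseconds can be reproduced outside Drill. A minimal sketch using the raw values from the report (plain java.time, no Drill classes):

```java
import java.time.Instant;

public class MicrosFlooringDemo {
  public static void main(String[] args) {
    // Raw TIMESTAMP_MICROS values from the two records in the report
    long[] micros = {1673981999806149L, 1674759597743552L};
    for (long us : micros) {
      long ms = us / 1000; // floor microseconds to millisecond precision
      System.out.println(us + " us -> " + Instant.ofEpochMilli(ms));
    }
  }
}
```

Flooring yields 2023-01-17T18:59:59.806Z and 2023-01-26T18:59:57.743Z, both before the 2023-02-01 cutoff, which is why both rows should satisfy the < predicate.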

handmadecode commented 1 year ago

I had a look at the code and my theory is that the problem lies in the initial scan of the parquet files. This code gets the column's min and max values from the parquet file's metadata and compares the millisecond argument from the WHERE clause to the microsecond min/max values without converting them, causing all files to be filtered out at this stage.
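The mismatch can be sketched with the report's numbers (the cutoff constant below is computed by me; this is illustrative, not Drill code). An unconverted microsecond minimum compares as roughly 1000x larger than the millisecond cutoff, so the pruning step wrongly discards the file:

```java
public class PruningMismatchSketch {
  public static void main(String[] args) {
    // Raw TIMESTAMP_MICROS minimum from the parquet footer (value from the report)
    long minMicros = 1673981999806149L;
    // WHERE-clause argument 2023-02-01T00:00:00Z, already in milliseconds
    long cutoffMillis = 1675209600000L;

    // Unconverted comparison: the microsecond value dwarfs the cutoff,
    // so "timestampCol < cutoff" appears unsatisfiable and the file is pruned.
    System.out.println(minMicros < cutoffMillis);        // false: file wrongly filtered out

    // After flooring to milliseconds the comparison is correct.
    System.out.println(minMicros / 1000 < cutoffMillis); // true: file kept
  }
}
```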

I did a quick test of this theory by adding the following to org.apache.drill.exec.store.parquet.metadata.FileMetadataCollector at line 211:

        if (columnTypeMetadata.originalType == OriginalType.TIMESTAMP_MICROS) {
          minValue = Long.valueOf(((Number) minValue).longValue() / 1000);
          maxValue = Long.valueOf(((Number) maxValue).longValue() / 1000);
        }

In context, the new code looks like this from line 203:

      if (!stats.isEmpty() && stats.hasNonNullValue()) {
        minValue = stats.genericGetMin();
        maxValue = stats.genericGetMax();
        if (containsCorruptDates == ParquetReaderUtility.DateCorruptionStatus.META_SHOWS_CORRUPTION
          && columnTypeMetadata.originalType == OriginalType.DATE) {
          minValue = ParquetReaderUtility.autoCorrectCorruptedDate((Integer) minValue);
          maxValue = ParquetReaderUtility.autoCorrectCorruptedDate((Integer) maxValue);
        }
        if (columnTypeMetadata.originalType == OriginalType.TIMESTAMP_MICROS) {
          minValue = Long.valueOf(((Number) minValue).longValue() / 1000);
          maxValue = Long.valueOf(((Number) maxValue).longValue() / 1000);
        }
      }
      long numNulls = stats.getNumNulls();
      Metadata_V4.ColumnMetadata_v4 columnMetadata = new Metadata_V4.ColumnMetadata_v4(columnTypeMetadata.name,
          primitiveTypeName, minValue, maxValue, numNulls);
      columnMetadataList.add(columnMetadata);
      columnTypeMetadata.isInteresting = true;

From my limited testing, this could potentially fix the problem. I am, however, new to this code base and would appreciate some comments on this change. Maybe there are implications I don't fully understand, or perhaps there is a better way of fixing this?

Should the fix look good, I'd be happy to create a pull request with some test cases.

jnturton commented 1 year ago

Thank you, it's always nice to receive a high-quality report like this. Drill only supports millisecond precision in its TIMESTAMP type, and it handles timestamps of greater precision by flooring to the nearest millisecond (or by sending them to a BIGINT instead). So I think your fix is on the right track. This definitely merits an issue in the Apache Jira. IIRC, to combat spammers you may now have to request a Jira account, which we approve; we'll keep an eye open for that. Once we've got a DRILL-nnnn number we can base your PR on it, and I'll take a closer look at the fix in review. We can also close this GitHub issue with a comment like "Promoted to DRILL-nnnn".

handmadecode commented 1 year ago

Thanks for the quick feedback. I have requested a Jira account and verified my email address.

handmadecode commented 1 year ago

Promoted to DRILL-8421. See https://issues.apache.org/jira/browse/DRILL-8421