[SUPPORT] Loading plain Parquet files is too slow with HoodieROTablePathFilter in Hudi releases

Junyewu commented 1 year ago

Describe the problem you faced

When HoodieROTablePathFilter is configured, loading plain Parquet files becomes too slow once the number of files reaches a certain order of magnitude.

For example: 500 partitions and 50,000 data files.

data path: s3://bucket1/{baseDir}/{partitionDir}/{partitionDir}/{data file}

To Reproduce

Steps to reproduce the behavior:

Submit the Spark application:

spark-sql --master yarn \
--conf spark.hadoop.mapreduce.input.pathFilter.class=org.apache.hudi.hadoop.HoodieROTablePathFilter

Create a temporary view:

create or replace temporary view {user_view} using parquet options (path "s3://bucket1/{baseDir}/");

The view then loads very slowly.

Environment Description

Hudi version : 0.10.0
Spark version : 3.1.1
Hive version : 3.1.2
Hadoop version : 3.2.1
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no

Additional context

Applying the PR https://github.com/apache/hudi/pull/3719 mitigates this problem. Running the same statement again:

create or replace temporary view {user_view} using parquet options (path "s3://bucket1/{baseDir}/");

finishes in about 60 seconds:

22/12/09 14:01:41 WARN SharedInMemoryCache: Evicting cached table partition metadata from memory due to size constraints (spark.sql.hive.filesourcePartitionFileCacheSize = 262144000 bytes). This may impact query planning performance.

Time taken: 61.771 seconds

We have also not been able to reproduce the problem reported in https://github.com/apache/hudi/issues/4188. In our Spark cluster, this PR (#3719) has been used to query partitioned tables for half a year, for example:

==create table==
CREATE EXTERNAL TABLE `pickinglogs`(
  `_hoodie_commit_time` string COMMENT '',
  `_hoodie_commit_seqno` string COMMENT '',
  `_hoodie_record_key` string COMMENT '',
  `_hoodie_partition_path` string COMMENT '',
  `_hoodie_file_name` string COMMENT '',
  `id` string COMMENT 'ID',

.......

  `meta_es_offset` string COMMENT '',
  `meta_type` string COMMENT '',
  `meta_status` int COMMENT '',
  `meta_md5` string COMMENT '',
  `ptk_time_create` string COMMENT '')
PARTITIONED BY (
  `year` string COMMENT '',
  `month` string COMMENT '',
  `day` string COMMENT '')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION

==query for sparksql==
spark-sql> select count(id) from pickinglogs where year=2022 and month between '08' and '10';
441834287
Time taken: 22.095 seconds, Fetched 1 row(s)

Stacktrace

Junyewu commented 1 year ago

@zhangyue19921010

Junyewu commented 1 year ago

The following PR, introduced in hudi-0.11.0-rc1, causes the slow load issue again: [HUDI-2779] Cache BaseDir if HudiTableNotFound Exception thrown (https://github.com/apache/hudi/pull/4014).

When I removed the following code in hudi-0.12.1, the slow load problem was alleviated:

      if (baseDir != null) {
        // Check whether baseDir in nonHoodiePathCache
//        if (nonHoodiePathCache.contains(baseDir.toString())) {
//          if (LOG.isDebugEnabled()) {
//            LOG.debug("Accepting non-hoodie path from cache: " + path);
//          }
//          return true;
//        }

Example:

==submit application==
# sudo -u hive spark-sql --master yarn --conf spark.hadoop.mapreduce.input.pathFilter.class=org.apache.hudi.hadoop.HoodieROTablePathFilter --jars s3://bucket1/hudi-spark3.1-bundle_2.12-0.12.1.jar
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/12/12 16:02:04 WARN HiveConf: HiveConf of name hive.server2.thrift.url does not exist
22/12/12 16:02:05 WARN Utils: spark.executor.instances less than spark.dynamicAllocation.minExecutors is invalid, ignoring its setting, please update your configs.
22/12/12 16:02:05 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
22/12/12 16:02:14 WARN Utils: spark.executor.instances less than spark.dynamicAllocation.minExecutors is invalid, ignoring its setting, please update your configs.
22/12/12 16:02:14 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
Spark master: yarn, Application Id: application_1660903282590_9234

==run load==
spark-sql> create or replace temporary view src_order_cate_query using parquet options(path 's3://bucket1/search_offline/src_order_cate_query/');    // this path has 570 partitions and about 54,000 Parquet files
22/12/12 16:04:02 WARN SharedInMemoryCache: Evicting cached table partition metadata from memory due to size constraints (spark.sql.hive.filesourcePartitionFileCacheSize = 262144000 bytes). This may impact query planning performance.
Time taken: 97.353 seconds

zhuoluoy commented 1 year ago

Actually, for legacy MapReduce, this patch is very important. Without it, HoodieROTablePathFilter is thousands of times slower.

Can we just bring back https://github.com/apache/hudi/pull/3719 and fix the NPE?
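
As a rough illustration of why caching per directory can matter at this scale, here is a minimal, hypothetical sketch. It is not Hudi's HoodieROTablePathFilter code and does not claim to show what either PR does. `DirCacheSketch` and `countLookups` are made-up names, the "lookup" is only a counter standing in for a storage round trip, and the even split of 100 files per partition is an assumption based on the 500 partitions / 50,000 files example in the issue description.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not Hudi's HoodieROTablePathFilter.
class DirCacheSketch {

    // Counts simulated "expensive" lookups (stand-ins for storage round trips)
    // needed to filter every file, with or without a per-directory cache.
    static long countLookups(int partitions, int filesPerPartition, boolean cacheDirs) {
        Set<String> checkedDirs = new HashSet<>();
        long lookups = 0;
        for (int p = 0; p < partitions; p++) {
            String dir = "baseDir/partition=" + p; // assumed layout, for illustration
            for (int f = 0; f < filesPerPartition; f++) {
                if (cacheDirs && checkedDirs.contains(dir)) {
                    continue; // directory already classified, accept from cache
                }
                lookups++; // simulated storage lookup for this file's directory
                if (cacheDirs) {
                    checkedDirs.add(dir);
                }
            }
        }
        return lookups;
    }

    public static void main(String[] args) {
        // 500 partitions x 100 files = 50,000 files (even split is an assumption)
        System.out.println("without cache: " + countLookups(500, 100, false));
        System.out.println("with cache:    " + countLookups(500, 100, true));
    }
}
```

With these assumed numbers, the uncached loop does one simulated lookup per file (50000) and the cached loop does one per partition directory (500). The real difference depends on what the filter actually does for each path and on S3 latency.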

zhuoluoy commented 1 year ago

Should we open an Apache JIRA for this?

[SUPPORT] Loading plain Parquet files is too slow with HoodieROTablePathFilter in Hudi releases #7417