Open Junyewu opened 1 year ago
@zhangyue19921010
The following PR, introduced in hudi-0.11.0-rc1, causes the slow load issue again: [HUDI-2779] Cache BaseDir if HudiTableNotFound Exception thrown (https://github.com/apache/hudi/pull/4014)
When I removed that code in hudi-0.12.1, the slow load problem was alleviated.
if (baseDir != null) {
// Check whether baseDir in nonHoodiePathCache
// if (nonHoodiePathCache.contains(baseDir.toString())) {
// if (LOG.isDebugEnabled()) {
// LOG.debug("Accepting non-hoodie path from cache: " + path);
// }
// return true;
// }
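For readers unfamiliar with the idea behind the non-Hudi path cache discussed here, the following is a minimal, hypothetical sketch and not Hudi's actual HoodieROTablePathFilter. It assumes the base directory sits a fixed three levels above each data file (matching the {baseDir}/{partitionDir}/{partitionDir}/{data file} layout in this report) and uses a made-up isHoodieTable predicate in place of the real, expensive table-metadata lookup. The class name, method names, and the lookups counter are all illustrative only.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical sketch of caching non-Hudi base directories in a path filter. */
public class CachingPathFilterSketch {
    // Base directories already known NOT to be Hudi tables.
    private final Set<String> nonHoodiePathCache = new HashSet<>();
    // Assumption: stand-in for the expensive "is this a Hudi table?" check.
    private final Predicate<String> isHoodieTable;
    // Counts how often the expensive check actually ran.
    private int lookups = 0;

    public CachingPathFilterSketch(Predicate<String> isHoodieTable) {
        this.isHoodieTable = isHoodieTable;
    }

    /** Assumed layout: {baseDir}/{partitionDir}/{partitionDir}/{data file}. */
    static String baseDirOf(String filePath) {
        String dir = filePath;
        for (int i = 0; i < 3; i++) {
            int idx = dir.lastIndexOf('/');
            if (idx <= 0) {
                return null; // path is too shallow to have a base directory
            }
            dir = dir.substring(0, idx);
        }
        return dir;
    }

    public boolean accept(String filePath) {
        String baseDir = baseDirOf(filePath);
        if (baseDir != null) {
            // Fast path: base directory already known to be a non-Hudi path.
            if (nonHoodiePathCache.contains(baseDir)) {
                return true;
            }
            lookups++;
            if (!isHoodieTable.test(baseDir)) {
                // Remember the result so sibling files skip the expensive check.
                nonHoodiePathCache.add(baseDir);
                return true;
            }
        }
        // A real filter would apply Hudi's file-selection logic here;
        // this sketch simply accepts the file.
        return true;
    }

    public int lookups() {
        return lookups;
    }
}
```

In this sketch, filtering many files under one non-Hudi base directory triggers the expensive check only once; whether that matches the behavior observed in this issue depends on how the real filter resolves baseDir.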
Example:
==submit application==
# sudo -u hive spark-sql --master yarn --conf spark.hadoop.mapreduce.input.pathFilter.class=org.apache.hudi.hadoop.HoodieROTablePathFilter --jars s3://bucket1/hudi-spark3.1-bundle_2.12-0.12.1.jar
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/12/12 16:02:04 WARN HiveConf: HiveConf of name hive.server2.thrift.url does not exist
22/12/12 16:02:05 WARN Utils: spark.executor.instances less than spark.dynamicAllocation.minExecutors is invalid, ignoring its setting, please update your configs.
22/12/12 16:02:05 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
22/12/12 16:02:14 WARN Utils: spark.executor.instances less than spark.dynamicAllocation.minExecutors is invalid, ignoring its setting, please update your configs.
22/12/12 16:02:14 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
Spark master: yarn, Application Id: application_1660903282590_9234
==run load==
spark-sql> create or replace temporary view src_order_cate_query using parquet options(path 's3://bucket1/search_offline/src_order_cate_query/'); // this path has 570 partitions and about 54,000 parquet files
22/12/12 16:04:02 WARN SharedInMemoryCache: Evicting cached table partition metadata from memory due to size constraints (spark.sql.hive.filesourcePartitionFileCacheSize = 262144000 bytes). This may impact query planning performance.
Time taken: 97.353 seconds
Actually, for legacy MapReduce, this patch is very important. Without it, HoodieROTablePathFilter will be thousands of times slower.
Can we just bring back https://github.com/apache/hudi/pull/3719 and fix the NPE?
Should we open an Apache JIRA for this?
Describe the problem you faced
When HoodieROTablePathFilter is used to load plain Parquet files, loading becomes too slow once the dataset reaches a certain order of magnitude.
For example: 500 partitions and 50,000 data files
data path: s3://bucket1/{baseDir}/{partitionDir}/{partitionDir}/{data file}
To Reproduce
Steps to reproduce the behavior:
1. Submit the Spark application
2. Create the temp view
3. The slow load then occurs
Environment Description
Hudi version : 0.10.0
Spark version : 3.1.1
Hive version : 3.1.2
Hadoop version : 3.2.1
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no
Additional context
Applying the PR https://github.com/apache/hudi/pull/3719 mitigates this problem: running the same load again finishes in about 60 seconds.
At the same time, we have not reproduced the problem reported in https://github.com/apache/hudi/issues/4188. In our Spark cluster, this PR [HUDI-3719] has been used to query partitioned tables for half a year, such as:
Stacktrace