dataproblems opened this issue 3 days ago
There is a check for when populateMetaFields is disabled, and I see your table config option is set to hoodie.populate.meta.fields=false.
// When meta fields are populated, the reader uses the _hoodie_record_key meta column;
// otherwise the record key must be resolved from the table config, and only a single
// record key field is supported there.
if (tableConfig.populateMetaFields()) {
  HoodieRecord.RECORD_KEY_METADATA_FIELD
} else {
  val keyFields = tableConfig.getRecordKeyFields.get()
  // Fails with IllegalStateException when the table has a composite record key.
  checkState(keyFields.length == 1)
  keyFields.head
}
@danny0405 - are you saying that I need to set hoodie.populate.meta.fields=true during the bulk insert operation?
Yes, if your primary key consists of multiple fields.
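For example, something like this on the write side (a minimal sketch; the option keys are the standard write configs, df stands in for your input DataFrame, and the key/partition field names and paths are placeholders):
import org.apache.spark.sql.SaveMode

// Sketch: a table with a composite record key, bulk inserted with meta fields
// populated so the reader does not have to resolve a single key column from
// the table config.
val writeOptions = Map(
  "hoodie.table.name" -> "composite_key_table",                 // placeholder name
  "hoodie.datasource.write.operation" -> "bulk_insert",
  "hoodie.datasource.write.recordkey.field" -> "col_a,col_b",   // multiple key fields
  "hoodie.datasource.write.partitionpath.field" -> "col_c",
  "hoodie.populate.meta.fields" -> "true"                       // keep meta fields enabled
)

df.write.format("hudi")
  .options(writeOptions)
  .mode(SaveMode.Append)
  .save("/tmp/composite_key_table")                             // placeholder path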
@danny0405 - When I enabled that for the table with multiple fields in the record key, I noticed that the bulk insert operation takes an unreasonably long time. Something that took roughly 10 minutes before this change ran for over 2 hours and failed with the following exception:
java.io.EOFException: Unexpected EOF while trying to read response from server
at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:538) ~[hadoop-client-api-3.3.3-amzn-3.jar:?]
at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213) ~[hadoop-client-api-3.3.3-amzn-3.jar:?]
at org.apache.hadoop.hdfs.DataStreamer$ResponseProcessor.run(DataStreamer.java:1137) ~[hadoop-client-api-3.3.3-amzn-3.jar:?]
24/10/08 19:19:15 WARN DataStreamer: DataStreamer Exception
java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[?:1.8.0_422]
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[?:1.8.0_422]
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[?:1.8.0_422]
at sun.nio.ch.IOUtil.write(IOUtil.java:65) ~[?:1.8.0_422]
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:470) ~[?:1.8.0_422]
at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:62) ~[hadoop-client-api-3.3.3-amzn-3.jar:?]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:141) ~[hadoop-client-api-3.3.3-amzn-3.jar:?]
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:158) ~[hadoop-client-api-3.3.3-amzn-3.jar:?]
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:116) ~[hadoop-client-api-3.3.3-amzn-3.jar:?]
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) ~[?:1.8.0_422]
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) ~[?:1.8.0_422]
at java.io.DataOutputStream.flush(DataOutputStream.java:123) ~[?:1.8.0_422]
at org.apache.hadoop.hdfs.DataStreamer.sendPacket(DataStreamer.java:858) ~[hadoop-client-api-3.3.3-amzn-3.jar:?]
at org.apache.hadoop.hdfs.DataStreamer.sendHeartbeat(DataStreamer.java:876) ~[hadoop-client-api-3.3.3-amzn-3.jar:?]
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:675) ~[hadoop-client-api-3.3.3-amzn-3.jar:?]
24/10/08 19:23:19 ERROR TransportRequestHandler: Error sending result RpcResponse[requestId=8906630334403057573,body=NioManagedBuffer[buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=156]]] to /10.0.26.16:52206; closing connection
java.io.IOException: Broken pipe
Was this expected? I'm not operating with a lot of data for this test, but if the bulk insert operation takes exponentially longer with this config, it would not be something that we can use.
It looks like a Hadoop error; are there any clues related to Hudi specifically? The write without metadata fields should be faster, but the difference should not be that large; a 2~3x performance gap is expected there.
No - given that it executes for over 2 hours, I would assume that it is stemming from something within Hudi. I see this: ERROR AppendDataExec: Data source write support org.apache.hudi.spark3.internal.HoodieDataSourceInternalBatchWrite@16d3b8c5 aborted.
Hi @dataproblems
Could you please share your spark-submit/pyspark command here? I can see you have mentioned that the Spark version is 3.3, while the error above points to spark3. I have tested the following sample code and it worked without any issues.
Cluster Details:
spark-shell \
--jars packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.0-beta2.jar \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
--conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar' \
--conf spark.ui.port=14040
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.common.table.HoodieTableConfig._
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.hudi.keygen.constant.KeyGeneratorOptions._
import org.apache.hudi.common.model.HoodieRecord
import org.apache.hudi.common.table.HoodieTableConfig
import org.apache.hudi.config.HoodieIndexConfig
import org.apache.hudi.common.config.HoodieStorageConfig
import org.apache.hudi.common.config.HoodieMetadataConfig
import org.apache.hudi.execution.bulkinsert.BulkInsertSortMode
import spark.implicits._
val columns = Seq("ts","uuid","rider","driver","fare","city")
val data =
Seq((1695159649087L,"334e26e9-8355-45cc-97c6-c31daf0df330","rider-A","driver-K",19.10,"san_francisco"),
(1695091554788L,"e96c4396-3fad-413a-a942-4cb36106d721","rider-C","driver-M",27.70 ,"san_francisco"),
(1695046462179L,"9909a8b1-2d15-4d3d-8ec9-efc48c536a00","rider-D","driver-L",33.90 ,"san_francisco"),
(1695516137016L,"e3cf430c-889d-4015-bc98-59bdce1e530c","rider-F","driver-P",34.15,"sao_paulo" ),
(1695115999911L,"c8abbe79-8d89-47ea-b4ce-4d224bae5bfa","rider-J","driver-T",17.85,"chennai"));
var inserts = spark.createDataFrame(data).toDF(columns:_*)
val tableName = "trips_table"
val basePath = "file:///tmp/trips_table"
val bulkWriteOptions: Map[String, String] = Map(
DataSourceWriteOptions.OPERATION.key() -> DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL,
DataSourceWriteOptions.TABLE_TYPE.key() -> DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,
HoodieStorageConfig.PARQUET_COMPRESSION_CODEC_NAME.key() -> "snappy",
HoodieStorageConfig.PARQUET_MAX_FILE_SIZE.key() -> "2147483648",
"hoodie.parquet.small.file.limit" -> "1073741824",
HoodieTableConfig.POPULATE_META_FIELDS.key() -> "false",
HoodieWriteConfig.BULK_INSERT_SORT_MODE.key() -> BulkInsertSortMode.GLOBAL_SORT.name(),
HoodieMetadataConfig.ENABLE_METADATA_INDEX_COLUMN_STATS.key() -> "true",
HoodieIndexConfig.INDEX_TYPE.key() -> "RECORD_INDEX",
DataSourceWriteOptions.META_SYNC_ENABLED.key() -> "false",
"hoodie.metadata.record.index.enable" -> "true",
"hoodie.metadata.enable" -> "true",
"hoodie.datasource.write.hive_style_partitioning" -> "true",
"hoodie.clustering.inline" -> "true",
"hoodie.clustering.plan.strategy.target.file.max.bytes" -> "2147483648",
"hoodie.clustering.plan.strategy.small.file.limit" -> "1073741824",
"hoodie.datasource.write.partitionpath.field" -> "city",
"hoodie.datasource.write.recordkey.field" -> "uuid",
"hoodie.datasource.write.precombine.field" -> "ts",
"hoodie.table.name" -> tableName
)
inserts.write.format("hudi").
options(bulkWriteOptions).
mode(Overwrite).
save(basePath)
val tripsDF = spark.read.format("hudi").load(basePath)
tripsDF.show(false)
Hi @rangareddy. The packages I use are --packages org.apache.hudi:hudi-spark3.3-bundle_2.12:1.0.0-beta1,org.apache.hudi:hudi-aws:1.0.0-beta1. Another thing to note is that my recordKey is complex (multiple fields). I'm not sure if that impacts the change.
Are you suggesting that 1.0.0-beta2 should only be used with Spark 3.5? (For managing other dependencies, we're using EMR 6.x, which does not have Spark 3.5 and supports up to Spark 3.4.)
I also tried the Spark 3.4 bundle and got the exact same error while trying to read the table, so it's reproducible on my end.
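For reference, a variation of your sample above along these lines should reproduce what I'm seeing (the composite key fields here are placeholders, not my actual schema):
// Same quickstart data as above, but written with a composite record key while
// hoodie.populate.meta.fields stays false; the subsequent read then hits the
// checkState(keyFields.length == 1) shown earlier and fails with IllegalStateException.
val reproOptions = bulkWriteOptions ++ Map(
  HoodieTableConfig.POPULATE_META_FIELDS.key() -> "false",
  "hoodie.datasource.write.recordkey.field" -> "uuid,rider"  // composite key (placeholder fields)
  // Depending on the version, hoodie.datasource.write.keygenerator.class may also
  // need to be set to org.apache.hudi.keygen.ComplexKeyGenerator.
)

inserts.write.format("hudi").
  options(reproOptions).
  mode(Overwrite).
  save(basePath)

// Reader side: this is where the IllegalStateException surfaces.
spark.read.format("hudi").load(basePath).show(false)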
Describe the problem you faced
I'm creating a Hudi table using the bulk insert operation, and the reader of the table fails with IllegalStateException.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
I should be able to read the data back into a dataframe with no exceptions.
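That is, a plain datasource load along these lines (basePath being the table's S3 base path) should return the bulk-inserted rows without throwing:
// Standard Hudi datasource read; this currently raises the IllegalStateException.
val tripsDF = spark.read.format("hudi").load(basePath)
tripsDF.show(false)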
Environment Description
Hudi version : 1.0.0-beta2
Spark version : 3.3.2
Hive version :
Hadoop version : 3.3.3
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no
Additional context
Here are the hudi options I'm using for the bulk insert:
Here's the hoodie.properties from the table that was generated using 1.0.0-beta2
Stacktrace