Open leobiscassi opened 1 year ago
The incremental source infers schema by simply loading the dataset from the source table. What you're proposing is a good enhancement. Would you like to take it up? HUDI-5997
@codope yes, I would, though I would probably need some guidance (mostly with the build system and how to compile the project on my own; are the docs up to date?). Do you think that's possible?
@leobiscassi The dev setup page has all the details to help you get started with the project. If you face any issues, I can sync up with you over a call.
As for the enhancement, we just need to enforce and set the schema while loading the dataset in S3EventsHoodieIncrSource (the code block that you posted).
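To make "enforce and set the schema while loading" concrete, here is a minimal, Hudi-independent sketch in plain Python. The names SOURCE_FIELDS and load_with_schema are hypothetical and exist only for illustration; they are not part of Hudi or Spark. In the actual source the equivalent step would presumably be handing an explicit schema to Spark's reader rather than hand-written projection code.

```python
import json

# Hypothetical declared source schema: field name -> expected Python type.
SOURCE_FIELDS = {"id": int, "name": str, "email": str}


def load_with_schema(json_lines, fields):
    """Parse JSON lines and project every record onto the declared fields.

    Missing fields become None, undeclared fields are dropped, and a value
    of the wrong type raises TypeError.
    """
    rows = []
    for line in json_lines:
        record = json.loads(line)
        row = {}
        for name, expected_type in fields.items():
            value = record.get(name)
            if value is not None and not isinstance(value, expected_type):
                raise TypeError(
                    f"field {name!r}: expected {expected_type.__name__}"
                )
            row[name] = value
        rows.append(row)
    return rows


# The first record lacks "email"; the second carries an undeclared "extra".
lines = [
    '{"id": 1, "name": "a"}',
    '{"id": 2, "name": "b", "email": "b@example.com", "extra": true}',
]
rows = load_with_schema(lines, SOURCE_FIELDS)
```

With the schema applied at load time, every row has the same set of columns regardless of which fields happened to be present in a given file.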
@codope awesome, I'm going to start working on this today and will let you know if I hit any roadblocks.
@codope I've commented on the Jira ticket with some questions. I think it's a better place to have the discussion, since that makes it easier for other people to find in the future. Thanks in advance.
Makes sense. Have updated the ticket.
Describe the problem you faced
I am running a delta streamer job to ingest JSON files from S3 using the S3EventsHoodieIncrSource. In this use case, I need to enforce the schema on the source files because some fields may or may not be present, depending on the circumstances. According to the docs, I can do this using the hoodie.deltastreamer.schemaprovider.source.schema.file parameter, but it doesn't seem to be working.
Although the documentation states that "For sources that return Dataset, the schema is obtained implicitly. However, this CLI option allows overriding the schema provider returned by Source", this does not seem to apply to this particular source. Upon examining this piece of code, it appears that the provided schema is not being explicitly set.
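As a toy illustration of why implicit schema inference is a problem when fields are optional, the following plain-Python sketch infers a "schema" per batch as the set of keys it sees. The function infer_fields and the sample batches are hypothetical; Spark's real inference is more involved, but the effect assumed here is the same: a column only shows up if it is present in the files being read.

```python
import json


def infer_fields(json_lines):
    """Infer a 'schema' as the sorted set of keys seen in one batch."""
    fields = set()
    for line in json_lines:
        fields.update(json.loads(line).keys())
    return sorted(fields)


# Two batches from the same logical source; "email" is optional.
batch_one = ['{"id": 1, "name": "a"}']
batch_two = ['{"id": 2, "name": "b", "email": "b@example.com"}']

schema_one = infer_fields(batch_one)
schema_two = infer_fields(batch_two)

# The inferred schema depends on which files landed in the batch.
schemas_differ = schema_one != schema_two
```

Because the result depends on the contents of each batch, two runs over the same source can disagree on the columns, which is what a user-supplied schema is meant to prevent.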
If I provide a source schema using the hoodie.deltastreamer.schemaprovider.source.schema.file parameter, I expect the schema to be enforced on all the files read by the job. Is it appropriate to consider this a bug? Should I file a bug ticket on Jira?
P.S.: If my assumptions and analysis are right, I'd be interested in submitting a fix for this, since this issue is affecting my workloads 😄
Environment Description
This happens in all Hudi versions that I tested (>= 0.9); I have jobs running with 0.9 and 0.11 on EMR.