
[SUPPORT] DFS Schema Provider not working with S3EventsHoodieIncrSource #8211

Open leobiscassi opened 1 year ago

leobiscassi commented 1 year ago

Describe the problem you faced

I am running a DeltaStreamer job to ingest JSON files from S3 using the S3EventsHoodieIncrSource. In this use case I need to enforce a schema on the source files, because some fields may be missing depending on the record. According to the docs, I can do this with the hoodie.deltastreamer.schemaprovider.source.schema.file parameter, but it doesn't seem to be working.
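
For context, this is roughly how the job is configured (a sketch; the bucket and schema file paths below are placeholders, and the schema provider class is passed with --schemaprovider-class on the DeltaStreamer command line):

    # properties passed to HoodieDeltaStreamer via --props
    # schema provider: org.apache.hudi.utilities.schema.FilebasedSchemaProvider
    hoodie.deltastreamer.schemaprovider.source.schema.file=s3://my-bucket/schemas/source.avsc
    hoodie.deltastreamer.schemaprovider.target.schema.file=s3://my-bucket/schemas/target.avsc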

Although the documentation states that "For sources that return Dataset, the schema is obtained implicitly. However, this CLI option allows overriding the schema provider returned by Source", this does not seem to apply to this particular source. Looking at this piece of code, it appears that the provided schema is never explicitly applied:

    String fileFormat = props.getString(SOURCE_FILE_FORMAT, DEFAULT_SOURCE_FILE_FORMAT);
    Option<Dataset<Row>> dataset = Option.empty();
    if (!cloudFiles.isEmpty()) {
      dataset = Option.of(sparkSession.read().format(fileFormat).load(cloudFiles.toArray(new String[0])));
    }
    return Pair.of(dataset, instantEndpts.getRight());

If I provide a source schema via the hoodie.deltastreamer.schemaprovider.source.schema.file parameter, I expect that schema to be enforced on all the files read by the job. Is it appropriate to consider this a bug? Should I file a bug ticket in Jira?
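
To illustrate what I mean by "enforced", here is a minimal standalone Spark sketch (not Hudi code; the paths and field names are made up) of how inference on JSON with optional fields differs from reading with an explicit schema:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    public class OptionalFieldDemo {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("optional-field-demo")
            .master("local[*]")
            .getOrCreate();

        // Inferred schema: if no record in this batch carries "optional_field",
        // the column simply does not exist in the resulting Dataset.
        Dataset<Row> inferred = spark.read().format("json").load("/tmp/batch1");
        inferred.printSchema();

        // Explicit schema: "optional_field" is always present (null when the
        // key is missing), so every batch produces the same columns.
        StructType schema = new StructType()
            .add("id", DataTypes.StringType)
            .add("optional_field", DataTypes.StringType);
        Dataset<Row> enforced = spark.read().format("json").schema(schema).load("/tmp/batch1");
        enforced.printSchema();

        spark.stop();
      }
    }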

P.S.: If my assumptions and analysis are right, I'd be interested in submitting a fix for this, since the issue is affecting my workloads 😄

Environment Description

This is happening in all Hudi versions that I have tested (>= 0.9); I have jobs running with 0.9 and 0.11 on EMR.

codope commented 1 year ago

The incremental source infers the schema simply by loading the dataset from the source table. What you're proposing is a good enhancement. Would you like to take it up? HUDI-5997

leobiscassi commented 1 year ago

@codope yes, I would, but I would probably need some guidance though (mostly with the build system and how to compile the project on my own; are the docs up to date?). Do you think that's possible?

codope commented 1 year ago

@leobiscassi The dev setup page has all the details to help you get started with the project. If you face any issues, I can sync up with you over a call. As for the enhancement, we just need to enforce and set the schema while loading the dataset in S3EventsHoodieIncrSource (the code block that you posted).
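
Something along these lines would do it (just a sketch, assuming the provided Avro schema has already been converted to a Spark StructType; the class and method names here are illustrative, not the actual patch):

    import java.util.List;

    import org.apache.spark.sql.DataFrameReader;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.StructType;

    public class SchemaEnforcedLoad {

      // Load the cloud files with an explicit schema when one is available,
      // otherwise fall back to the current inference behaviour.
      public static Dataset<Row> loadCloudFiles(SparkSession spark, String fileFormat,
                                                List<String> cloudFiles, StructType sourceSchema) {
        DataFrameReader reader = spark.read().format(fileFormat);
        if (sourceSchema != null) {
          // Apply the schema from the schema provider so that optional fields
          // resolve to null columns instead of being dropped by inference.
          reader = reader.schema(sourceSchema);
        }
        return reader.load(cloudFiles.toArray(new String[0]));
      }
    }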

leobiscassi commented 1 year ago

@codope awesome, I'm going to start working on this today and will let you know if I hit any blockers.

leobiscassi commented 1 year ago

@codope I've commented on the Jira ticket with some questions; I think that's a better place to have the discussion, so it's easier for other people to find in the future. Thanks in advance.

codope commented 1 year ago

Makes sense. Have updated the ticket.