delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0

[Flink] Runtime of Flink test suite is too long #2728

Open tlm365 opened 7 months ago

tlm365 commented 7 months ago

Bug

Which Delta project/connector is this regarding?

Describe the problem

Looking at recent PRs, I noticed that the Flink test suite takes too long to run on all of them. I then checked the GitHub Actions logs of some PRs and found two Flink-related issues. (1) There are many ERROR logs related to package import errors, for example:

2024-03-07T02:17:54.4899309Z [error] /home/runner/work/delta/delta/connectors/flink/src/main/java/io/delta/flink/sink/DeltaSink.java:21:1:  error: package io.delta.flink.sink.internal does not exist
2024-03-07T02:17:54.4915526Z [error] /home/runner/work/delta/delta/connectors/flink/src/main/java/io/delta/flink/sink/RowDataDeltaSinkBuilder.java:21:1:  error: package io.delta.flink.internal.options does not exist
2024-03-07T02:17:54.4937591Z [error] /home/runner/work/delta/delta/connectors/flink/src/main/java/io/delta/flink/source/DeltaSource.java:3:1:  error: package io.delta.flink.internal.options does not exist

(2) There are some ERROR logs related to networking, for example:

2024-03-07T02:20:18.5477483Z [info] 2024-03-07 02:20:18 WARN  Task:1091 - Source: delta-source -> Map (4/4)#0 (3c61a16633bec24402cf3677371d70f7_cbc357ccb763df2852fee8c4fc7d55f2_3_0) switched from RUNNING to FAILED with failure cause: org.apache.flink.util.FlinkException: Disconnect from JobManager responsible for 859ea617655f9ffcac469b4409a54ff7.
...
2024-03-07T02:20:48.5324962Z [info] 2024-03-07 02:20:48 WARN  SplitFetcherManager:214 - Failed to close the source reader in 30000 ms. There are still 1 split fetchers running
2024-03-07T02:20:49.1227667Z [info] 2024-03-07 02:20:49 WARN  NettyTransport:119 - Remote connection to [localhost/127.0.0.1:37841] failed with java.io.IOException: Connection reset by peer

Steps to reproduce

Go to the GitHub Actions workflow of any PR that passes all unit tests, then use "Download log archive" to get the log. I got the log excerpts above from here
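To gauge how many of these import errors a downloaded log contains, a minimal Python sketch could tally them per package. The regular expression is an assumption inferred only from the three sample lines above, and `count_missing_packages` is a hypothetical helper name, not part of any Delta tooling.

```python
import re
from collections import Counter

# Pattern inferred from the sample log lines in this issue (an assumption,
# not a documented log format): "error: package <name> does not exist".
MISSING_PACKAGE = re.compile(r"error: package (\S+) does not exist")


def count_missing_packages(lines):
    """Count 'package ... does not exist' errors per package name."""
    counts = Counter()
    for line in lines:
        match = MISSING_PACKAGE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The three sample lines quoted in this issue.
sample = [
    "2024-03-07T02:17:54.4899309Z [error] /home/runner/work/delta/delta/connectors/flink/src/main/java/io/delta/flink/sink/DeltaSink.java:21:1:  error: package io.delta.flink.sink.internal does not exist",
    "2024-03-07T02:17:54.4915526Z [error] /home/runner/work/delta/delta/connectors/flink/src/main/java/io/delta/flink/sink/RowDataDeltaSinkBuilder.java:21:1:  error: package io.delta.flink.internal.options does not exist",
    "2024-03-07T02:17:54.4937591Z [error] /home/runner/work/delta/delta/connectors/flink/src/main/java/io/delta/flink/source/DeltaSource.java:3:1:  error: package io.delta.flink.internal.options does not exist",
]

sample_counts = count_missing_packages(sample)
for package, count in sample_counts.most_common():
    print(f"{count:4d}  {package}")
```

For a real log, the same function can be fed an open file object, since it only iterates over lines.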

Observed results

I have listed the errors above.

Expected results

The errors should be handled, at least the import errors, and the running time of the Flink test suite should be reduced.

Willingness to contribute

The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?

tlm365 commented 7 months ago

@vkorukanti Would you take a look pls?

vkorukanti commented 7 months ago

@tlm365 Thanks for creating this issue. It is a known issue, and unfortunately I don't have the context to root-cause it.

cc. @scottsand-db @nicklan

scottsand-db commented 7 months ago

Hi @tlm365 - thanks for making this issue.

There are many ERROR logs related to package import errors, for example:

These are actually just part of our javadoc generation. They occur because public classes and interfaces (which we do generate javadoc for) import internal implementation classes (which we exclude from javadoc).
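The mechanism can be illustrated with a toy Python model. It is only a sketch of the idea, not how javadoc or the Delta build actually resolves sources: the excluded prefixes, the `unresolved_imports` helper, and the `SomeInternalOptions.java` entry are hypothetical, while the other file and package names come from the log lines in this issue.

```python
def unresolved_imports(sources, excluded_prefixes):
    """Return (file, imported package) pairs that a doc run could not resolve.

    sources maps a file name to (its own package, list of imported packages).
    Files in excluded packages are not passed to the doc tool, so any
    documented file that imports from them refers to a missing package.
    """
    def excluded(package):
        return any(
            package == prefix or package.startswith(prefix + ".")
            for prefix in excluded_prefixes
        )

    missing = []
    for name, (package, imports) in sources.items():
        if excluded(package):
            continue  # internal file: left out of the doc run entirely
        for imported in imports:
            if excluded(imported):
                missing.append((name, imported))
    return missing


# Hypothetical exclusion list; the real build configuration may differ.
excluded_prefixes = ["io.delta.flink.sink.internal", "io.delta.flink.internal"]

sources = {
    "DeltaSink.java": ("io.delta.flink.sink", ["io.delta.flink.sink.internal"]),
    "DeltaSource.java": ("io.delta.flink.source", ["io.delta.flink.internal.options"]),
    # Hypothetical internal file, present only to show that it is skipped.
    "SomeInternalOptions.java": ("io.delta.flink.internal.options", []),
}

result = unresolved_imports(sources, excluded_prefixes)
for name, imported in result:
    print(f"{name}: error: package {imported} does not exist")
```

In this model the public files still compile in a normal build, where the internal packages are on the source path; the errors appear only in the doc run that leaves those packages out.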

My question to you is: are you concerned that these are truly dangerous errors? Or is the volume of print statements the concern here? Is the thought that printing to the log is slow?

(2) There are some ERROR logs related to networking, for example:

This is new to me. I'll take a look.

My Overall Comments

Overall, I totally agree that our Flink tests are slow. Our Spark tests are slow, too. We need to scale and parallelize our testing infrastructure.

tlm365 commented 7 months ago

My question to you is: are you concerned that these are truly dangerous errors? Or is the volume of print statements the concern here? Is the thought that printing to the log is slow?

This is new to me. I'll take a look.

Hi @scottsand-db, thank you so much. I have no concerns if it is for javadoc generation.