AbsaOSS / spline-spark-agent

Spline agent for Apache Spark
https://absaoss.github.io/spline/
Apache License 2.0

Spark 3.4 support #705

Closed. wajda closed this 1 day ago.

wajda commented 1 year ago

There seem to be binary-incompatible changes in the APIs of Delta and Spark SQL that the Spline core, compiled against Spark 2.4, cannot work with. E.g. RDDPlugin

Todo:

  1. Check how many places in the code are actually incompatible and whether there is a quick, simple way to fix them (a rough version-check sketch is shown after this list).
  2. Otherwise, we'll probably need to move toward Spark-version-specific builds (#604).
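
As a small illustration of what version-specific handling can involve, here is a hedged sketch of detecting the running Spark version at runtime; the SparkVersionCompat helper is hypothetical and not part of the Spline codebase, and the binary-incompatible calls themselves would still need reflection or per-version modules:

```scala
import org.apache.spark.SPARK_VERSION

// Hypothetical helper (not in the Spline codebase): decide at runtime whether
// the agent is running on a Spark version with the newer, incompatible APIs.
object SparkVersionCompat {

  // SPARK_VERSION is a real Spark constant, e.g. "3.4.1"
  private val Array(major, minor) =
    SPARK_VERSION.split('.').take(2).map(_.toInt)

  def isAtLeast(reqMajor: Int, reqMinor: Int): Boolean =
    major > reqMajor || (major == reqMajor && minor >= reqMinor)
}

// Usage (e.g. inside a plugin): if (SparkVersionCompat.isAtLeast(3, 4)) { /* 3.4 path */ }
```
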
cerveada commented 10 months ago

Regarding Delta: As I understand it, they dropped support for using Delta without DSV2, which means Spark users will need to migrate to DSV2 Delta, but for us, it's no problem. All DeltaDSV2Spec tests are still passing.
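
For context, a DSV2-style Delta write goes through the table catalog rather than a raw file path; the sketch below is illustrative only (made-up table name and configs, delta-spark assumed to be on the classpath):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch of a DSV2 Delta write: the table is created through
// the catalog instead of being written to a direct path.
object Dsv2DeltaExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dsv2-delta-example")
      .master("local[*]")
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()

    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

    // Write and read back through the catalog (the table name is a made-up example).
    df.writeTo("spark_catalog.default.events").using("delta").createOrReplace()
    spark.read.table("spark_catalog.default.events").show()
  }
}
```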

cerveada commented 10 months ago

Integration tests not working:

Unit tests not working:

Neelotpaul commented 9 months ago

Hi @wajda @cerveada any timelines for the agent to be available?

wajda commented 9 months ago

Sorry, no :( We are completely buried with other work. Upgrading to Spark 3.4 is not a priority for our organisation at the moment. If somebody from the community is up for implementing it, we would be happy to accept pull requests and release an upgrade.

cerveada commented 9 months ago

Some work is already done in the linked draft PR.

Neelotpaul commented 8 months ago

Thanks @wajda @cerveada for the update. It would be great if you could notify us when the development is completed. I will also check with our development team whether building on the pull request is feasible for us.

wajda commented 8 months ago

Unfortunately we can't give any ETA on this due to reprioritisation and team capacity. As Adam said, some work has started in the draft PR #739. It needs to be tested; it might require adding a separate bundle module for 3.4 and potentially other fixes. If you could help with it, that would be amazing. If you have any questions, please ask.

cerveada commented 6 months ago

There were some questions about what needs to be done to support the new Spark version. It comes down to two things:

  1. create pom for the new Spark uber jar
  2. run all tests on the new Spark version, find out where spark changed and fix the agent to accommodate those changes without breaking anything in the older versions

You can use #459 as inspiration for this. The current PR #739 is already solving part of the issues.

wajda commented 6 months ago

create pom for the new Spark uber jar

Just a tip: we use https://github.com/wajda/jars-pommefizer to generate a pom.xml from a downloaded Apache Spark distribution. Then just make some minor corrections in the POM manually, comparing it with another similar POM in the project.

rycowhi commented 4 months ago

Hey there @wajda / @cerveada - not looking to make any promises here until I understand the full amount of work that might need to be done, after going through the above comments.

If we can get a full build running successfully based on #739 via the command below, what else is left to do?

mvn clean test -Pspark3.4 

Edit: To add some clarity here, would all the tests passing in this profile address @cerveada's concern below?

run all tests on the new Spark version, find out where spark changed and fix the agent to accommodate those changes without breaking anything in the older versions.

I see

In addition to the above, would it be possible to point us in the right direction for the BasicIntegrationTests "saveAsTable and read.table" should "produce equal URIs" test failures? I seem to have resolved the Kafka one already and am trying to get a start on what seems to be the larger issue.

Thanks in advance!

cerveada commented 4 months ago

By all tests I meant all unit tests and also all tests in the integration-tests Maven module. If you run mvn clean test -Pspark3.4 in the root folder of the repo, it should run all of them.

We use TeamCity for CI; we can modify it ourselves when this is ready.

cerveada commented 4 months ago

In addition to the above, would it be possible to point us in the right direction for the BasicIntegrationTests "saveAsTable and read.table" should "produce equal URIs" test failures? I seem to have resolved the Kafka one already and am trying to get a start on what seems to be the larger issue.

The test must validate that when you write to a data source and then read from it, the URIs will be identical. To simulate this I do something like this:

val df = spark.read.load("/path/to/A")   // read source A
df.write.saveAsTable("B")                // I compare this write URI, but now it's A, not B

val df2 = spark.read.table("B")          // and this read URI

I think the issue there is that Spark will now give you the URI of the original data (A), not the artificial table (B) created from it. So the test must somehow be improved or modified to test the same thing as before.

Hope it makes sense; I don't remember the actual issue, but from what I wrote here before, I think this is it.

rycowhi commented 4 months ago

@cerveada Thanks, this is helpful.

WRT the BasicIntegrationTests issue I found something interesting.

I'm still learning my way around the codebase but I gathered I could find differences by running the test in 3.3 then in 3.4, while printing out the logical plan in LineageHarvester with a quick print statement.

Interestingly enough, they look pretty close to the same, with a few new fields added in 3.4.

Here comes the fun part - my print in LineageHarvester is done twice on this test in 3.3, but four times in 3.4!

In 3.3 this makes sense - the test runs and creates two lineage events since there were two writes. There are two CreateDataSourceTableAsSelectCommand commands in the output.

In 3.4 it gets weird - the same two events are still there, but now two additional logical plans show up as well! Each write has an additional InsertIntoHadoopFsRelationCommand that corresponds to the data being inserted. Some of the semantics around this look a little different from the regular op mappings.

The test is then failing because the lineage captor for the second write is actually getting the second event for the first write. If I ignore the second event (by calling another captor) it actually passes! I don't know if this is the right thing to do given that Spline will be firing extra events.

I ran into this same issue while fixing another test - it appears Spark is doing this for CTAS into both regular Spark tables and Hive tables.
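
For anyone reproducing this, one throwaway way (outside of Spline itself) to count how many commands Spark fires per action is a quick QueryExecutionListener; the class below is only a debugging sketch, not part of the agent or its tests:

```scala
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

// Debugging sketch only: print the analyzed logical plan's top-level class for
// every successful action, to see how many commands a single CTAS produces.
class PlanClassLogger extends QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    println(s"$funcName -> ${qe.analyzed.getClass.getSimpleName}")

  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()
}

// Usage in a test: spark.listenerManager.register(new PlanClassLogger)
```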

Some great news:

Some ok news:

Going to look into the POM piece now - there is a PR #793 addressing what I've done thus far. Would appreciate a look to see if we are fine with this approach.