Closed wajda closed 1 day ago
Regarding Delta: as I understand it, they dropped support for using Delta without DSV2, which means Spark users will need to migrate to DSV2 Delta; for us, though, it's not a problem. All DeltaDSV2Spec tests are still passing.
Hi @wajda @cerveada any timelines for the agent to be available?
Sorry, no :( We are completely buried in other work. Upgrading to Spark 3.4 is not a priority for our organisation at the moment. If somebody from the community is up for implementing it, we would be happy to accept pull requests and release an upgrade.
Some work is already done in the linked draft PR.
Thanks @wajda @cerveada for the update. It would be great if you could notify us when the development is completed. I will also check whether our development team can build on the pull request.
Unfortunately, we can't give any ETA on this due to reprioritisation and team capacity. As Adam said, some work has started in the draft PR #739. It needs to be tested; it might require adding a separate bundle module for 3.4 and potentially other fixes. If you could help with it, that would be amazing. If you have any questions, please ask.
There were some questions about what needs to be done to support the new Spark version. It comes down to two things:
You can use #459 as inspiration for this. The current PR #739 already solves part of the issues.
create a POM for the new Spark uber jar
Just a tip: we use https://github.com/wajda/jars-pommefizer to generate a pom.xml from a downloaded Apache Spark distribution. Then just make some minor corrections in the POM manually by comparing it with another similar POM in the project.
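For illustration only, a rough shape of such an uber-jar POM (every coordinate below is a placeholder, not the project's real one; the actual content comes from jars-pommefizer plus manual comparison with an existing bundle POM):

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <!-- placeholder coordinates, not the project's real ones -->
  <groupId>example.spline</groupId>
  <artifactId>spark-3.4-uber-jar</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <dependencies>
    <!-- one entry per jar shipped in the Spark 3.4 distribution,
         as emitted by the generator -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.12</artifactId>
      <version>3.4.0</version>
    </dependency>
  </dependencies>
</project>
```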
Hey there @wajda / @cerveada. I'm not looking to make any promises here until I understand the full amount of work that might need to be done, after going through the above comments.
If we can get a full build running successfully based off #739 via the command below, what else is left to do?
mvn clean test -Pspark3.4
Edit: to add some clarity here, does all the tests passing in this profile address @cerveada's concern?
run all tests on the new Spark version, find out where Spark changed, and fix the agent to accommodate those changes without breaking anything in the older versions
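As a rough illustration of "accommodating changes without breaking older versions", one common pattern is to branch on the runtime Spark version string. This is a minimal sketch (the class and method names are made up, not the agent's real version-handling code):

```java
// Illustrative sketch: gate a Spark 3.4-specific code path behind a
// runtime version check so older Spark versions keep their behaviour.
public class SparkVersionGuard {
    // Returns true for Spark 3.4 and newer. The version string would come
    // from the running Spark instance (e.g. org.apache.spark.SPARK_VERSION).
    static boolean isAtLeastSpark34(String sparkVersion) {
        String[] parts = sparkVersion.split("\\.");
        int major = Integer.parseInt(parts[0]);
        int minor = Integer.parseInt(parts[1]);
        return major > 3 || (major == 3 && minor >= 4);
    }

    public static void main(String[] args) {
        System.out.println(isAtLeastSpark34("3.4.1")); // prints true
        System.out.println(isAtLeastSpark34("3.3.2")); // prints false
    }
}
```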
I see
In addition to the above, would it be possible to point us in the right direction for the "BasicIntegrationTests: "saveAsTable and read.table" should "produce equal URIs"" test failures? I seem to have resolved the Kafka one already and am trying to get a start on what seems to be the larger issue.
Thanks in advance!
By all tests, I meant all unit tests and also all tests in the integration-tests Maven module. If you run mvn clean test -Pspark3.4 in the root folder of the repo, it should run all of them.
We use TeamCity for CI; we can modify it ourselves when this is ready.
In addition to the above, would it be possible to point us in the right direction for the "BasicIntegrationTests: "saveAsTable and read.table" should "produce equal URIs"" test failures? I seem to have resolved the Kafka one already and am trying to get a start on what seems to be the larger issue.
The test must validate that when you write to a data source and then read from it, the URI will be identical. To simulate this, I do something like this:
df = read(A)
df.createTable(B) // I compare this URI, but now it's A, not B
df2 = read(B)     // and this URI
I think the issue is that Spark will now give you the URI of the original data (A), not the artificial table (B) created from it. So the test must somehow be improved or modified to test the same thing as before.
Hope it makes sense; I don't remember the actual issue, but from what I wrote here before, I think this is it.
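The pseudocode above can be restated as a toy Java check (paths and class name are made up for illustration): the test asserts that the URI captured on the write of B equals the URI captured on the subsequent read of B, and the described 3.4 behaviour breaks that equality by reporting A on the write.

```java
import java.net.URI;

public class UriCheck {
    public static void main(String[] args) {
        // Reported 3.4 behaviour: the write event carries the source URI (A)
        // instead of the URI of the created table (B).
        URI writeEventUri = URI.create("file:/data/A");
        // The read event still carries the URI of the table actually read (B).
        URI readEventUri = URI.create("file:/warehouse/B");
        // The equality the test asserts now fails:
        System.out.println(writeEventUri.equals(readEventUri)); // prints false
    }
}
```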
@cerveada Thanks this is helpful.
WRT the BasicIntegrationTests issue, I found something interesting. I'm still learning my way around the codebase, but I gathered I could find differences by running the test in 3.3 and then in 3.4 while printing out the logical plan in LineageHarvester with a quick print statement.
Interestingly enough, they look pretty close to the same, with a few new fields added in 3.4
Here comes the fun part: my print in LineageHarvester fires twice on this test in 3.3, but four times in 3.4!
In 3.3 this makes sense: the test runs and creates two lineage events, since there were two writes. There are two CreateDataSourceTableAsSelectCommand commands in the output.
In 3.4 it gets weird: the same two events appear, but two additional logical plans are created as well! Each write has an additional InsertIntoHadoopFsRelationCommand that corresponds to the data being inserted. Some of the semantics around this look a little different from the regular op mappings.
The test is then failing because the lineage captor for the second write is actually getting the second event for the first write. If I ignore the extra event (by calling another captor), it actually passes! I don't know if this is the right thing to do, given that Spline will be firing extra events.
I ran into this same issue while fixing another test; it appears Spark does this for CTAS into both regular Spark tables and Hive tables.
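To make the captor problem concrete, here is a toy sketch of the event ordering described above (command names are real Spark command classes, but the capture/filter code is purely illustrative, not the agent's real event model). With two interleaved extra events on 3.4, a test that grabs "the next event" for write 2 actually receives write 1's extra event; filtering by command name is one way around it:

```java
import java.util.List;
import java.util.stream.Collectors;

public class EventFilter {
    // Keep only the CTAS events, dropping the extra insert events seen on 3.4.
    static List<String> ctasOnly(List<String> commandNames) {
        return commandNames.stream()
                .filter("CreateDataSourceTableAsSelectCommand"::equals)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Order of captured events for two CTAS writes on Spark 3.4,
        // as described above (an extra insert event after each write).
        List<String> captured = List.of(
                "CreateDataSourceTableAsSelectCommand", // write 1
                "InsertIntoHadoopFsRelationCommand",    // extra on 3.4
                "CreateDataSourceTableAsSelectCommand", // write 2
                "InsertIntoHadoopFsRelationCommand");   // extra on 3.4
        System.out.println(ctasOnly(captured).size()); // prints 2
    }
}
```

Whether dropping the extra events is correct (versus teaching the agent to merge or suppress them) is exactly the open question raised above.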
Some great news:
Some ok news:
Going to look into the POM piece now; there is a PR #793 addressing what I've done thus far. Would appreciate a look to see if you're fine with this approach.
There seem to be binary-incompatible changes in the APIs of Delta and Spark SQL that the Spline core compiled against Spark 2.4 cannot work with, e.g. RDDPlugin.
Todo: