Nike-Inc / koheesio

Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.
https://engineering.nike.com/koheesio/
Apache License 2.0

[FEATURE] Ensure that we can support DBR 14.3LTS #33

Open dannymeijer opened 5 months ago

dannymeijer commented 5 months ago

Is your feature request related to a problem? Please describe.

N/A

Describe the solution you'd like

We should add support for DBR 14.3 LTS

This means we need compatibility with:

Additionally, we need to look at how Spark Connect changes things for us. Any direct reference to the JVM should be investigated. According to the docs, only shared cluster mode is affected.
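One way to make the "investigate direct JVM references" point actionable is to fail fast wherever the JVM is reached for. A minimal sketch, assuming nothing about Koheesio internals; `require_jvm` is a hypothetical helper name, not existing API:

```python
def require_jvm(spark):
    """Return the session's JVM gateway, or fail fast with a clear error.

    Spark Connect sessions expose no _jvm attribute, so any step that
    reaches into the JVM directly breaks there. Hypothetical helper,
    not existing Koheesio API.
    """
    jvm = getattr(spark, "_jvm", None)
    if jvm is None:
        raise RuntimeError(
            "Direct JVM access is unavailable on this session "
            "(Spark Connect / shared cluster mode)."
        )
    return jvm
```

Steps that need `spark._jvm` could call this guard up front instead of raising an opaque `AttributeError` deep inside the write path.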

Describe alternatives you've considered

N/A

Additional context

N/A

mikita-sakalouski commented 4 months ago

The whole idea was to introduce an internal Koheesio Spark session so that we can easily switch between remote and local modes.

Also, if I'm not mistaken, pydantic validates the SparkSession type based on its full import path, and a remote (Spark Connect) session is imported from a different path; at least it worked that way some time ago.
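The import-path issue above can be worked around by validating against a union of whichever session classes are importable. A hedged sketch (the helper name is hypothetical and this is not actual Koheesio code):

```python
from typing import Union


def build_spark_session_union():
    """Build the type a pydantic field could validate against so that
    both classic and Spark Connect sessions pass. Hypothetical helper,
    not actual Koheesio code.
    """
    members = []
    try:
        # Classic, JVM-backed session.
        from pyspark.sql import SparkSession
        members.append(SparkSession)
    except ImportError:
        pass
    try:
        # The Connect session class lives under a different import path,
        # which is why an isinstance/pydantic check against the classic
        # class alone rejects remote sessions.
        from pyspark.sql.connect.session import SparkSession as ConnectSparkSession
        members.append(ConnectSparkSession)
    except ImportError:
        pass
    if not members:
        return object  # pyspark not installed; nothing to validate against
    return members[0] if len(members) == 1 else Union[tuple(members)]
```

The try/except imports also keep the framework importable on environments where the `pyspark[connect]` extra is absent.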

pariksheet commented 4 months ago

https://docs.databricks.com/en/dev-tools/databricks-connect/python/limitations.html

pariksheet commented 3 months ago

Check the affected/referenced code within Koheesio.

Not available on Databricks Connect for Databricks Runtime 13.3 LTS and below:

Not available:

pariksheet commented 3 months ago

Run the unit tests locally against a spark-connect remote instance.
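Per the Spark Connect quickstart, pointing `SPARK_REMOTE` at `local` makes `SparkSession.builder.getOrCreate()` return a Connect session backed by an in-process server, which is one way to run the suite "remotely" on a laptop (the pytest target below is an assumption, not the actual Koheesio test layout):

```shell
# Prerequisite (not run here): install pyspark with the Connect extras
#   pip install "pyspark[connect]"

# Any subsequent SparkSession.builder.getOrCreate() now yields a
# Spark Connect session served by an in-process local server.
export SPARK_REMOTE="local"

# ...then run the suite as usual, e.g.:  pytest tests/
```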

pariksheet commented 3 months ago

Check how to manage SparkSession and DatabricksSession.
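One common pattern for managing the two is to prefer `DatabricksSession` when Databricks Connect is installed and fall back to a plain `SparkSession` otherwise. A sketch of that idea only; the function name is hypothetical and this is not Koheesio's actual session logic:

```python
def get_session():
    """Prefer DatabricksSession (Databricks Connect) when the runtime
    provides it, falling back to a plain SparkSession otherwise.
    Hypothetical sketch, not Koheesio's actual logic.
    """
    try:
        from databricks.connect import DatabricksSession
        return DatabricksSession.builder.getOrCreate()
    except ImportError:
        # No Databricks Connect available: use classic pyspark.
        from pyspark.sql import SparkSession
        return SparkSession.builder.getOrCreate()
```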

riccamini commented 3 months ago

I have added details related to the foreachBatch function here: https://github.com/Nike-Inc/koheesio/issues/56

If you prefer collecting everything here, I will copy-paste the comment and close that issue.

One additional point that I do not see in the list is DataFrame.rdd, which is used in some tests.

pariksheet commented 2 months ago

There is a way to check whether a Spark session is remote or native.

We should introduce an API/function that exposes this flag and checks it against the affected APIs (e.g. Delta merge, Snowflake), raising an exception where a feature is unsupported.
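The check described above can be done without importing the Connect classes at all, since Connect session types are defined under `pyspark.sql.connect.*`. A minimal sketch; both helper names are hypothetical, not existing Koheesio API:

```python
def is_remote_session(spark) -> bool:
    """Return True for a Spark Connect (remote) session.

    Connect session classes are defined under pyspark.sql.connect.*,
    so the defining module path distinguishes remote from native
    sessions without importing pyspark here. Hypothetical helper.
    """
    return type(spark).__module__.startswith("pyspark.sql.connect")


def assert_supported(spark, feature: str) -> None:
    """Raise before a JVM-dependent API (e.g. Delta merge, the
    Snowflake writer) runs against a remote session."""
    if is_remote_session(spark):
        raise RuntimeError(
            f"{feature} is not supported on a Spark Connect session"
        )
```

Steps wrapping unsupported APIs could call `assert_supported(spark, "Delta merge")` at execution time to turn a confusing downstream failure into a clear, early exception.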

pariksheet commented 1 month ago

Use snowflake-connector-python instead of spark._jvm.
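A hedged sketch of that swap: talking to Snowflake through the pure-Python connector rather than the JVM-backed utilities behind `spark._jvm`. The function name and connection parameters are placeholders, not Koheesio code:

```python
def run_snowflake_query(query: str, **conn_params):
    """Execute a query via snowflake-connector-python rather than the
    JVM-backed utilities behind spark._jvm, which are unavailable on
    Spark Connect. conn_params (account, user, password, ...) are
    placeholders; the function name is hypothetical.
    """
    import snowflake.connector  # pip install snowflake-connector-python

    conn = snowflake.connector.connect(**conn_params)
    cur = conn.cursor()
    try:
        cur.execute(query)
        return cur.fetchall()
    finally:
        cur.close()
        conn.close()
```

Because the connector is pure Python, this path works identically on native, Connect, and shared-mode clusters.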

dannymeijer commented 2 weeks ago

All of these should be addressed as part of release 0.9.0 (currently in pre-release). Please verify your use cases accordingly so we can proceed with the release.