Nike-Inc / koheesio

Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.
https://engineering.nike.com/koheesio/
Apache License 2.0

[FEATURE] Ensure that we can support DBR 14.3LTS #33

Open dannymeijer opened 3 months ago

dannymeijer commented 3 months ago

Is your feature request related to a problem? Please describe.

N/A

Describe the solution you'd like

We should add support for DBR 14.3 LTS

This means we need compatibility with:

Additionally, we need to look at how Spark Connect changes things for us. Any direct reference we have to the JVM should be investigated. According to the docs, only Shared cluster mode is affected.

Describe alternatives you've considered

N/A

Additional context

N/A

mikita-sakalouski commented 1 month ago

The whole idea was to introduce an internal Koheesio Spark session to provide an easy switch between remote and local modes.
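For illustration, a minimal sketch of such a switch (the helper name is hypothetical, not Koheesio's actual API): build a remote session when a Spark Connect URL is provided, and a plain local session otherwise.

```python
import os

from pyspark.sql import SparkSession


def get_spark_session():
    """Return a Spark Connect (remote) session when SPARK_REMOTE is set,
    otherwise a regular local SparkSession.

    Requires pyspark[connect] to be installed for the remote case.
    """
    remote = os.environ.get("SPARK_REMOTE")  # e.g. "sc://localhost:15002"
    builder = SparkSession.builder
    if remote:
        builder = builder.remote(remote)
    return builder.getOrCreate()
```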

Also, if I'm not mistaken, pydantic checks the SparkSession type based on the full import path, and the remote Spark session is imported from a different path; at least that was the case some time ago.
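A sketch of the import-path issue and one possible workaround, assuming pydantic v2 and the `pyspark.sql.connect` module being available; the model name below is illustrative, not Koheesio's actual code:

```python
from typing import Union

from pydantic import BaseModel, ConfigDict
from pyspark.sql import SparkSession

try:
    # The Spark Connect session is a different class under a different path.
    from pyspark.sql.connect.session import SparkSession as ConnectSparkSession
except ImportError:  # pyspark[connect] extras not installed
    ConnectSparkSession = SparkSession


class SparkStep(BaseModel):
    # arbitrary_types_allowed lets pydantic accept non-pydantic classes;
    # the Union makes both the classic and the remote session pass validation.
    model_config = ConfigDict(arbitrary_types_allowed=True)

    spark: Union[SparkSession, ConnectSparkSession]
```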

pariksheet commented 1 month ago

https://docs.databricks.com/en/dev-tools/databricks-connect/python/limitations.html

pariksheet commented 1 month ago

Check the affected/reference code within Koheesio.

Not available on Databricks Connect for Databricks Runtime 13.3 LTS and below:

Not available:

pariksheet commented 1 month ago

Run the unit tests locally against a spark-connect remote instance.
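For example, a pytest fixture along these lines could point the suite at a local Spark Connect server (this assumes a Connect server is already listening on `sc://localhost:15002`, e.g. started via Spark's `start-connect-server.sh`, and that `pyspark[connect]` is installed):

```python
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    # Connect to the locally running Spark Connect server instead of
    # spinning up a classic in-process SparkSession.
    session = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
    yield session
    session.stop()
```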

pariksheet commented 1 month ago

Check how to manage SparkSession and DatabricksSession.
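A rough sketch of one way to reconcile the two, assuming `databricks-connect` is an optional dependency: prefer `DatabricksSession` when it is importable, and fall back to a plain `SparkSession` otherwise.

```python
from pyspark.sql import SparkSession


def get_session():
    """Return a DatabricksSession when databricks-connect is available,
    otherwise a regular SparkSession. Illustrative only."""
    try:
        from databricks.connect import DatabricksSession

        return DatabricksSession.builder.getOrCreate()
    except ImportError:
        # databricks-connect not installed: fall back to plain Spark.
        return SparkSession.builder.getOrCreate()
```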

riccamini commented 1 month ago

I have added details related to the foreachBatch function here: https://github.com/Nike-Inc/koheesio/issues/56

If you prefer collecting everything here, I will copy-paste the comment and close that issue.

One additional point that I do not see in the list is DataFrame.rdd, which is used in some tests.
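Since the RDD API is not available on Spark Connect, those tests would need a rewrite; for small test DataFrames, comparing collected rows is one Connect-compatible alternative (illustrative helper, not part of Koheesio):

```python
def assert_same_rows(df_actual, df_expected):
    # Avoids DataFrame.rdd entirely: collect() works on both classic and
    # Connect sessions. Suitable only for small test-sized DataFrames.
    assert sorted(df_actual.collect()) == sorted(df_expected.collect())
```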

pariksheet commented 26 minutes ago

There is a way to check whether a Spark session is remote or native.

We should introduce an API/function that exposes this flag and checks it against the specific APIs (e.g. Delta merge, Snowflake), raising an exception where they are not supported.
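A minimal sketch of such a helper (names are illustrative): detect whether the active session is a Spark Connect session via an isinstance check against `pyspark.sql.connect.session.SparkSession`, and raise for features known not to work there.

```python
def is_remote_session(spark) -> bool:
    """Return True if the given session is a Spark Connect (remote) session."""
    try:
        from pyspark.sql.connect.session import SparkSession as ConnectSparkSession
    except ImportError:
        # pyspark[connect] not installed, so the session cannot be remote.
        return False
    return isinstance(spark, ConnectSparkSession)


def require_classic_session(spark, feature: str) -> None:
    """Raise if a JVM-backed feature is used on a remote session."""
    if is_remote_session(spark):
        raise RuntimeError(
            f"{feature} is not supported on a Spark Connect (remote) session"
        )


# Example usage before calling a JVM-backed API:
# require_classic_session(spark, "DeltaTable.merge")
```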