delta-io / delta-examples

Delta Lake examples

Install Delta with JAR #3

Closed MrPowers closed 1 year ago

MrPowers commented 2 years ago

@dennyglee - Great question. Here's my understanding of the Python dependency management situation:

From what I've seen, for web projects something like Poetry is definitely the best; it gives you deterministic, reproducible environments.

For Python data projects, conda is the best. I can send you this YAML file, you can run conda env create -f envs/pyspark-322-delta-121.yml, and you'll get a virtual environment that's roughly equivalent to mine. You'll definitely have PySpark 3.2.2 and Delta 1.2.1 in that environment, but there's no guarantee that conda will resolve the other, unpinned dependencies the same way on your end.

Also note that conda and pip aren't mutually exclusive. You can pip install into a conda environment. That's what's being done in this environment file:

name: pyspark-322-delta-121
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.9
  - pyspark=3.2.2
  - pip
  - pip:
    - delta-spark==1.2.1

We're pip installing delta-spark inside the conda environment because delta-spark is only published to PyPI and isn't available from a conda channel.
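
For reference, once that environment is activated, a Delta-enabled SparkSession can be built from Python roughly like this. It's a minimal sketch following the delta-spark 1.2.x quickstart; the app name and table path are just placeholders:

# Minimal sketch (assumes the pyspark-322-delta-121 env above is activated).
# configure_spark_with_delta_pip adds the io.delta:delta-core JAR that matches
# the pip-installed delta-spark version via spark.jars.packages.
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (
    SparkSession.builder.appName("delta-example")  # arbitrary app name
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Quick smoke test: write and read back a tiny Delta table.
spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta-demo")
spark.read.format("delta").load("/tmp/delta-demo").show()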

Sorry for the rambling response, haha. Feel free to ask more questions. I am still learning about this.

dennyglee commented 2 years ago

Not at all - this is extremely informative. I'm wondering if we should create an issue to add delta-lake to conda-forge then, eh?!

bjornjorgensen commented 2 years ago

@dennyglee +1

https://github.com/jupyter/docker-stacks/issues/1746 We are waiting for a new release there. One of the things holding me back from using Delta is that it doesn't follow Spark releases. CC @mathbunnyru @Bidek56