delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0

[Feature Request] Publish delta-spark to Conda #1063

Closed boonware closed 1 year ago

boonware commented 2 years ago

Feature request

Overview

Publish the Python delta-spark package to a public Conda channel so that users of Conda can use Delta Lake. As of now, the package is only available on PyPI.

Motivation

Many users, particularly in data science, leverage Conda for package management.

Further details

Some packages are already available on Conda Forge, including delta-sharing.

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

boonware commented 2 years ago

@vkorukanti Would this be straightforward to put in place?

vkorukanti commented 2 years ago

> @vkorukanti Would this be straightforward to put in place?

Sorry, I don't have full context on what is required for publishing to Conda. We already publish to PyPI. Can we extend that to do the same for Conda?

boonware commented 2 years ago

Conda is a separate package management solution. The following might help you get started, assuming Conda Forge is the right place to publish the package (?).

1. Install Conda, see here.
2. Define the Conda project by creating an environment file called `env.yaml` with the following contents:

   ```yaml
   name: delta-spark
   channels:
     - conda-forge
   dependencies:
     - conda-build
   ```

3. Create the Conda environment for the project: `conda env create -f env.yaml`
4. The above is sufficient for creating a project, but in order to build and publish it we also need a recipe/meta file. Conda Forge mentions using grayskull to generate a recipe template for you, see here. The template then needs to be filled in with project-specific info.
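As a concrete sketch of steps 2 and 3, the env file can be written and sanity-checked from the shell (conda itself is assumed to be installed separately; the path is illustrative):

```shell
# Write the env.yaml from step 2 above.
cat > env.yaml <<'EOF'
name: delta-spark
channels:
  - conda-forge
dependencies:
  - conda-build
EOF

# Sanity-check that the file was written as expected.
grep -q 'conda-build' env.yaml && echo "env.yaml ready"

# Then create and activate the environment (requires conda on PATH):
#   conda env create -f env.yaml
#   conda activate delta-spark
```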
scholer commented 2 years ago

Having delta-spark available on conda-forge would be fantastic.

We often use conda-forge to set up environments. The Conda ecosystem still has advantages over plain PyPI, particularly for more complex environments or when the environment requires non-Python packages. With conda, I can create a complete environment for running Spark, including e.g. OpenJDK and other things that are not available on PyPI. If I wanted to do the same thing without conda, I would probably have to resort to Docker. And if I am using conda/mamba to set up my environment, it is best if as many packages as possible are available on conda-forge, so that conda/mamba can find the best dependency solution.
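To illustrate that point, a single conda environment file can pin the JVM alongside the Python stack, something pip alone cannot do. A hypothetical sketch (the environment name and version pins are illustrative):

```yaml
name: spark-env
channels:
  - conda-forge
dependencies:
  - python=3.9
  # conda-forge ships the JVM itself, which pip cannot provide
  - openjdk=11
  - pyspark=3.2.2
```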

tdas commented 2 years ago

I agree that Conda is more powerful than PyPI here. It would be great if someone could contribute this.

MrPowers commented 2 years ago

I agree that it'd be great to have delta-spark in Conda. In the meantime, just posting an example environment file demonstrating how delta-spark can be installed via pip into a Conda environment:

```yaml
name: pyspark-322-delta-121
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.9
  - pyspark=3.2.2
  - pip
  - pip:
    - delta-spark==1.2.1
```
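Once such an environment is active, Delta still has to be enabled on the Spark session. The two settings below are the ones the Delta Lake quickstart documents; the dict is just for illustration, and the commented-out builder code is a sketch assuming pyspark and delta-spark are installed:

```python
# The two Spark settings Delta Lake's quickstart documents for
# enabling Delta on a SparkSession.
delta_conf = {
    "spark.sql.extensions":
        "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog":
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}

# With pyspark and delta-spark installed, these would typically be
# applied like so:
#   from pyspark.sql import SparkSession
#   from delta import configure_spark_with_delta_pip
#   builder = SparkSession.builder.appName("app")
#   for k, v in delta_conf.items():
#       builder = builder.config(k, v)
#   spark = configure_spark_with_delta_pip(builder).getOrCreate()
for key, value in delta_conf.items():
    print(f"{key}={value}")
```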
shubhamp051991 commented 2 years ago

The setup.py file references a version.sbt file, which is missing from the PyPI delta-spark distribution; that will create an issue if we create the meta.yaml for conda-forge. Any guidance on this would be helpful, and I can work on this as a first issue.
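The missing version.sbt matters because the build reads the release version out of it. A hypothetical, stdlib-only sketch of that kind of lookup (the helper name and exact regex are assumptions, not delta's actual code):

```python
import re

def read_sbt_version(text: str) -> str:
    """Extract the quoted version from an sbt-style version file,
    e.g. a line like: version in ThisBuild := "2.1.1"
    """
    match = re.search(r':=\s*"([^"]+)"', text)
    if match is None:
        raise ValueError("no version assignment found")
    return match.group(1)

print(read_sbt_version('version in ThisBuild := "2.1.1"'))  # 2.1.1
```

A build running from the PyPI sdist would hit the `ValueError` path (or an earlier file-not-found error), which is why the file has to be supplied some other way.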

scottsand-db commented 2 years ago

@MrPowers - do you know the latest status of this? I recall you saying you knew a member of the community that was interested in working on this? Can you follow up with them and/or assign the issue to them?

KevinAppelBofa commented 2 years ago

@MrPowers @scottsand-db is there any update on this? I am adding Delta Lake for the first time for our group, starting with Spark 3.3.1, and it would be great if we could get this from conda rather than having to do a pip pull. I just built it myself so I can set it up in our conda environment; it would be great to have a conda version of Delta posted by the time Spark 3.4 is out.

Based on the last release, the version is 2.1.1. I do a git clone, find the commit tied to this tag, and use that in the meta.yaml; after this you just run the conda-build command, i.e. `conda-build delta`.

Hopefully this helps with getting the package built; I'm not sure how packages get uploaded to conda-forge, though.

meta.yaml:

```yaml
{% set name = "delta" %}
{% set version = "2.1.1" %}

package:
  name: "{{ name|lower }}"
  version: "{{ version }}"

source:
  git_url: "/scratch/fromgit-branch-2.1.1/delta"
  git_rev: d8c4fc17c25d6b5e0e9b3ebe1ff4cba39ecb39c5

build:
  number: 0
  noarch: python
  script: "{{ PYTHON }} setup.py install"

requirements:
  host:
    - python=3.9
    - pyspark=3.3.1
    - importlib_metadata
  run:
    - python=3.9
    - pyspark=3.3.1
    - importlib_metadata

test:
  imports:
    - delta

about:
  home: https://github.com/delta-io/delta
  summary: An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
```
nkarpov commented 1 year ago

I've submitted a PR https://github.com/conda-forge/staged-recipes/pull/21556 for this. It's passing all the tests so just waiting for a review now.

@shubhamp051991 I was able to resolve that issue (and another similar one with another missing file) by adding the required files as additional sources in the conda meta.yaml (refer to the PR for more details)
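For reference, conda-build accepts a list of entries in the `source` section, which is what makes this approach possible. A hypothetical sketch of that shape (the URLs and folder layout are illustrative; the actual recipe is in the linked PR):

```yaml
source:
  # Primary source: the sdist published to PyPI
  - url: https://pypi.io/packages/source/d/delta-spark/delta-spark-{{ version }}.tar.gz
  # Additional source: a file the sdist does not ship, e.g. version.sbt
  - url: https://raw.githubusercontent.com/delta-io/delta/v{{ version }}/version.sbt
    folder: .
```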

Once the PR is reviewed and the package is live for the most recent release, we can add something in this repo to generate the conda meta.yaml for future releases.

MrPowers commented 1 year ago

Here's the conda-forge link to the package: https://anaconda.org/conda-forge/delta-spark

dennyglee commented 1 year ago

Perhaps we should also create a new feature request to publish deltalake to conda as well?

scottsand-db commented 1 year ago

@nkarpov isn't this done?

zsxwing commented 1 year ago

Closing this. We have published it to conda: https://anaconda.org/conda-forge/delta-spark