Point72 / csp

csp is a high performance reactive stream processing library, written in C++ and Python
https://github.com/Point72/csp/wiki
Apache License 2.0

Onboard to conda-forge #10

Closed by ngoldbaum 8 months ago

ngoldbaum commented 9 months ago

Right now csp bootstraps the majority of its build dependencies using vcpkg. However, if we want to eventually add a build of csp to conda-forge, we will need to define a conda recipe that gets its build dependencies from conda-forge instead.

From vcpkg.json we have:

{
    "name": "main",
    "version-string": "latest",
    "dependencies": [
      "abseil",
      "arrow",
      "brotli",
      "exprtk",
      "gtest",
      "librdkafka",
      "lz4",
      "parquet",
      "protobuf",
      "rapidjson",
      "thrift",
      "utf8proc"
    ],
    "overrides": [
      { "name": "arrow", "version": "7.0.0"}
    ],
    "builtin-baseline": "288e8bebf4ca67c1f8ebd49366b03650cfd9eb7d"
  }

This corresponds to the following conda-forge dependencies:

abseil-cpp
arrow-cpp 7.0.0
brotli
exprtk
gtest
librdkafka
lz4
libparquet
protobuf
rapidjson
thrift
utf8proc
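
If the C++ dependencies do end up coming from conda-forge rather than vcpkg, a rough sketch of the corresponding host section of a conda recipe might look like this (package names taken directly from the list above; the arrow-cpp pin only mirrors the current vcpkg override):

host:
  - abseil-cpp
  - arrow-cpp 7.0.0
  - brotli
  - exprtk
  - gtest
  - librdkafka
  - lz4
  - libparquet
  - protobuf
  - rapidjson
  - thrift
  - utf8proc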

The main problem I see in this dependency list is the pin on arrow-cpp 7.0.0. It looks like conda-forge hasn't updated that version in 1 year 8 months. Version 7.0.1 was updated a little more recently, a year ago.

Looking at the older arrow-cpp packages on conda-forge, they depend on Python, so the pin would also restrict us from supporting newer Python versions: there are no arrow-cpp 7.0.0 packages available for Python 3.11 or 3.12. It looks like arrow-cpp dropped its Python dependency starting with arrow-cpp 10.

I don't have context for why the pin on arrow is set the way it is. Can it be loosened at all?

timkpaine commented 9 months ago

Build-time C++ dependencies will come from vcpkg even after onboarding to conda-forge. Our conda-forge recipe will look like:

{% set name = "csp" %}
{% set version = "0.1.0" %}

package:
  name: {{ name|lower }}
  version: {{ version }}

source:
  url: https://pypi.io/packages/source/{{ name [0] }}/{{ name }}/{{ name }}-{{ version }}.tar.gz
  sha256: <TBD>

build:
  number: 0
  script:
    - {{ PYTHON }} -m pip install . -vv  # [not win]
    - set PSP_GENERATOR=Visual Studio 16 2019  # [win]
    - {{ PYTHON }} setup.py build_ext install --single-version-externally-managed  --record=record.txt  # [win]
requirements:
  build:
    - python                                 # [build_platform != target_platform]
    - cross-python_{{ target_platform }}     # [build_platform != target_platform]
    - numpy                                  # [build_platform != target_platform]
    - {{ compiler('c') }}
    - {{ compiler('cxx') }}
    - cmake
    - make  # [not win]
  host:
    - numpy
    - pip
    - python
  run:
    - python
    - {{ pin_compatible('numpy') }}
    - pandas <2.2
    - psutil
    - pytz
    - sqlalchemy <2

test:
  imports:
    - csp

about:
  home: http://github.com/point72/csp
  license: Apache-2.0
  license_family: Apache
  license_file: LICENSE
  summary: csp is a high performance reactive stream processing library, written in C++ and Python
  description: csp is a high performance reactive stream processing library, written in C++ and Python
  dev_url: https://github.com/point72/csp

extra:
  recipe-maintainers:
    - timkpaine

Will open a PR there once we're publicly released and published.

ngoldbaum commented 9 months ago

The dependencies in pyproject.toml correspond to the following conda-forge package list:

Build requirements:

cmake
numpy
pyarrow >=7.0.0
ruamel.yaml
scikit-build

Runtime dependencies:

numpy
pandas
psutil
pyarrow >= 7.0.0
pytz
sqlalchemy < 2

Development dependencies:

bump2version >= 1.0.0
graphviz
httpx
isort
ruamel.yaml
ruff
twine
wheel

Test dependencies:

polars
psutil
pytest
pytest-cov
pytest-sugar
scikit-build
sqlalchemy<2

I don't foresee any packaging issues related to these; all appear to be packaged on conda-forge already.
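
As a rough, non-authoritative sketch of how the Python-level dependencies above could slot into the recipe skeleton posted earlier (compilers, cmake, and make are already covered in its build section), the host and test sections might gain something like:

host:
  - numpy
  - pip
  - pyarrow >=7.0.0
  - python
  - ruamel.yaml
  - scikit-build

test:
  requires:
    - polars
    - psutil
    - pytest
    - pytest-cov
    - pytest-sugar
    - scikit-build
    - sqlalchemy <2
  imports:
    - csp
  commands:
    - pytest --pyargs csp  # assumes the tests ship inside the installed package; adjust if they only live in the repo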

ngoldbaum commented 9 months ago

Build-time C++ dependencies will come from vcpkg even after onboarding to conda-forge.

Unfortunately, I don't think conda-forge allows this. I don't see any extant feedstock getting its dependencies from vcpkg. Will double-check on this, though.

timkpaine commented 9 months ago

They do. In general they don't like you bringing in your own external libraries, but the reasoning is ABI compatibility. If you naively bring in a bunch of shared libraries, for example, you're going to have a bad time, but it's somewhat common to vendor your own deps and not re-export them (which is often necessary, since conda-forge also doesn't like to provide static libraries).

ngoldbaum commented 9 months ago

I asked a conda-forge maintainer about this; their response was:

Build dependencies need to be provided as conda packages in almost all cases. Rare exceptions can be made for build dependencies that will not make it to the produced artifact (e.g. no compilers, remnants in the linked libraries). It could be ok for things like “I need this tool to autogenerate these headers before running make”, but I can’t think of other good reasons.

timkpaine commented 9 months ago

This is the way we'll be doing it. We do similar things elsewhere on existing conda-forge packages. Thanks for the context though!

ngoldbaum commented 9 months ago

OK, thanks. I'm mainly looking for examples; can you share which packages you're familiar with that use vcpkg for vendored dependencies?

timkpaine commented 9 months ago

Micromamba vendors with vcpkg: https://github.com/conda-forge/micromamba-feedstock

Ray vendors deps (in a bad way): https://github.com/conda-forge/ray-packages-feedstock

I have a few examples myself that vendor things as headers, e.g. https://github.com/conda-forge/perspective-feedstock

I'm sure there are more examples people can find.

It's also pretty common to find issues with providing static libraries: https://github.com/conda-forge/boost-feedstock/issues/159

Which is in line with: https://github.com/conda-forge/cfep/blob/main/cfep-18.md

In general, it's difficult to make the conda build and the PyPA sdist build play nicely when they want exactly opposite things (with conda, all your deps come from conda; with PyPA, you don't have any deps unless they're installed on the machine), so statically linking your deps without re-exporting them usually keeps your build sane and compatible in both ecosystems without running afoul of some of these thorns.

ngoldbaum commented 9 months ago

@timkpaine reading over your conda recipe above - how is it able to build csp without g++ on a mac? Or did you never try doing that?

timkpaine commented 9 months ago

@ngoldbaum it's not going to work for mac 😆 we'll just get linux going for now

timkpaine commented 9 months ago

Note that after https://github.com/Point72/csp/pull/22, we will need:

run_constrained:
    - slack-sdk >=3

ref: https://github.com/Point72/csp/issues/35
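
(For placement: run_constrained sits under requirements in the recipe, so the addition would slot in roughly like the following sketch, with the existing run deps elided.)

requirements:
  run:
    - python  # existing run dependencies elided for brevity
  run_constrained:
    - slack-sdk >=3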

isuruf commented 9 months ago

Micromamba is the only package I know of that uses vcpkg, and that's only because of Windows and issues with building static libraries across Visual Studio versions.

I'm not convinced that using vcpkg and static libraries is better here.

timkpaine commented 9 months ago

@isuruf many Bazel-based projects bring in external deps themselves, as does e.g. every single Rust/Python package. I don't think the argument is that it's preferable in the absolute sense (it makes for longer builds), but allowing a package to either use vcpkg or detect system packages will result in a LOT of separate, brittle CMake code for each of the dependencies, all of which needs to be maintained in parallel with little to no runtime benefit.

isuruf commented 9 months ago

Yes, Bazel projects and Rust projects bring in lots of external deps themselves, and we have lots of issues with Bazel projects due to that. (Rust projects less so, because they mostly use Rust external deps.)

The issue with vendoring is that the project has to maintain the dependencies itself. When there are updates, bug fixes, or security fixes to a dependency, the package needs to be rebuilt instead of just the dependency being updated, which means that only the most recent versions get those fixes. It also results in larger packages and longer builds.

timkpaine commented 9 months ago

@isuruf yep, of course. I think the tradeoff is maintaining two sets of dependency-management scripts (e.g. one for the vcpkg toolchain, which is necessary on certain platforms like Windows, and Mac when cross-compiling, and one for "external deps" including conda) vs. staying up to date on deps (e.g. we link against OpenSSL, and OpenSSL 3 had vulnerabilities recently). For active projects, dealing with shifting dependencies and non-normalized deps where you need to write and maintain many of your own FindPackage.cmake files, the former is worse than the latter. From a conda-forge perspective, we'll have longer build times, but we shouldn't have the incompatibilities of e.g. Bazel packages, thanks to the linkage mechanism. Also, our build time is much shorter than many other packages I maintain on conda-forge, so I wouldn't expect that to be a barrier to onboarding.