benetech / VideoDeduplication

GNU General Public License v3.0

Organize mono-repo & extract ML core to separate PyPI-distributable library. #295

Open stepan-anokhin opened 3 years ago

stepan-anokhin commented 3 years ago

Problem

Currently the repository contains multiple applications with some shared logic but generally different dependencies:

The API server and repo-admin require some of the dependencies from winnow, but not all of them. Some of the reusable parts are extracted into packages placed at the repository root (e.g. task_queue, db). There are also a lot of files at the root that relate only to the deduplication app and not to the rest of the applications.

Problems:

As a result, our monorepo gets disorganized, and as we add more complexity the above problems will get worse.

Goals

Improve monorepo organization so that:

Possible solution:

We can consider the approach described in https://medium.com/opendoor-labs/our-python-monorepo-d34028f2b6fa. A working example can be found here: https://github.com/ya-mori/python-monorepo

The difficult part is that the ML code uses the conda dependency manager.

stepan-anokhin commented 3 years ago

Findings

poetry's support for monorepos is not complete yet, but it is being actively discussed at the moment (see the corresponding feature request https://github.com/python-poetry/poetry/issues/936). poetry does seem to support some monorepo features, though: namely, it allows mixing versioned and editable local path dependencies (see the corresponding comment https://github.com/pypa/packaging.python.org/issues/506#issuecomment-391140122). I've tested this approach and it seems to work well: all projects/libs use editable installs from the current codebase, while build artifacts have versioned dependencies. So this is good news.
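To make the editable-vs-versioned distinction concrete, here is a minimal Python sketch. It is not poetry's actual mechanism, only an illustration of the idea: the same local package resolves to an editable path install during development and to a pinned version in a build artifact. The `LocalPackage` class, the `requirement` helper, the version numbers and the `libs/...` paths are all hypothetical; only the package names `db` and `task_queue` come from the repository. The output lines use pip's requirements-file syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalPackage:
    """A package that lives inside the monorepo (hypothetical model)."""
    name: str     # distribution name, e.g. "db"
    version: str  # version pinned in build artifacts (made up here)
    path: str     # location relative to the repo root (made up here)


def requirement(pkg: LocalPackage, mode: str) -> str:
    """Return a pip-style requirement line for the given mode.

    "dev"   -> editable install from the working tree
    "build" -> pinned version for a distributable artifact
    """
    if mode == "dev":
        return f"-e ./{pkg.path}"
    if mode == "build":
        return f"{pkg.name}=={pkg.version}"
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical layout; only the package names come from the repository.
PACKAGES = [
    LocalPackage("db", "0.1.0", "libs/db"),
    LocalPackage("task_queue", "0.1.0", "libs/task_queue"),
]

dev_requirements = [requirement(p, "dev") for p in PACKAGES]
build_requirements = [requirement(p, "build") for p in PACKAGES]

print(dev_requirements)
print(build_requirements)
```

In development every project picks up the current codebase through the editable lines, while a published artifact would carry only the pinned lines.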

Remaining Challenges

Investigate how to manage conda dependencies in ML-related packages. Some of the projects (e.g. server) share some logic with the dedup-app but don't need ML dependencies or conda at all, so they could rely only on poetry and Python's packaging standards. At the same time, for ML-related projects (e.g. dedup-app) it is nice to have conda packages, as they come pre-compiled and all necessary .so libraries come with the conda installation out of the box. We need to figure out how to resolve this contradiction: either some of the poetry projects need to depend on conda projects, or some of the conda projects need to depend on poetry projects, or one of the dependency management systems should be dropped in favor of the other.

Some Related Links

stepan-anokhin commented 3 years ago

Possible solution:

Rationale: We already do something similar by placing the db package at the repository root.

Links: