Open KOLANICH opened 3 years ago
What functionality do you suggest we make optional?
`numpy`, `pandas`: relevant mostly for data scientists.
`SQLAlchemy`: not relevant for everyone who writes their own queries manually.
`six`, `pyparsing`, `packaging`, `pluggy`, `protobuf`, `py`, `dateutil`, `pytz`, `wcwidth`, `more-itertools`: not used directly at all; these can and should simply be removed.
Also: the `<` part of `>=X; <Y` constraints should almost never be used, for the same reason.

I think on the version pinning I mostly agree with you; this article lays out your case. We actually originally built this package mostly to run in Docker containers, hence the widespread use of pinning. It's now aimed more at being "reusable" rather than "reproducible", so that's a good shout.
As far as dependencies go, we will remove anything that's not being used; agreed that it shouldn't be in there. `SQLAlchemy`, on the other hand, is used for our database interactions, and Dolt is a SQL database, so it's important for the `doltpy.sql` package, as the higher-level tools are built on top of that package.
Can I ask what your use-case is? Obviously we want to develop the package in ways that make sense for our users, so any help you can give me here would be appreciated.
@max-hoffman I think the action items here are:
Can I ask what your use-case is?
Not quite a use case; I don't use this currently. I just saw news about a release on an IT-related resource and thought it might make sense to play around with this DB a bit.

Currently I use SQLite for storage of datasets. I have no use cases for versioning at present, because schema migration is done manually. Usually it happens as follows: rows are inserted in batches, and every N rows `commit` is called; N is picked so that the time between iterations is about a second. It is an extremely long process and can take a week of constant work on an HDD. The data has a primary key of 2 columns, but I have done a nasty hack: I packed the 2 into a single integer, which is the rowid. Queries by rowid are fast in SQLite, but random inserts are not. I have written a new importer that performs much better, because it reduces random writes by using some RAM to pool close rowids together, but I haven't tested it enough to rely on it in prod.

@KOLANICH this partial solution might be of interest to you: https://github.com/dolthub/doltcli
It's a bit easier for me to carve out a full-featured, dependency-less subsection of doltpy for engineers than to rewrite the data-science portions.
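As a side note, the rowid-packing hack described above (two integer key columns folded into a single SQLite rowid) could look roughly like the sketch below; the 32/32-bit split is an assumption for illustration, not something stated in the thread:

```python
MASK32 = (1 << 32) - 1  # low 32 bits

def pack_rowid(hi: int, lo: int) -> int:
    """Pack a 2-column integer primary key into one 64-bit rowid so
    SQLite can use its fast integer-rowid lookups. The 32/32 split
    is assumed for illustration."""
    assert 0 <= hi <= MASK32 and 0 <= lo <= MASK32
    return (hi << 32) | lo

def unpack_rowid(rowid: int) -> tuple[int, int]:
    """Inverse of pack_rowid: recover the two key columns."""
    return rowid >> 32, rowid & MASK32
```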
Thanks. Though I don't understand why it is called `doltcli` if it contains no CLI tool. It may make sense to rename that one to just `dolt` (and the GitHub repo to `dolt.py`), and this one to `dolt-datascience`.
It's an internal distinction we make between using dolt's CLI and SQL interfaces. In the future we'd like to use a raw SQL session with pymysql or another connector to implement all of the version control features. Right now shelling out to the bash/CLI commands is the most reliable way for people to use dolt in python.
You're right that the code in the python package isn't a CLI -- but that's what dolt is for, and also the PyPI package name `dolt` is taken.
Bumping this topic: we use Dolt in a DS project, and while it's OK that it has all the dependencies it has, it is inconvenient to have many of them pinned at some old version.

So far we've been selecting `doltpy == 2.0.4`, as this was the best version `poetry` resolved for us given our other dependencies. Now we would like to use new `doltpy` features (e.g. https://github.com/dolthub/doltpy/pull/183), and if we force `doltpy >= 2.0.13` we get them, but we also get lots of dependencies downgraded, which is a pain, as some might be non-patched versions with potential issues.
What specific `doltpy` features do you like?

Our strategy is to move most `doltpy` functionality into SQL itself, so you can use a standard SQL connector or an ORM like `sqlalchemy` to access Dolt through a Python interface. As you can tell, a lot of the extra functionality of `doltpy` is a little hard to maintain dependency-wise.
We're using `doltpy` mostly as a Python API for Dolt ;] We mostly use the data-access API: `write_pandas`, `read_pandas`, `read_pandas_sql`, `Dolt.sql`, but also some utilities like `Dolt.ls()`, `Dolt.branch()`, `Dolt.push`.

I probably don't see all the complications, but it looks to me like it just requires pandas and CLI wrapping. It's just that we don't want to write this functionality on our own.
I don't see that many dependencies in https://github.com/dolthub/doltpy/blob/main/pyproject.toml#L10 - wouldn't it just work if you made the version constraints `>=`?
So, `doltpy` allows you to do those things without running a server. Is that what you want?

If you start a `dolt sql-server` and then connect using a MySQL connector, all the equivalent functions exist; you just send SQL queries instead. That's the method we support more actively. It has a number of advantages:
The disadvantage is that you have to run a `dolt sql-server`, so there is a bit more of a bootstrapping problem when interacting with Dolt.
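For concreteness, a hedged sketch of the server route: once a `dolt sql-server` is running, any MySQL connector works, and version control becomes plain SQL. The `dolt_call` helper below is hypothetical; `DOLT_COMMIT` and the `dolt_log` system table come from Dolt's SQL interface:

```python
def dolt_call(procedure: str, *args: str) -> str:
    """Hypothetical helper: build a CALL statement for one of
    Dolt's version-control stored procedures (e.g. DOLT_COMMIT,
    DOLT_CHECKOUT), single-quoting each string argument."""
    quoted = ", ".join("'" + a.replace("'", "''") + "'" for a in args)
    return "CALL {}({});".format(procedure, quoted)

# With a running `dolt sql-server`, any MySQL connector can send these,
# e.g. (sketch, not executed here):
#   import pymysql
#   conn = pymysql.connect(host="127.0.0.1", user="root", database="mydb")
#   with conn.cursor() as cur:
#       cur.execute("SELECT * FROM dolt_log")  # commit history
#       cur.execute(dolt_call("DOLT_COMMIT", "-am", "update tables"))
```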
Sorry to distract from the core of this ticket. I just want people who show up here to understand what our priorities are.
In parallel, I'm looking for a resource internally to take more active ownership of `doltpy`. The person who conceived of it and built it has left the company.
@kretes Without more info this could require a lot of back and forth. Depending on the context in which you are using `doltpy`, these options can unblock you immediately without action on our part:

1) Use pip with `--no-deps`; you get full control, and as long as this is an internal app your dependencies do not have to be externally consistent like a library's. I see that you are using `poetry`, not pip, and I can understand that this is undesirable if you are writing a library package.
2) If you are writing a library and need to be dependency-less, the pandas helper functions are extremely thin and we can help you duplicate them into your package to minimize dependencies.
You are correct that some combination of loosening dependencies and rewriting how we manage extensions would help your use case, but: 1) without knowing more about which deps are clashing, I don't know how to immediately fix it; 2) loosening dependencies always runs the risk of breaking someone else, who then needs to fiddle with their deps to find ones that mesh; and 3) this approach is not scalable, and we are trying to figure out a way to avoid manually releasing different versions of this package every time a user has a dependency issue.
Thanks @timsehn for sharing the background. Maybe you should at least add a note about the current state at the top of the README.
Thanks @max-hoffman for the suggestions. Using `--no-deps` is a way to hack around it, but there are multiple problems with this approach, which you are probably aware of: pip then knows nothing about the dependencies of `doltpy`, and some of them are required. Currently:
```
# pip freeze | grep dolt
doltpy @ file:///root/.cache/pypoetry/artifacts/d6/95/fe/feb697c1bdccf45552c7ce1602a3bac5504946c16b46233a42f31b09fa/doltpy-2.0.4.tar.gz
# pip install doltpy==2.0.13 --no-deps
...
# python -c 'from doltpy.cli import Dolt'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/root/.cache/pypoetry/virtualenvs/science-yd9zGFjU-py3.9/lib/python3.9/site-packages/doltpy/cli/__init__.py", line 1, in <module>
    from doltcli import Dolt, Commit, DoltException, DoltHubContext  # type: ignore
ModuleNotFoundError: No module named 'doltcli'
```
so I basically need to handle the essential dependency tree of `doltpy` myself, mirroring `doltpy`'s `pyproject.toml` into my `poetry.lock`, which would otherwise be standard, automatic behavior.

The whole request is to make clients' lives easier: let them add `doltpy` like any other library and expect to get the newest version, with sensible dependencies specified as ranges (not pins).
Replying to your questions:

> 1) without knowing more about which deps are clashing I don't know how to immediately fix your use case

This is about all the packages pinned in https://github.com/dolthub/doltpy/blob/main/pyproject.toml#L10, e.g. `attrs`, `pyparsing`.
> 2) lessening dependencies always runs the risk of breaking someone else who then needs to fiddle with their deps to find ones that mesh

I'm not sure I get it: if someone found that `doltpy`'s current pinned versions work for them, they can stay with them, while `doltpy` switches from `==` to `>=`, right? That doesn't make it harder to pick your own versions, only easier.
> 3) this approach is not scalable, and we are trying to figure out a way to avoid manually releasing different versions of this package every time a user has a dependency issue.

I agree. To get around that, you shouldn't pin the versions, but leave them as broad as possible for `doltpy`, so that as many clients as possible fit in the range.
Hi @kretes, thank you for taking the time to provide more context.
It would still be helpful if you shared your dependency clashes: package names and versions, if applicable.
Again, major version bumps are by definition backwards-incompatible under semantic versioning. Library developers do not always adhere to these guidelines, but the pattern generally describes a balancing act between two types of dependency problems. If dependency windows are too narrow, clients hit a type-1 error, where packages that can run together are rejected by the dependency resolver because some combination of dependency specs was overly strict (this is the issue you are experiencing). If dependency ranges are too wide, you get the opposite type of error, where dependency resolution builds a list of packages that don't actually run together. In my experience, the second type of problem is much more difficult to diagnose and fix as a client.
I think it is important to reflect on where we sit in this balancing act. The answer is that we are probably leaning too strict right now. But we are not going to open unbounded ranges for every dependency. We will loosen the ones you ask for, while planning longer-term solutions that remove Python dependency concerns altogether.
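To make that trade-off concrete in poetry's `pyproject.toml` terms (the package and versions below are illustrative, not doltpy's actual constraints):

```toml
[tool.poetry.dependencies]
# Exact pin: fully reproducible, but rejects otherwise-compatible
# environments (the type-1 error described above).
protobuf = "3.12.2"

# Bounded range: accepts any compatible release while still
# excluding the next, potentially breaking, major version.
# protobuf = ">=3.12.2,<4"
```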
> @timsehn: So, doltpy allows you to do those things without running a server. Is that what you want?

> @kretes: it looks to me like [we] just require pandas and CLI wrapping. It's that we don't want to write this functionality on our own
I'm in a similar position: using the pandas wrappers to pull Dolt tables into an ML training script. I'd rather not run the SQL server for this small task, as it becomes one more prerequisite to launching a training run. FWIW I also use `doltcli` to serialize branch, author, and commit hash for data versioning, but as that package lives in a separate repository for now, I'm assuming it's not relevant to the discussion here?

I really appreciate the expanded numpy range in #178, since that's a common dependency for many ML/data-science packages. However, I'm still running into other issues with the version pinning and have resorted to a separate `--no-deps` requirements.txt for `doltpy` and `doltcli`, as @max-hoffman suggested.
Here are some particularly problematic pins I've identified for various package combinations using `pip-compile`. Note that in all cases the output shows pins on the conflicting packages, but my input was always `pip-compile -o requirements.txt tmp.txt`, where `tmp.txt` does not contain any pins, in order to allow the greatest chance of resolution.
1) `protobuf` from ray[tune] + doltpy

```
protobuf<4.0.0,>=3.15.3 (from ray[tune]==2.0.0->-r tmp.txt (line 2))
protobuf==3.12.2 (from doltpy==2.0.13->-r tmp.txt (line 1))
```
2) `packaging` from lightning-flash[image] + doltpy (similar issues arise for `effdet`)

```
packaging>=20.0 (from scikit-image==0.19.3->albumentations==1.3.0->lightning-flash[image]==0.8.0->-r tmp.txt (line 2))
packaging>=17.0 (from pytorch-lightning==1.7.7->lightning-flash[image]==0.8.0->-r tmp.txt (line 2))
packaging>=20.9 (from huggingface-hub==0.10.0->timm==0.6.11->lightning-flash[image]==0.8.0->-r tmp.txt (line 2))
packaging (from lightning-flash[image]==0.8.0->-r tmp.txt (line 2))
packaging (from torchmetrics==0.9.3->lightning-flash[image]==0.8.0->-r tmp.txt (line 2))
packaging (from lightning-bolts==0.5.0->lightning-flash[image]==0.8.0->-r tmp.txt (line 2))
packaging==20.4 (from doltpy==2.0.13->-r tmp.txt (line 1))
```
3) `python-dateutil` from icevision + doltpy

```
python-dateutil>=2.8.1 (from pandas==1.5.0->doltpy==2.0.13->-r tmp.txt (line 1))
python-dateutil==2.8.1 (from doltpy==2.0.13->-r tmp.txt (line 1))
python-dateutil>=2.7 (from matplotlib==3.6.0->yolov5-icevision==6.0.0->icevision==0.12.0->-r tmp.txt (line 2))
python-dateutil>=2.8.2 (from jupyter-client==7.3.5->ipykernel==5.5.6->icevision==0.12.0->-r tmp.txt (line 2))
```
I'm confused about where the hard pins on `protobuf`, `packaging`, and `python-dateutil` are getting pulled into `doltpy`, since I don't see them in the `setup.py`.
> In my experience, the second type of problem is much more difficult to diagnose and fix as a client.

You assume that a Python user is an end user who consumes packages as-is. But Python packages are usually used by programmers as parts of their own software. So it makes no sense to always break on a purely artificial indicator like a major version number change, instead of breaking only on an actual incompatible API change; such changes usually don't affect most consumers of a library, so there is a good chance that even after a major version bump everything will still work fine.
I took a pass at this yesterday, and the dependency resolver has been spinning for 14+ hours with range constraints on those three packages. The dependency-resolution search space in Doltpy grows exponentially every time we loosen one of these. The `setup.py` was supposed to track the `pyproject.toml`, but has fallen out of maintenance since Doltpy was deprecated.
If I were you, I would just take `doltcli` plus the parquet read/write functions from doltpy. Your use case is legitimate, but we need to solve it in a different way: an embedded Dolt with Python bindings. Our customers pay for OLTP database features right now; until that changes, we are focusing on server improvements.
It may make sense to make most of the deps optional.
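One hedged way to do that is with poetry extras; the group names and version floors below are illustrative, not an actual doltpy configuration:

```toml
[tool.poetry.dependencies]
python = "^3.7"
# Heavy dependencies become optional...
pandas = { version = ">=1.0", optional = true }
numpy = { version = ">=1.19", optional = true }
SQLAlchemy = { version = ">=1.4", optional = true }

# ...and are grouped into named extras.
[tool.poetry.extras]
dataframe = ["pandas", "numpy"]
sql = ["SQLAlchemy"]
```

Engineers could then install the dependency-light core with `pip install doltpy`, while data-science users opt in via `pip install "doltpy[dataframe,sql]"`.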