Open KOLANICH opened 3 years ago
What functionality do you suggest we make optional?
`numpy`, `pandas`: relevant mostly for data scientists.
`SQLAlchemy`: not relevant for everyone who writes their own queries manually.
`six`, `pyparsing`, `packaging`, `pluggy`, `protobuf`, `py`, `dateutil`, `pytz`, `wcwidth`, `more-itertools`: not used directly at all; these can and should simply be removed.
Also: the `<` part of `>=X; <Y` constraints should almost never be used, for the same reason.

I think on the version pinning I mostly agree with you; this article lays out your case. We actually originally built this package mostly to run in Docker containers, hence the widespread use of pinning. It's now aimed more at being "reusable" rather than "reproducible", so that's a good shout.
As far as dependencies go, we will remove anything that's not being used; agreed that it shouldn't be in there. `SQLAlchemy`, on the other hand, is used for our database interactions, and Dolt is a SQL database, so it's important for the `doltpy.sql` package, as the higher-level tools are built on top of that package.
Can I ask what your use-case is? Obviously we want to develop the package in ways that make sense for our users, so any help you can give me here would be appreciated.
@max-hoffman I think the action items here are:
Can I ask what your use-case is?
Not quite a use case; I don't use this currently. I just saw news about a release on an IT-related resource and thought it might make sense to play around with this DB a bit.

Currently I use SQLite for storage of datasets. I have no use cases for versioning at present, because schema migration is done manually. Usually it happens as follows: rows are inserted in batches, and every N rows `commit` is called; N is picked so that the time between iterations is about a second. It is an extremely long process and can take a week of constant work on an HDD. The data has a primary key of 2 columns, but I have done a nasty hack: I packed the 2 into a single integer, which is the rowid. Queries by rowid are fast in SQLite, but random inserts are not. I have written a new importer that performs much better, because it reduces random writes by using some RAM to pool close rowids together, but I haven't tested it enough to rely on it in prod.

@KOLANICH this partial solution might be of interest to you: https://github.com/dolthub/doltcli
It's a bit easier for me to carve out a full-featured, dependency-less subsection of doltpy for engineers than to rewrite the data-science portions.
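As a side note, the rowid-packing hack described above (two integer key columns folded into a single SQLite rowid) could look roughly like the sketch below; the 32/32-bit split is an assumption for illustration, not something stated in the thread:

```python
MASK32 = (1 << 32) - 1  # low 32 bits

def pack_rowid(hi: int, lo: int) -> int:
    """Pack a 2-column integer primary key into one 64-bit rowid so
    SQLite can use its fast integer-rowid lookups. The 32/32 split
    is assumed for illustration."""
    assert 0 <= hi <= MASK32 and 0 <= lo <= MASK32
    return (hi << 32) | lo

def unpack_rowid(rowid: int) -> tuple[int, int]:
    """Inverse of pack_rowid: recover the two key columns."""
    return rowid >> 32, rowid & MASK32
```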
Thanks. Though I don't understand why it is called `doltcli` if it contains no CLI tool. It may make sense to rename that one to just `dolt` (and the GitHub repo to `dolt.py`), and this one to `dolt-datascience`.
It's an internal distinction we make between using dolt's CLI and SQL interfaces. In the future we'd like to use a raw SQL session with pymysql or another connector to implement all of the version control features. Right now shelling out to the bash/CLI commands is the most reliable way for people to use dolt in python.
You're right that the code in the python package isn't a CLI -- but that's what dolt is for, and also the PyPI package name `dolt` is taken.
Bumping this topic: we use Dolt in a DS project, and while it's OK that it has all the dependencies it has, it is inconvenient to have many of them pinned at some old version.

So far we've been selecting `doltpy == 2.0.4`, as this was the best version `poetry` resolved for us given our other dependencies. Now we would like to use new `doltpy` features (e.g. https://github.com/dolthub/doltpy/pull/183), and if we force `doltpy >= 2.0.13` we get them, but we also get lots of dependencies downgraded, which is a pain, as some might be non-patched versions with potential issues.
What specific `doltpy` features do you like?

Our strategy is to move most `doltpy` functionality into SQL itself, so you can use a standard SQL connector or an ORM like `sqlalchemy` to access Dolt through a Python interface. As you can tell, a lot of the extra functionality of `doltpy` is a little hard to maintain dependency-wise.
We're using `doltpy` mostly as a Python API for Dolt ;] We mostly use the data-access API: `write_pandas`, `read_pandas`, `read_pandas_sql`, `Dolt.sql`, but also some utilities like `Dolt.ls()`, `Dolt.branch()`, `Dolt.push`.

I probably don't see all the complications, but it looks to me like it just requires pandas and CLI wrapping. It's just that we don't want to write this functionality on our own.
I don't see that many dependencies in https://github.com/dolthub/doltpy/blob/main/pyproject.toml#L10 - wouldn't it just work if you made the version constraints `>=`?
So, `doltpy` allows you to do those things without running a server. Is that what you want?

If you start a `dolt sql-server` and then connect using a MySQL connector, all the equivalent functions exist; you just send SQL queries instead. That's the method we support more actively. It has a number of advantages:
The disadvantage is that you have to run a `dolt sql-server`, so there is a bit more of a bootstrapping problem when interacting with Dolt.
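For concreteness, a hedged sketch of the server route: once a `dolt sql-server` is running, any MySQL connector works, and version control becomes plain SQL. The `dolt_call` helper below is hypothetical; `DOLT_COMMIT` and the `dolt_log` system table come from Dolt's SQL interface:

```python
def dolt_call(procedure: str, *args: str) -> str:
    """Hypothetical helper: build a CALL statement for one of
    Dolt's version-control stored procedures (e.g. DOLT_COMMIT,
    DOLT_CHECKOUT), single-quoting each string argument."""
    quoted = ", ".join("'" + a.replace("'", "''") + "'" for a in args)
    return "CALL {}({});".format(procedure, quoted)

# With a running `dolt sql-server`, any MySQL connector can send these,
# e.g. (sketch, not executed here):
#   import pymysql
#   conn = pymysql.connect(host="127.0.0.1", user="root", database="mydb")
#   with conn.cursor() as cur:
#       cur.execute("SELECT * FROM dolt_log")  # commit history
#       cur.execute(dolt_call("DOLT_COMMIT", "-am", "update tables"))
```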
Sorry to distract from the core of this ticket. I just want people who show up here to understand what our priorities are.
In parallel, I'm looking for a resource internally to take more active ownership of `doltpy`. The person who conceived of it and built it has left the company.
@kretes Without more info this could require a lot of back and forth. Depending on the context in which you are using `doltpy`, these options can unblock you immediately without action on our part:

1) Use pip with `--no-deps`; you get full control, and as long as this is an internal app your dependencies do not have to be externally consistent like a library's. I see that you are using `poetry`, not pip, and I can understand that this is undesirable if you are writing a library package.
2) If you are writing a library and need to be dependency-less, the pandas helper functions are extremely thin and we can help you duplicate them into your package to minimize dependencies.
You are correct that some combination of loosening dependencies and rewriting how we manage extensions would help your use case, but: 1) without knowing more about which deps are clashing, I don't know how to immediately fix it; 2) loosening dependencies always runs the risk of breaking someone else, who then needs to fiddle with their deps to find ones that mesh; and 3) this approach is not scalable, and we are trying to figure out a way to avoid manually releasing different versions of this package every time a user has a dependency issue.
Thanks @timsehn for sharing the background. Maybe you should at least add a note about the current state at the top of the README.
Thanks @max-hoffman for the suggestions. Using `--no-deps` is a way to hack around it, but there are multiple problems with this approach, which you are probably aware of: pip then knows nothing about the dependencies of `doltpy`, and some of them are required. Currently:
```
# pip freeze | grep dolt
doltpy @ file:///root/.cache/pypoetry/artifacts/d6/95/fe/feb697c1bdccf45552c7ce1602a3bac5504946c16b46233a42f31b09fa/doltpy-2.0.4.tar.gz
# pip install doltpy==2.0.13 --no-deps
...
# python -c 'from doltpy.cli import Dolt'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/root/.cache/pypoetry/virtualenvs/science-yd9zGFjU-py3.9/lib/python3.9/site-packages/doltpy/cli/__init__.py", line 1, in <module>
    from doltcli import Dolt, Commit, DoltException, DoltHubContext  # type: ignore
ModuleNotFoundError: No module named 'doltcli'
```
so I basically need to handle the essential dependency tree of `doltpy` myself, mirroring `doltpy`'s `pyproject.toml` into my `poetry.lock`, which would otherwise be standard, automatic behavior.

The whole request is to make clients' lives easier: let them add `doltpy` like any other library and expect to get the newest version, with sensible dependencies specified as ranges (not pins).
Replying to your questions:

> 1) without knowing more about which deps are clashing I don't know how to immediately fix your use case

This is about all the packages pinned in https://github.com/dolthub/doltpy/blob/main/pyproject.toml#L10, e.g. `attrs`, `pyparsing`.
> 2) lessening dependencies always runs the risk of breaking someone else who then needs to fiddle with their deps to find ones that mesh

I'm not sure I get it: if someone found that `doltpy`'s current pinned versions work for them, they can stay with them, while `doltpy` switches from `==` to `>=`, right? That doesn't make it harder to pick your own versions, only easier.
> 3) this approach is not scalable, and we are trying to figure out a way to avoid manually releasing different versions of this package every time a user has a dependency issue.

I agree. To get around that, you shouldn't pin the versions, but leave them as broad as possible for `doltpy`, so that as many clients as possible fit in the range.
Hi @kretes, thank you for taking the time to provide more context.
It would still be helpful if you shared your dependency clashes: package names and versions, if applicable.
Again, major version bumps are by definition backwards-incompatible under semantic versioning. Library developers do not always adhere to these guidelines, but the pattern generally describes a balancing act between two types of dependency problems. If dependency windows are too narrow, clients hit a type-1 error, where packages that can run together are rejected by the dependency resolver because some combination of dependency specs was overly strict (this is the issue you are experiencing). If dependency ranges are too wide, you get the opposite type of error, where dependency resolution builds a list of packages that don't actually run together. In my experience, the second type of problem is much more difficult to diagnose and fix as a client.
I think it is important to reflect on where we sit in this balancing act. The answer is that we are probably leaning too strict right now. But we are not going to open unbounded ranges for every dependency. We will loosen the ones you ask for, while planning longer-term solutions that remove Python dependency concerns altogether.
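To make that trade-off concrete in poetry's `pyproject.toml` terms (the package and versions below are illustrative, not doltpy's actual constraints):

```toml
[tool.poetry.dependencies]
# Exact pin: fully reproducible, but rejects otherwise-compatible
# environments (the type-1 error described above).
protobuf = "3.12.2"

# Bounded range: accepts any compatible release while still
# excluding the next, potentially breaking, major version.
# protobuf = ">=3.12.2,<4"
```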
> @timsehn: So, doltpy allows you to do those things without running a server. Is that what you want?

> @kretes: it looks to me like [we] just require pandas and CLI wrapping. It's that we don't want to write this functionality on our own
I'm in a similar position: using the pandas wrappers to pull Dolt tables into an ML training script. I'd rather not run the SQL server for this small task, as it becomes one more prerequisite to launching a training run. FWIW I also use `doltcli` to serialize branch, author, and commit hash for data versioning, but as that package lives in a separate repository for now, I'm assuming it's not relevant to the discussion here?

I really appreciate the expanded numpy range in #178, since that's a common dependency for many ML/data-science packages. However, I'm still running into other issues with the version pinning and have resorted to a separate `--no-deps` requirements.txt for `doltpy` and `doltcli`, as @max-hoffman suggested.
Here are some particularly problematic pins I've identified for various package combinations using `pip-compile`. Note that in all cases the output shows pins on the conflicting packages, but my input was always `pip-compile -o requirements.txt tmp.txt`, where `tmp.txt` does not contain any pins, in order to allow the greatest chance of resolution.
1) `protobuf` from ray[tune] + doltpy

```
protobuf<4.0.0,>=3.15.3 (from ray[tune]==2.0.0->-r tmp.txt (line 2))
protobuf==3.12.2 (from doltpy==2.0.13->-r tmp.txt (line 1))
```
2) `packaging` from lightning-flash[image] + doltpy (similar issues arise for `effdet`)

```
packaging>=20.0 (from scikit-image==0.19.3->albumentations==1.3.0->lightning-flash[image]==0.8.0->-r tmp.txt (line 2))
packaging>=17.0 (from pytorch-lightning==1.7.7->lightning-flash[image]==0.8.0->-r tmp.txt (line 2))
packaging>=20.9 (from huggingface-hub==0.10.0->timm==0.6.11->lightning-flash[image]==0.8.0->-r tmp.txt (line 2))
packaging (from lightning-flash[image]==0.8.0->-r tmp.txt (line 2))
packaging (from torchmetrics==0.9.3->lightning-flash[image]==0.8.0->-r tmp.txt (line 2))
packaging (from lightning-bolts==0.5.0->lightning-flash[image]==0.8.0->-r tmp.txt (line 2))
packaging==20.4 (from doltpy==2.0.13->-r tmp.txt (line 1))
```
3) `python-dateutil` from icevision + doltpy

```
python-dateutil>=2.8.1 (from pandas==1.5.0->doltpy==2.0.13->-r tmp.txt (line 1))
python-dateutil==2.8.1 (from doltpy==2.0.13->-r tmp.txt (line 1))
python-dateutil>=2.7 (from matplotlib==3.6.0->yolov5-icevision==6.0.0->icevision==0.12.0->-r tmp.txt (line 2))
python-dateutil>=2.8.2 (from jupyter-client==7.3.5->ipykernel==5.5.6->icevision==0.12.0->-r tmp.txt (line 2))
```
I'm confused about where the hard pins on `protobuf`, `packaging`, and `python-dateutil` are getting pulled into `doltpy`, since I don't see them in the `setup.py`.
> In my experience, the second type of problem is much more difficult to diagnose and fix as a client.

You assume that a Python user is an end user who consumes packages as-is. But Python packages are usually used by programmers as parts of their own software. So it makes no sense to always break on a purely artificial indicator like a major version number change, instead of breaking only on an actual incompatible API change; such changes usually don't affect most consumers of a library, so there is a good chance that even after a major version bump everything will still work fine.
I took a pass at this yesterday, and the dependency resolver has been spinning for 14+ hours with range constraints on those three packages. The dependency-resolution search space in Doltpy grows exponentially every time we loosen one of these. The `setup.py` was supposed to track the `pyproject.toml`, but has fallen out of maintenance since Doltpy was deprecated.
If I were you, I would just take `doltcli` plus the parquet read/write functions from doltpy. Your use case is legitimate, but we need to solve it in a different way: an embedded Dolt with Python bindings. Our customers pay for OLTP database features right now; until that changes, we are focusing on server improvements.
It may make sense to make most of the deps optional.
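One hedged way to do that is with poetry extras; the group names and version floors below are illustrative, not an actual doltpy configuration:

```toml
[tool.poetry.dependencies]
python = "^3.7"
# Heavy dependencies become optional...
pandas = { version = ">=1.0", optional = true }
numpy = { version = ">=1.19", optional = true }
SQLAlchemy = { version = ">=1.4", optional = true }

# ...and are grouped into named extras.
[tool.poetry.extras]
dataframe = ["pandas", "numpy"]
sql = ["SQLAlchemy"]
```

Engineers could then install the dependency-light core with `pip install doltpy`, while data-science users opt in via `pip install "doltpy[dataframe,sql]"`.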