jaraco opened 1 month ago
Dug up from my stars, some projects we could take inspiration (or code) from:
dependenpy is meant for dependencies between modules, but it actually scans all imports.
I guess that we should really write our own AST visitor to handle conditions on Python versions to infer the right markers :slightly_smiling_face:
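As a starting point, a plain `ast.NodeVisitor` that collects top-level imported names is only a few lines; the harder part (handling `sys.version_info` conditionals to infer environment markers) would build on top of it. A minimal sketch of just the collection step:

```python
import ast

class ImportCollector(ast.NodeVisitor):
    """Collect top-level imported module names from a source file."""

    def __init__(self):
        self.modules = set()

    def visit_Import(self, node):
        for alias in node.names:
            # keep only the top-level package: 'jaraco.context' -> 'jaraco'
            self.modules.add(alias.name.partition('.')[0])
        self.generic_visit(node)

    def visit_ImportFrom(self, node):
        # level > 0 means a relative import within the package itself; skip it
        if node.level == 0 and node.module:
            self.modules.add(node.module.partition('.')[0])
        self.generic_visit(node)

source = "import requests\nfrom dateutil import parser\nfrom . import sibling\n"
collector = ImportCollector()
collector.visit(ast.parse(source))
print(sorted(collector.modules))  # ['dateutil', 'requests']
```

Version-conditional imports would then be a matter of also visiting `If` nodes and checking whether the test compares `sys.version_info`, emitting a marker instead of an unconditional requirement.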
Initial indicators are that an LLM can do this work well. I asked Gemini for a basic result and it came up with the right answer:
I had to paste a screenshot as the "share" button isn't working currently.
Nice. I wonder how well it performs for projects whose distribution and package names differ.
I do have to say though that I think it's easily solved with code and shouldn't need LLMs. Querying a LLM for each build would feel wasteful IMO. Not sure if this was your idea or if you had something else in mind :)
EDIT: Maybe not so easily solved for the same reason: projects that have different dist/package names ^^
I chose `attr` because the distribution name is `attrs` (subtle, but different).
Ah right, I didn't know it was different :+1:
In 8b577b2, I connected the Gemini-backed import inference. At first blush, I was impressed. It detected `requests` and `google-generativeai` as dependencies. But then I ran it on `discovery.py`, and it failed with:

```
["dateutil", "jaraco.context", "jaraco.functools", "packaging", "pip-run", "requests", "setuptools-scm"]
```

That's mostly good, but `dateutil` isn't the distribution name. It's `python-dateutil`.
And it failed on `bootstrap.py`, finding `['pathlib']` as a requirement. And on `__main__.py`, it found `['runpy']`. It apparently doesn't even know when an import is from the stdlib. Even worse, on `__init__.py`, it read the requirements out of `__requires__`... maybe not a big surprise, but incorrect nonetheless.
So I'm convinced Gemini is not up to the task. Some have suggested Claude, so I'll try that and see if it's any better. If not, I'll abandon the AI approach for a more direct approach.
Claude does a little better. It at least recognizes that `python-dateutil` is needed for `dateutil` (although only when writing the requirements.txt and not when identifying the dependencies). Unfortunately, it fails to identify `jaraco.context` as a dependency.
Since the Gemini model seems more complete, it occurs to me that I could include things in the prompt to help keep it on the right track. Hints like "don't forget that `dateutil` is supplied by the `python-dateutil` package."
In 909ce57, I tried that, but the model continued to emit `dateutil` and started emitting `python-dateutil` for modules that had no related imports.
I think a key part here is that it's impossible to get this 100% right globally, so we'll need to find (project-specific) ways of hinting the system towards the right solution. In the systems I've seen a comment on the import is generally effective:
```python
from dateutil import *  # pkg:pypi/python-dateutil
```
Then once we have a reasonably ergonomic way of hinting, the goal of the system would be to minimize the amount of hints required.
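Hints in that comment form are easy to extract deterministically. A minimal sketch, assuming the trailing `# pkg:pypi/<dist>` purl-style syntax proposed above (the regex and function name here are illustrative, not an existing API):

```python
import re

# Matches an import line carrying a trailing "# pkg:pypi/<distribution>" hint.
HINT = re.compile(
    r'^\s*(?:from|import)\s+(?P<module>[\w.]+).*#\s*pkg:pypi/(?P<dist>[\w.-]+)'
)

def collect_hints(source: str) -> dict:
    """Map top-level module names to the distribution named in the hint."""
    hints = {}
    for line in source.splitlines():
        match = HINT.match(line)
        if match:
            module = match['module'].partition('.')[0]
            hints[module] = match['dist']
    return hints

src = "from dateutil import parser  # pkg:pypi/python-dateutil\n"
print(collect_hints(src))  # {'dateutil': 'python-dateutil'}
```

Any module covered by a hint would then bypass inference entirely, so minimizing hints reduces to minimizing the cases inference gets wrong.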
What about privacy? Or concerns about LLMs? IMO, instead of querying LLMs, we should search for a maintained map of package name to distribution name, or build such a map ourselves. Most projects use the same distribution and package name; surely it wouldn't be hard to update the map from time to time when there's a mismatch? We only have to keep track of the projects for which the package and distribution names differ.
Erm, every question and answer I can find on the web assumes the packages are already installed on disk... It's easy to get the package name from the distribution name by querying PyPI, but not the reverse.
It might be possible to scrape https://pypi.org/search/?q=dateutil to get relevant dists, then check their metadata to find one that provides a `dateutil` package. It wouldn't prevent incorrect results, though, as any dist can provide any package name.
Anyway, there will always be cases when we can't infer the right distribution name, and LLMs won't do better. For example when I use a recent fork (published to PyPI) that didn't change the package name. So @zsol's suggestion is probably the best thing we can do.
I think any use of LLMs isn't guaranteed to be deterministic. It would be a good idea to split the explored ideas into categories of deterministic and non-deterministic, and to prefer deterministic solutions for most cases. @zsol's suggestion is a very deterministic solution and I'm a huge fan of that approach—even though it's not so much "inferring" requirements then as hinting, as he said. A modest problem is that annotating near the imports allows the comments for the same import name to diverge between files, e.g.
```python
# ./foo.py
import dotenv  # pkg:pypi/dotenv

# ./bar.py
import dotenv  # pkg:pypi/python-dotenv
```
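That divergence is itself deterministically detectable, so a linting pass could refuse to proceed until the hints agree. A sketch, assuming hints have already been collected per file into a `{filename: {module: distribution}}` shape (hypothetical, not an existing API):

```python
from collections import defaultdict

def find_conflicts(hints_by_file):
    """Flag modules whose pkg:pypi hints disagree across files."""
    seen = defaultdict(set)
    for filename, hints in hints_by_file.items():
        for module, dist in hints.items():
            seen[module].add(dist)
    # keep only modules hinted at more than one distribution
    return {module: dists for module, dists in seen.items() if len(dists) > 1}

conflicts = find_conflicts({
    './foo.py': {'dotenv': 'dotenv'},
    './bar.py': {'dotenv': 'python-dotenv'},
})
print({m: sorted(d) for m, d in conflicts.items()})
# {'dotenv': ['dotenv', 'python-dotenv']}
```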
How nice would it be if dependencies could be inferred from the source code? Other non-public projects have achieved this with success (across languages).