jaraco opened 1 month ago
Dug up from my stars, some projects we could take inspiration (or code) from:
dependenpy is meant for dependencies between modules, but it actually scans all imports.
I guess that we should really write our own AST visitor to handle conditions on Python versions to infer the right markers :slightly_smiling_face:
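As a starting point, a plain `ast.NodeVisitor` that collects top-level imported names is only a few lines; the harder part (handling `sys.version_info` conditionals to infer environment markers) would build on top of it. A minimal sketch of just the collection step:

```python
import ast

class ImportCollector(ast.NodeVisitor):
    """Collect top-level imported module names from a source file."""

    def __init__(self):
        self.modules = set()

    def visit_Import(self, node):
        for alias in node.names:
            # keep only the top-level package: 'jaraco.context' -> 'jaraco'
            self.modules.add(alias.name.partition('.')[0])
        self.generic_visit(node)

    def visit_ImportFrom(self, node):
        # level > 0 means a relative import within the package itself; skip it
        if node.level == 0 and node.module:
            self.modules.add(node.module.partition('.')[0])
        self.generic_visit(node)

source = "import requests\nfrom dateutil import parser\nfrom . import sibling\n"
collector = ImportCollector()
collector.visit(ast.parse(source))
print(sorted(collector.modules))  # ['dateutil', 'requests']
```

Version-conditional imports would then be a matter of also visiting `If` nodes and checking whether the test compares `sys.version_info`, emitting a marker instead of an unconditional requirement.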
Initial indicators are that an LLM can do this work well. I asked Gemini for a basic result and it came up with the right answer:
I had to paste a screenshot as the "share" button isn't working currently.
Nice. I wonder how well it performs for projects whose distribution and package names differ.
I do have to say though that I think it's easily solved with code and shouldn't need LLMs. Querying a LLM for each build would feel wasteful IMO. Not sure if this was your idea or if you had something else in mind :)
EDIT: Maybe not so easily solved for the same reason: projects that have different dist/package names ^^
I chose `attr` because the distribution name is `attrs` (subtle, but different).
Ah right, I didn't know it was different :+1:
In 8b577b2, I connected the Gemini-backed import inference. At first blush, I was impressed. It detected `requests` and `google-generativeai` as dependencies. But then I ran it on `discovery.py`, and it failed with:

```
["dateutil", "jaraco.context", "jaraco.functools", "packaging", "pip-run", "requests", "setuptools-scm"]
```

That's mostly good, but `dateutil` isn't the distribution name. It's `python-dateutil`.
And it failed on `bootstrap.py`, finding `['pathlib']` as a requirement. And on `__main__.py`, it found `['runpy']`. It apparently doesn't even know when an import is from the stdlib. Even worse, on `__init__.py`, it read the requirements out of `__requires__`... maybe not a big surprise, but incorrect nonetheless.
So I'm convinced Gemini is not up to the task. Some have suggested Claude, so I'll try that and see if it's any better. If not, I'll abandon the AI approach for a more direct approach.
Claude does a little better. It at least recognizes that `python-dateutil` is needed for `dateutil` (although only when writing the requirements.txt and not when identifying the dependencies). Unfortunately, it fails to identify `jaraco.context` as a dependency.
Since the Gemini model seems more complete, it occurs to me that I could include things in the prompt to help keep it on the right track. Hints like "don't forget that `dateutil` is supplied by the `python-dateutil` package."
In 909ce57, I tried that, but the model continued to emit `dateutil` and started emitting `python-dateutil` for modules that had no related imports.
I think a key part here is that it's impossible to get this 100% right globally, so we'll need to find (project-specific) ways of hinting the system towards the right solution. In the systems I've seen a comment on the import is generally effective:
```python
from dateutil import *  # pkg:pypi/python-dateutil
```
Then once we have a reasonably ergonomic way of hinting, the goal of the system would be to minimize the amount of hints required.
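Hints in that comment form are easy to extract deterministically. A minimal sketch, assuming the trailing `# pkg:pypi/<dist>` purl-style syntax proposed above (the regex and function name here are illustrative, not an existing API):

```python
import re

# Matches an import line carrying a trailing "# pkg:pypi/<distribution>" hint.
HINT = re.compile(
    r'^\s*(?:from|import)\s+(?P<module>[\w.]+).*#\s*pkg:pypi/(?P<dist>[\w.-]+)'
)

def collect_hints(source: str) -> dict:
    """Map top-level module names to the distribution named in the hint."""
    hints = {}
    for line in source.splitlines():
        match = HINT.match(line)
        if match:
            module = match['module'].partition('.')[0]
            hints[module] = match['dist']
    return hints

src = "from dateutil import parser  # pkg:pypi/python-dateutil\n"
print(collect_hints(src))  # {'dateutil': 'python-dateutil'}
```

Any module covered by a hint would then bypass inference entirely, so minimizing hints reduces to minimizing the cases inference gets wrong.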
What about privacy? Or concerns about LLMs? IMO, instead of querying LLMs, we should search for a maintained map of package name to distribution name, or build such a map ourselves. Most projects use the same distribution and package name; surely it wouldn't be hard to update the map from time to time when there's a mismatch? We only have to keep track of the projects for which the package and distribution names differ.
Erm, every question and answer I can find on the web assumes the packages are already installed on disk... It's easy to get the package name from the distribution name by querying PyPI, but not the reverse.
It might be possible to scrape https://pypi.org/search/?q=dateutil to get relevant dists, then check their metadata to find one that provides a `dateutil` package. It wouldn't prevent incorrect results, though, as any dist can provide any package name.
Anyway, there will always be cases when we can't infer the right distribution name, and LLMs won't do better. For example when I use a recent fork (published to PyPI) that didn't change the package name. So @zsol's suggestion is probably the best thing we can do.
I think any use of LLMs isn't guaranteed to be deterministic. It would be a good idea to split the explored ideas into categories of deterministic and non-deterministic, and to prefer deterministic solutions for most cases. @zsol's suggestion is a very deterministic solution and I'm a huge fan of that approach—even though it's not so much "inferring" requirements then as hinting, as he said. A modest problem is that annotating near the imports allows the comments for the same import name to diverge between files, e.g.
```python
# ./foo.py
import dotenv  # pkg:pypi/dotenv

# ./bar.py
import dotenv  # pkg:pypi/python-dotenv
```
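That divergence is itself deterministically detectable, so a linting pass could refuse to proceed until the hints agree. A sketch, assuming hints have already been collected per file into a `{filename: {module: distribution}}` shape (hypothetical, not an existing API):

```python
from collections import defaultdict

def find_conflicts(hints_by_file):
    """Flag modules whose pkg:pypi hints disagree across files."""
    seen = defaultdict(set)
    for filename, hints in hints_by_file.items():
        for module, dist in hints.items():
            seen[module].add(dist)
    # keep only modules hinted at more than one distribution
    return {module: dists for module, dists in seen.items() if len(dists) > 1}

conflicts = find_conflicts({
    './foo.py': {'dotenv': 'dotenv'},
    './bar.py': {'dotenv': 'python-dotenv'},
})
print({m: sorted(d) for m, d in conflicts.items()})
# {'dotenv': ['dotenv', 'python-dotenv']}
```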
How nice would it be if dependencies could be inferred from the source code? Other non-public projects have achieved this with success (across languages).