Develop a CRM script to handle import name mis matches

conda-incubator / conda-recipe-manager

A project for libraries and automated tools that manage and manipulate conda recipe files.

BSD 3-Clause "New" or "Revised" License

10 stars 8 forks source link

Develop a CRM script to handle import name mis matches #218

Open schuylermartin45 opened 2 weeks ago

schuylermartin45 commented 2 weeks ago

Resolves #181

The original ticket concept has changed a few times as others have pointed me to the cf-graph-countyfair data project.

We now cache a modified version of Conda Forge's mapping that attempts to map irregular cases where the "import name" does not match the package name.

In theory, this will make our Python dependency calculation efforts more accurate.

schuylermartin45 commented 2 weeks ago

I'm unclear on the licensing implications of this change. I wouldn't consider the data derived from Conda Forge here as "source code", but I'm not sure what the legal interpretation of "source code" is over my engineering interpretation.

That being said, I'm guessing we want to give some kind of attribution.

I imagine this goes one of two routes:

1) Copy what we did in conda-recipe-manager-test-data, and modify the top of the LICENSE file in this repo to include:

Copyright (c) 2016-2018, conda-forge
Copyright (c) 2016-2018, conda-forge contributors

I'm guessing we could end in 2024 over 2018, but 2018 is the most recent listed in their LICENSE file.

2) Copy the names actually found in cf-graph-countyfair:

Copyright (c) 2018, Board of Trustees of Columbia University in the city of New York.
Copyright (c) 2017, Peter M. Landwehr
Copyright (c) 2017, Anthony Scopatz
Copyright (c) 2016 Aaron Meurer, Gil Forsyth

I imagine this file is very out of date and there have been many contributors post 2018 on the project not captured here.

@jezdez @beeankha Thoughts?

beckermr commented 2 weeks ago

You should read through https://github.com/conda/grayskull/issues/564 before you do much more on this PR. We are trying to cleanup the mapping situation and there are multiple existing solutions already. It'd be a real shame to introduce yet another variant into the mix instead of developing a single solution.

cc @maresb

schuylermartin45 commented 2 weeks ago

Hold on, before we continue, let us be 100% clear on what data I'm pulling, because I (and have seen others) have already gotten tripped up on this nuance.

I'm interested in import name to conda package name mapping, NOT the pypi to conda package name mapping.

They are not the same. In other words, I need to know that the conda package pillow gets imported with import PIL. Unless I'm missing something, a lot of the conversations I've seen on GitHub and the Element chat refer to the pypi mapping.

As far as I know, this section in cf-countyfair is the only datasource for the import mapping and this PR is just caching that data to save on network requests. In this first iteration, I don't need to constantly check for updates, it looks like this data is seldom updated.

If there is another way, let's talk. But I've already spent a few days trying to hunt down what data source to use. As far as I know, none of this is documented or explained anywhere. It is incredibly frustrating when everyone you talk to has a partial picture of the problem.

beckermr commented 2 weeks ago

Ah cool. That's all fine. In any case, the url you are using is not a supported api and could break. We should add an api to conda-forge-metadata to pull the file you need if it is not there.

beckermr commented 2 weeks ago

As far as I know, none of this is documented or explained anywhere.

Yep, this is a big hole. We've not been able to close it yet and that is definitely largely my fault. The hope is that as we add more to conda-forge-metadata, we will add documentation and this will be the place to look.

Thank you for your patience!

schuylermartin45 commented 2 weeks ago

Even if it is in an API, I'd still like to locally cache the results. Maybe it comes from my personal bias working in the IoT and firmware space, but network calls have costs. I'd rather not incur API hits on data that appears to change pretty infrequently.

So even if this ends up on conda-forge-metadata, I'll want some local form for redundancy and speed.

I'm not sure how long/if I have the time to add this to conda-forge-metadata but I'll try to at least look. This PR might be a stop gap for our more immediate needs.

At the very least, I'm going to make a note at the top of the script indicating that the conda-forge-metadata API exists and we should consider using it in the future. Glad we talked through this!

schuylermartin45 commented 2 weeks ago

Ok, so it is at least partially implemented: def map_import_to_package(import_name: str) -> str:

But obviously I'd have to query every single string. The table this PR generates for the cache is only ~300kb, but I'd imagine we'd want some pagination scheme to future-proof, if we allow for a fully query of the mapping.

schuylermartin45 commented 2 weeks ago

I collected my thoughts together and opened this against conda-forge-metadata: https://github.com/conda-forge/conda-forge-metadata/issues/55

beckermr commented 2 weeks ago

We can add an API to the file import_name_priority_mapping.json you are using. That file is made from the underlying mapping results from each package with some graph heuristics applied to resolve conflicts.