schuylermartin45 opened 2 weeks ago
I'm unclear on the licensing implications of this change. I wouldn't consider the data derived from conda-forge here to be "source code", but I'm not sure how the legal interpretation of "source code" differs from my engineering interpretation.
That being said, I'm guessing we want to give some kind of attribution.
I imagine this goes one of two routes:
1) Copy what we did in `conda-recipe-manager-test-data`, and modify the top of the LICENSE file in this repo to include:

   Copyright (c) 2016-2018, conda-forge
   Copyright (c) 2016-2018, conda-forge contributors

   I'm guessing we could end in 2024 rather than 2018, but 2018 is the most recent year listed in their LICENSE file.
2) Copy the names actually found in `cf-graph-countyfair`:

   Copyright (c) 2018, Board of Trustees of Columbia University in the city of New York.
   Copyright (c) 2017, Peter M. Landwehr
   Copyright (c) 2017, Anthony Scopatz
   Copyright (c) 2016 Aaron Meurer, Gil Forsyth

   I imagine this file is very out of date and that there have been many contributors to the project after 2018 who are not captured here.
@jezdez @beeankha Thoughts?
You should read through https://github.com/conda/grayskull/issues/564 before you do much more on this PR. We are trying to clean up the mapping situation, and there are multiple existing solutions already. It'd be a real shame to introduce yet another variant into the mix instead of developing a single solution.
cc @maresb
Hold on, before we continue, let's be 100% clear on what data I'm pulling, because I (and others I've seen) have already gotten tripped up on this nuance.

I'm interested in the *import name* to conda package name mapping, NOT the PyPI name to conda package name mapping. They are not the same. In other words, I need to know that the conda package `pillow` gets imported with `import PIL`. Unless I'm missing something, a lot of the conversations I've seen on GitHub and in the Element chat refer to the PyPI mapping.
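To illustrate the distinction with a tiny sketch (the table below is hypothetical and hand-picked from well-known examples, not pulled from the actual `cf-graph-countyfair` data):

```python
# Hypothetical import-name -> conda-package-name table (illustrative only).
# Note this is NOT the PyPI-name mapping, which is a separate table
# (e.g. PyPI "Pillow" -> conda "pillow").
IMPORT_TO_CONDA = {
    "PIL": "pillow",   # `import PIL` is provided by the `pillow` package
    "yaml": "pyyaml",  # `import yaml` is provided by `pyyaml`
    "cv2": "opencv",   # `import cv2` is provided by `opencv`
}

def conda_package_for(import_name: str) -> str:
    # Most packages share their import name, so fall back to identity.
    return IMPORT_TO_CONDA.get(import_name, import_name)
```

The irregular cases are exactly the rows in the table; everything else maps to itself.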
As far as I know, this section in `cf-graph-countyfair` is the only data source for the import mapping, and this PR just caches that data to save on network requests. In this first iteration I don't need to constantly check for updates, since this data appears to be seldom updated.
If there is another way, let's talk. But I've already spent a few days trying to hunt down what data source to use. As far as I know, none of this is documented or explained anywhere. It is incredibly frustrating when everyone you talk to has a partial picture of the problem.
Ah cool. That's all fine. In any case, the URL you are using is not a supported API and could break. We should add an API to `conda-forge-metadata` to pull the file you need if one is not already there.
> As far as I know, none of this is documented or explained anywhere.
Yep, this is a big hole. We've not been able to close it yet and that is definitely largely my fault. The hope is that as we add more to conda-forge-metadata, we will add documentation and this will be the place to look.
Thank you for your patience!
Even if it is in an API, I'd still like to locally cache the results. Maybe it comes from my personal bias from working in the IoT and firmware space, but network calls have costs. I'd rather not incur API hits for data that appears to change pretty infrequently. So even if this ends up in `conda-forge-metadata`, I'll want some local form for redundancy and speed.
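A minimal sketch of the cache-first approach described here, assuming a generic fetch callback (the file name and function signatures are illustrative, not what this PR actually implements):

```python
import json
from pathlib import Path
from typing import Callable

def load_mapping(cache_path: Path, fetch: Callable[[], dict]) -> dict:
    """Return the import-name mapping, preferring the on-disk cache.

    `fetch` (e.g. an HTTP GET against a conda-forge-metadata endpoint)
    is only invoked on a cache miss, so once the ~300 KB table is
    cached locally, repeated calls cost zero network requests.
    """
    if cache_path.exists():
        return json.loads(cache_path.read_text())
    mapping = fetch()
    cache_path.write_text(json.dumps(mapping))
    return mapping
```

The cache doubles as redundancy: if the upstream URL breaks, the last fetched table still works.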
I'm not sure how long/if I have the time to add this to `conda-forge-metadata`, but I'll try to at least look. This PR might be a stopgap for our more immediate needs. At the very least, I'm going to make a note at the top of the script indicating that the `conda-forge-metadata` API exists and that we should consider using it in the future. Glad we talked through this!
OK, so it is at least partially implemented: `def map_import_to_package(import_name: str) -> str:`

But obviously I'd have to query every single string. The table this PR generates for the cache is only ~300 KB, but I'd imagine we'd want some pagination scheme to future-proof things if we allow for a full query of the mapping.
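One simple way a full-table query could be paginated is offset/limit pages over the sorted import names; a rough sketch of the idea (this function does not exist in `conda-forge-metadata`, it's just an illustration):

```python
def get_mapping_page(mapping: dict[str, str], offset: int = 0,
                     limit: int = 100) -> dict[str, str]:
    """Return one page of the mapping.

    Sorting the keys gives stable, reproducible pages, so clients can
    walk the whole table in `limit`-sized chunks instead of issuing one
    request per import name.
    """
    keys = sorted(mapping)[offset : offset + limit]
    return {k: mapping[k] for k in keys}
```

Walking pages until an empty one comes back would retrieve the full mapping in a bounded number of requests.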
I collected my thoughts and opened this against `conda-forge-metadata`: https://github.com/conda-forge/conda-forge-metadata/issues/55
We can add an API for the file `import_name_priority_mapping.json` you are using. That file is made from the underlying mapping results for each package, with some graph heuristics applied to resolve conflicts.
Resolves #181
The original ticket concept has changed a few times as others have pointed me to the `cf-graph-countyfair` data project. We now cache a modified version of conda-forge's mapping that attempts to cover irregular cases where the import name does not match the package name. In theory, this will make our Python dependency calculation efforts more accurate.