Quansight-Labs / czi-conda-forge-mgmt

🚀 Top level project management for conda-forge CZI grant
https://github.com/orgs/Quansight-Labs/projects/10
BSD 3-Clause "New" or "Revised" License

Files-to-artifacts database / API / mapping #54

Closed by jaimergp 2 weeks ago

jaimergp commented 4 months ago

Provide a way for users to find which package(s) provide a certain file (e.g. a header, or a library, or an executable), similar to what portals like pkgs.org do.

We do have the info in the database designed in https://github.com/Quansight-Labs/conda-forge-db, but we need to serve it somewhere, preferably serverless or close-to-zero maintenance (e.g. one-click deployment). This is tricky because populating the database from scratch has a non-negligible overhead.

### Tasks
- [ ] https://github.com/Quansight-Labs/czi-conda-forge-mgmt/issues/58
- [ ] https://github.com/Quansight-Labs/czi-conda-forge-mgmt/issues/59

jaimergp commented 3 months ago

We talked with Matt last week and we may be able to unblock this. The main issue is deployment and maintenance of infrastructure. We have several avenues to explore:

jaimergp commented 3 months ago

@zklaus shared some progress about the git-db prototype in today's mgmt call. Can you add some summary here? 🙏

Also some numbers to give an idea of the scale we are dealing with:

zklaus commented 3 months ago

The main idea is to store the mapping in a bare git repository. By using libgit2 via its Python binding pygit2, we avoid the need to create a huge tree on the filesystem. I have created a prototype at https://github.com/zklaus/cfgraphman which can add individual artifacts from their JSON info to the Git odb. It remains to be seen how this scales; I will investigate that over this week and next, at most.

jaimergp commented 2 weeks ago

https://github.com/jaimergp/conda-forge-paths is now ready as a self-updating SQLite releaser, and the released database is queried by a VM in the GPU CI server. This VM has a systemd config and a crontab entry that downloads the latest SQLite dump every Tuesday and restarts the datasette instance, et voilà. A bit barebones but I think it will work.
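
For a sense of what a "which artifact ships this file?" lookup against such a dump could look like, here is a minimal sketch using only Python's `sqlite3`. The schema is an assumption, not the real one from conda-forge-paths: a single hypothetical table `paths(path, artifact)`, filled with made-up rows.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical paths table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE paths (path TEXT NOT NULL, artifact TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO paths (path, artifact) VALUES (?, ?)",
        [
            # Made-up artifacts and file lists, for illustration only.
            ("include/zlib.h", "pkg-a-1.0-0"),
            ("lib/libz.so", "pkg-a-1.0-0"),
            ("include/zlib.h", "pkg-b-2.0-0"),
            ("bin/tool", "pkg-c-0.1-0"),
        ],
    )
    return conn


def find_artifacts(conn: sqlite3.Connection, suffix: str) -> list[str]:
    """Return the artifacts that contain a path ending in `suffix`."""
    # substr(path, -N) yields the last N characters, so this is an exact
    # suffix match and avoids LIKE treating "_" or "%" as wildcards.
    rows = conn.execute(
        "SELECT DISTINCT artifact FROM paths "
        "WHERE substr(path, -length(:suffix)) = :suffix "
        "ORDER BY artifact",
        {"suffix": suffix},
    )
    return [artifact for (artifact,) in rows]
```

With the demo rows, `find_artifacts(build_demo_db(), "zlib.h")` returns both `pkg-a-1.0-0` and `pkg-b-2.0-0`. A suffix match like this cannot use a plain index on `path`, so at real scale an exact-path lookup or a full-text index would probably be the better trade-off.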

https://conda-metadata-app.streamlit.app/Search_by_file_path has the UI-friendly prototype :)