Quansight-Labs / czi-conda-forge-mgmt

🚀 Top level project management for conda-forge CZI grant
https://github.com/orgs/Quansight-Labs/projects/10
BSD 3-Clause "New" or "Revised" License

Sqlite-based files-to-artifacts database deployment #59

Closed: jaimergp closed this 2 weeks ago

jaimergp commented 3 months ago

This task consists of building a SQLite database with all the package metadata, equivalent to the deprecated regro/libcfgraph:/artifacts repository.

jaimergp commented 3 months ago

My findings so far:

The code is available in this repository: https://github.com/jaimergp/conda-forge-paths. I added a GHA workflow, but the runner dies trying to clone libcfgraph 🚀 😂 My plan is to upload a couple of database.zst files to GH releases and use that as a starting point.

jaimergp commented 3 months ago

Hm, I learnt about RETURNING and realized we can store the artifacts on the go at no cost and store their IDs next to each path instead, which should have little cost at query time. I also added full-text search to enable partial searches, and that didn't change the size significantly. This all means that with this new approach the uncompressed database is only 8.8GB! The compressed size doesn't change much: 634MB.
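For readers following along, here is a minimal sketch of the RETURNING idea, assuming a two-table layout where each path row holds an artifact ID rather than the artifact name. The table names, column names, and artifact names are hypothetical and not taken from conda-forge-paths; RETURNING needs SQLite 3.35 or newer.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not the real database layout.
SCHEMA = """
CREATE TABLE artifacts (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE paths (
    path        TEXT NOT NULL,
    artifact_id INTEGER NOT NULL REFERENCES artifacts(id)
);
CREATE INDEX paths_path_idx ON paths(path);
"""


def add_artifact(conn, name, paths):
    """Insert one artifact and its file paths; return the new artifact ID."""
    # RETURNING hands back the generated ID in the same statement,
    # so no second lookup query is needed.
    (artifact_id,) = conn.execute(
        "INSERT INTO artifacts (name) VALUES (?) RETURNING id", (name,)
    ).fetchone()
    conn.executemany(
        "INSERT INTO paths (path, artifact_id) VALUES (?, ?)",
        [(p, artifact_id) for p in paths],
    )
    return artifact_id


def artifacts_for_path(conn, path):
    """Resolve a file path back to artifact names with a single join."""
    rows = conn.execute(
        "SELECT a.name FROM paths AS p "
        "JOIN artifacts AS a ON a.id = p.artifact_id "
        "WHERE p.path = ? ORDER BY a.name",
        (path,),
    )
    return [name for (name,) in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Made-up artifact names, for illustration only.
add_artifact(conn, "linux-64/example-1.0-0.conda", ["lib/libexample.so", "include/example.h"])
add_artifact(conn, "linux-64/example-1.1-0.conda", ["lib/libexample.so"])
print(artifacts_for_path(conn, "lib/libexample.so"))
```

The join is the only extra work at query time, which is why storing IDs instead of repeated artifact names should cost little.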

We also get a new table for free: all the artifacts, and I also stored the timestamps, which will be useful at update time.
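One way the stored timestamps might help at update time, sketched under the assumption that an update only needs to process artifacts newer than anything already in the database. The `artifacts` columns, the `pending` helper, and the sample data are hypothetical; the actual update logic is not described in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical artifacts table with a timestamp column.
conn.execute(
    "CREATE TABLE artifacts ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE, timestamp INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO artifacts (name, timestamp) VALUES (?, ?)",
    [("noarch/a-1.0-0.conda", 100), ("noarch/b-1.0-0.conda", 200)],
)


def newest_timestamp(conn):
    """Return the most recent stored timestamp, or 0 for an empty table."""
    (ts,) = conn.execute(
        "SELECT COALESCE(MAX(timestamp), 0) FROM artifacts"
    ).fetchone()
    return ts


def pending(conn, upstream):
    """Keep only the (name, timestamp) pairs newer than what is stored."""
    cutoff = newest_timestamp(conn)
    return [(name, ts) for name, ts in upstream if ts > cutoff]


# Made-up upstream listing: only the last entry is newer than the database.
upstream = [
    ("noarch/a-1.0-0.conda", 100),
    ("noarch/b-1.0-0.conda", 200),
    ("noarch/c-1.0-0.conda", 300),
]
print(pending(conn, upstream))
```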

The https://github.com/jaimergp/conda-forge-paths repo is now up-to-date, and includes a datasette example.

$ ll path_to_artifacts.*
-rw-r--r--  1 jrodriguez  staff   8.8G Mar  8 16:07 path_to_artifacts.db
-rw-r--r--  1 jrodriguez  staff   634M Mar  8 16:30 path_to_artifacts.tar.zst

jaimergp commented 3 months ago

Demo search is now available at https://conda-metadata-app.streamlit.app/Search_by_file_path
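As an illustration of how a partial path search could sit on top of the full-text index mentioned earlier, here is a small sketch using SQLite's FTS5 extension. It is an assumption that FTS5 with the default tokenizer is what is used; the `paths_fts` table, the `search` helper, and the sample paths are hypothetical. Matching is token-based (paths are split on characters such as "/" and "."), so it is not arbitrary substring search.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical FTS5 table; requires an SQLite build with FTS5 enabled.
conn.execute("CREATE VIRTUAL TABLE paths_fts USING fts5(path)")
conn.executemany(
    "INSERT INTO paths_fts (path) VALUES (?)",
    [("lib/libz.so.1",), ("include/zlib.h",), ("bin/python3",)],
)


def search(conn, term, prefix=False):
    """Return paths containing the tokens of `term`, optionally as a prefix."""
    # Quote the term so "/" and "." are treated as text, not query syntax.
    query = '"' + term.replace('"', '""') + '"'
    if prefix:
        query += "*"  # FTS5 prefix query: match tokens starting with the term
    rows = conn.execute(
        "SELECT path FROM paths_fts WHERE paths_fts MATCH ? ORDER BY path",
        (query,),
    )
    return [path for (path,) in rows]


print(search(conn, "zlib"))
print(search(conn, "libz.so"))
print(search(conn, "python", prefix=True))
```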

jaimergp commented 3 months ago

@zklaus mentioned https://github.com/conda-forge/staged-recipes/pull/25862, which could be used to reduce storage on the server.

jaimergp commented 3 weeks ago

Progress in https://github.com/jaimergp/conda-forge-paths: the repo now has self-updating releases (assuming it works). A systemd config has also been added on the server, so it updates itself every week. I'll close this once I see a working deployment/release :)