Closed jaimergp closed 2 weeks ago
My findings so far:
The latest state of libcfgraph (last updated on Dec 4th 2023) contains:
We can naively put all the JSONs in a artifact -> JSON blob
table. Using the new JSONB, this takes 46GB uncompressed, 2.5GB zstd. However, finding files in this table is very slow because it needs to scan the JSON blob for each artifact. Adding a virtual JSON index doesn't help much and it increases storage significantly. On the upside, this can store ALL metadata in a very simple way and takes <10 min to populate.
To optimize for file querying, I also created a database with a file_path -> conda_artifacts
table, indexed by file_path
. The conda_artifacts
field is a text field where each line is a conda artifact "route" (channel/subdir/filename). This has a lot of duplication, but the exact queries are blazingly fast. It takes 37GB uncompressed, but compresses nicely to 850MB zstd! We can also add a FTS5 index for the paths, which allows for fast partial searches at a relatively small cost.
The code is available in this repository: https://github.com/jaimergp/conda-forge-paths. I added a GHA workflow, but the runner dies trying to clone libcfgraph 🚀 😂 My plan is to upload a couple of database.zst
files to GH releases and have that a starting point.
Hm, I learnt about RETURNING
and realized we can store the artifact paths on the go at no cost, and instead store the IDs, which should have little cost at query time. I added full-text-search to enable partial searches as well, and didn't change the size significantly. This all means that with this new approach the uncompressed database is only 8.8GB! Compressed size doesn't change much: 634MB.
We also get a new table for free: all the artifacts, and I also stored the timestamps, which will be useful at update time.
The https://github.com/jaimergp/conda-forge-paths repo is now up-to-date, and includes a datasette example.
$ ll path_to_artifacts.*
-rw-r--r-- 1 jrodriguez staff 8.8G Mar 8 16:07 path_to_artifacts.db
-rw-r--r-- 1 jrodriguez staff 634M Mar 8 16:30 path_to_artifacts.tar.zst
Demo search is now available at https://conda-metadata-app.streamlit.app/Search_by_file_path
@zklaus mentioned https://github.com/conda-forge/staged-recipes/pull/25862 which could be used to reduce storage on server.
Progress in https://github.com/jaimergp/conda-forge-paths: repo has self-updating releases (assuming it works) now. A systemd config has been added in the server too, so it updates itself every week. I'll close here once I see a working deployment/release :)
This task consists of building a sqlite database with all the package metadata. Equivalent to the deprecated
regro/libcfgraph:/artifacts
repository.