Quansight-Labs / czi-conda-forge-mgmt

πŸš€ Top level project management for conda-forge CZI grant
https://github.com/orgs/Quansight-Labs/projects/10
BSD 3-Clause "New" or "Revised" License

Design and implement a database for the conda-forge graph and relevant metadata #5

Closed jaimergp closed 4 months ago

jaimergp commented 1 year ago

πŸ“Œ Summary

Migration of existing data sources into a purpose-built database, including a suitable data schema and schema validation.

πŸ“ Background

The metadata supporting the diverse automations at conda-forge is spread across a number of repositories that serve JSON files in a given directory structure.

Depending on the tool, the metadata is used to build a graph. This data is downloaded, and the graph recreated, every time the job runs, adding substantial overhead as conda-forge grows.

Other tools depend on a different presentation of the metadata and are supported by a different repository. This means that the same data is duplicated just to be presented in a different way.

This is neither sustainable nor considerate of the available resources, and we would prefer a single source of truth that runs in a performant way.

πŸš€ Tasks / Deliverables

πŸ“… Estimated completion

This task should be finished in the first 12 months.

ℹ️ References

jaimergp commented 1 year ago

Some updates:

viniciusdc commented 1 year ago

Here's a summary of the work done so far using SQLite (we will move to PostgreSQL later) to create a mock-up table based on a small subset of the libcfgraph repo:

Although the schema will certainly change later, this initial work with SQLite has been a promising start.

As a second task, I am working on finding a good structure for laying out the data in the tables so that querying which package has certain libs/dlls/... would be easier. Specifically, I am trying to adapt what Marius did in previous work.

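For context, a minimal sketch of how such a mock-up could be built, assuming a local checkout of libcfgraph-style JSON files and SQLite (the directory, database file, and column names here are illustrative, not the actual schema):

import json
import sqlite3
from pathlib import Path

# Hypothetical location of a small libcfgraph checkout (illustrative only)
ARTIFACTS_DIR = Path("libcfgraph/artifacts")

conn = sqlite3.connect("mockup.sqlite")
conn.execute(
    "CREATE TABLE IF NOT EXISTS artifacts_raw ("
    "  path TEXT PRIMARY KEY,"  # relative path of the JSON file
    "  payload TEXT"            # full JSON blob, stored as text
    ")"
)

for f in ARTIFACTS_DIR.rglob("*.json"):
    # Parse and re-serialize to validate and normalize the JSON
    payload = json.dumps(json.loads(f.read_text()))
    conn.execute(
        "INSERT OR REPLACE INTO artifacts_raw (path, payload) VALUES (?, ?)",
        (str(f.relative_to(ARTIFACTS_DIR)), payload),
    )
conn.commit()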

viniciusdc commented 1 year ago

A quick follow-up on the last part of the previous answer: something I've noticed so far is that if we're looking for faster querying of the JSON blobs directly, as in Marius's notebook linked above, having a single table containing all the artifacts might not be the best layout, because it could lead to complex queries and add considerable overhead.

So I've found two directions so far:

-- One row per artifact; the JSON payload lives in a separate table
CREATE TABLE artifacts (
  id INTEGER PRIMARY KEY,
  package_name TEXT,
  channel TEXT,
  path TEXT,
  platform TEXT,
  hash TEXT,
  payload_id INTEGER,
  FOREIGN KEY (payload_id) REFERENCES payloads(id)
);

-- Raw JSON blobs, referenced from artifacts via payload_id
CREATE TABLE payloads (
  id INTEGER PRIMARY KEY,
  payload JSON
);
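To make the trade-off concrete, a minimal sketch (assuming SQLite and the two tables above; the database file name and the queried JSON field are illustrative) of how a lookup against the split layout could go, using a JOIN plus SQLite's built-in json_extract:

import sqlite3

conn = sqlite3.connect("cf-graph.sqlite")

# Find linux-64 artifacts whose JSON payload declares a given "name" field;
# the payload lives in a separate table and is reached through payload_id.
rows = conn.execute(
    """
    SELECT a.package_name, a.path
    FROM artifacts AS a
    JOIN payloads AS p ON p.id = a.payload_id
    WHERE a.platform = 'linux-64'
      AND json_extract(p.payload, '$.name') = ?
    """,
    ("numpy",),
).fetchall()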

Which approach we choose will depend on the specific needs of our application, but it's worth considering all options before making a decision.

viniciusdc commented 1 year ago

Hey there!

I was trying to figure out the best way to manage and query some import_to_pkg_maps files. I tested out two different approaches:

Here's what I found!

Note: for each query, I inspected the values of the JSON blobs looking for the element "access".

to = time.time(); p.loc[p['field_value'] == 'access']; print(f"Took {(time.time() - to) * 1000} ms")
Took 24.831056594848633 ms
index package_id package_name content_id field_name field_value
294131 5221 acces 294132 access access
294132 5221 acces 294133 access.access access
294133 5221 acces 294134 access.datasets access
294134 5221 acces 294135 access.fca access
294135 5221 acces 294136 access.helpers access
294136 5221 acces 294137 access.raam access
294137 5221 acces 294138 access.weights access
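For comparison, a sketch of how the same lookup could be written directly against the JSON blobs in SQLite using json_each, rather than the flattened pandas table above (the table and column names are assumptions for illustration, not the schema used in the test):

import sqlite3
import time

conn = sqlite3.connect("import_maps.sqlite")

t0 = time.time()
# Walk every top-level key of each JSON blob and keep rows whose value is "access";
# this is the blob-oriented counterpart of the flattened-table query above.
rows = conn.execute(
    """
    SELECT m.package_name, j.key, j.value
    FROM import_maps AS m, json_each(m.payload) AS j
    WHERE j.value = 'access'
    """
).fetchall()
print(f"Took {(time.time() - t0) * 1000:.2f} ms")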
dhirschfeld commented 1 year ago

As a duckdb fan I'm wondering if that wouldn't be a better choice.

dhirschfeld commented 1 year ago

Just popping up here to ask if it would be possible/feasible to have an insert_timestamp column for reproducibility so that you could query the database for data only entered before a specified timestamp.

In that way you could ensure the result of your query would always be the same if you specified an insert_timestamp <= '<some-date>'. If data is updated after the fact (e.g. repodata patches) it can make it difficult to reproduce previous environments / analytics results.
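For illustration, a minimal sketch of what that could look like in SQLite (the table and column names are hypothetical; the point is only the insert_timestamp <= '<some-date>' filter):

import sqlite3

conn = sqlite3.connect("cf-graph.sqlite")

# Hypothetical table: record when each row was inserted so that
# queries can be pinned to a point in time.
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS artifacts_history (
      package_name TEXT,
      path TEXT,
      insert_timestamp TEXT DEFAULT (datetime('now'))
    )
    """
)

# Reproducible query: only consider rows inserted before a given timestamp.
rows = conn.execute(
    """
    SELECT package_name, path
    FROM artifacts_history
    WHERE insert_timestamp <= '2023-06-01 00:00:00'
    """
).fetchall()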

jaimergp commented 1 year ago

Hi @dhirschfeld! We are not storing repodata stuff in this effort. AFAIK the timestamp field is rarely patched so it should work regardless. You can also get the unpatched repodata if you download repodata_from_packages.json, instead of the regular repodata.json :D
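For reference, a minimal sketch of fetching the unpatched repodata for one subdir (the URL follows the usual conda-forge channel layout; adjust the subdir as needed):

import json
import urllib.request

# Unpatched repodata for the linux-64 subdir of conda-forge;
# other subdirs follow the same pattern.
url = "https://conda.anaconda.org/conda-forge/linux-64/repodata_from_packages.json"
with urllib.request.urlopen(url) as resp:
    repodata = json.load(resp)

print(len(repodata["packages"]))  # number of .tar.bz2 packages listed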

dhirschfeld commented 1 year ago

The point is you don't know when the repodata was patched. If you run an install command before the patch you'll get a different result than after.

For reproducibility you want to be able to say "create an environment exactly the same as would have been created if I had run the command at <specific-timestamp>". AFAIK that's not possible with the current repodata/patches mechanism? However, it would be easy if the data were stored in a database with the insert timestamp recorded.

...but maybe that's not what this issue is about, if so, apologies for the noise 😬!

Edit: It's only a nice-to-have but not so important if you have a lock file for your environment.

viniciusdc commented 1 year ago

General feedback from some exploration of the data stored in feedstock-outputs: I am confident that trie and k-tree-like structures generally work well for a from-scratch implementation that relies only on those mechanisms. If we move to an SQL-based database, though, I am starting not to see the benefit of replicating that layout in the tables, as such structures usually don't carry the same efficiency over to plain relational databases (there is an excellent short post on the subject).

Anyhow, I was able to test two main approaches (I tried a lot more, but I will keep to two models here):

In general, their sizes ended up in the same ballpark:

$ du -sh /home/vcerutti/Conda-Forge/feedstock-outputs/outputs
98M     /home/vcerutti/Conda-Forge/feedstock-outputs/outputs

$ du -sh test7.py.sqlite
3,2M    test7.py.sqlite

$ du -sh example3.db
1,2M    example3.db

I still haven't explored much how we would create the APIs for each example (though it's more straightforward to do than using SQLAlchemy schemas as the entry points), so I can't say much yet about query performance, as I was focused on reducing the amount of space required to store this data.

I will follow up next week with some examples.
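For the record, a minimal sketch of what an SQLAlchemy entry point for the earlier artifacts/payloads layout might look like (the model and column names mirror the CREATE TABLE sketch above and are assumptions, not a final schema):

from sqlalchemy import Column, ForeignKey, Integer, Text, create_engine
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Payload(Base):
    __tablename__ = "payloads"
    id = Column(Integer, primary_key=True)
    payload = Column(Text)  # raw JSON blob

class Artifact(Base):
    __tablename__ = "artifacts"
    id = Column(Integer, primary_key=True)
    package_name = Column(Text)
    channel = Column(Text)
    path = Column(Text)
    platform = Column(Text)
    hash = Column(Text)
    payload_id = Column(Integer, ForeignKey("payloads.id"))
    payload = relationship(Payload)

engine = create_engine("sqlite:///cf-graph.sqlite")
Base.metadata.create_all(engine)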

beckermr commented 1 year ago

Ah, I see. Note that quetz already has a plug-in to do this. We might look at their table structure and indices.

jaimergp commented 1 year ago

Oh, nice! I see there are a few interesting resources:

jaimergp commented 1 year ago

Hi @viniciusdc, I am writing this here also as a note to myself. I just realized we had written down conda-suggest-conda-forge as a potential database table too (see references in the 1st message).

The good thing is that we can generate those from the files-to-artifact map without an issue. The logic comes from this package.
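As a rough sketch of that idea (assuming a files-to-artifacts style table with one row per file path per package; the table and column names are illustrative, and the real logic lives in the package linked above), a conda-suggest-style map boils down to executable name to package:

import sqlite3
from pathlib import PurePosixPath

conn = sqlite3.connect("cf-graph.sqlite")

# Hypothetical files table: one row per (package_name, file path in the artifact).
suggest_map = {}
for package_name, path in conn.execute(
    "SELECT package_name, path FROM artifact_files WHERE path LIKE 'bin/%'"
):
    exe = PurePosixPath(path).name
    suggest_map.setdefault(exe, set()).add(package_name)

# e.g. suggest_map.get("curl") -> {"curl"}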

jaimergp commented 5 months ago

Idea: we can serve the database from a small VM in open-gpu-server. It'd be easily redeployable. This can serve the files-to-artifacts search endpoint, I'd say?
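To make the idea concrete, a minimal sketch of what such a files-to-artifacts search endpoint could look like (assuming FastAPI and the SQLite layout sketched earlier; the endpoint path, database location, and table names are hypothetical):

import sqlite3

from fastapi import FastAPI

app = FastAPI()
DB_PATH = "cf-graph.sqlite"  # hypothetical location on the VM


@app.get("/search/files")
def search_files(path: str, limit: int = 50):
    # Look up which artifacts ship a file matching the given path fragment.
    conn = sqlite3.connect(DB_PATH)
    rows = conn.execute(
        "SELECT package_name, path FROM artifact_files WHERE path LIKE ? LIMIT ?",
        (f"%{path}%", limit),
    ).fetchall()
    conn.close()
    return [{"package_name": name, "path": p} for name, p in rows]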

beckermr commented 5 months ago

Def a good idea for testing. I am nervous about hosting services. We already have too much.

jaimergp commented 4 months ago

OK, so next steps on this almost-finished item:

Closing this one for actionability πŸš€