Quansight-Labs / czi-conda-forge-mgmt

πŸš€ Top level project management for conda-forge CZI grant
https://github.com/orgs/Quansight-Labs/projects/10
BSD 3-Clause "New" or "Revised" License

Design and implement a database for the conda-forge graph and relevant metadata #5

Closed jaimergp closed 4 months ago

jaimergp commented 1 year ago

πŸ“Œ Summary

Migration of existing data sources into a purpose-built database, including a suitable data schema and schema validation.

πŸ“ Background

The metadata supporting the diverse automations at conda-forge is spread across a number of repositories that serve JSON files in a given directory structure.

Depending on the tool, the metadata is used to build a graph. This data is downloaded, and the graph recreated, every time the job runs, adding substantial overhead as conda-forge grows.

Other tools depend on a different presentation of the metadata and are supported by a different repository. This means that the same data is duplicated just to be presented in a different way.

This is neither sustainable nor considerate of the available resources, and we would prefer a single source of truth that runs in a performant way.

πŸš€ Tasks / Deliverables

πŸ“… Estimated completion

This task should be finished in the first 12 months.

ℹ️ References

jaimergp commented 1 year ago

Some updates:

viniciusdc commented 1 year ago

Here's a summary of the work done so far using SQLite (we will move to PostgreSQL later) to create a mock-up table based on a small subset of the libcfgraph repo:

Although the schema will certainly change later, this initial work with SQLite has been a promising start.

As a second task, I am working on finding a good structure for laying out the data in the tables so that querying which package has certain libs/dlls/... would be easier. Specifically, I am trying to adapt what Marius did in previous work.

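For context, a minimal sketch of how such a mock-up could be built, assuming a local checkout of libcfgraph-style JSON files and SQLite (the directory, database file, and column names here are illustrative, not the actual schema):

import json
import sqlite3
from pathlib import Path

# Hypothetical location of a small libcfgraph checkout (illustrative only)
ARTIFACTS_DIR = Path("libcfgraph/artifacts")

conn = sqlite3.connect("mockup.sqlite")
conn.execute(
    "CREATE TABLE IF NOT EXISTS artifacts_raw ("
    "  path TEXT PRIMARY KEY,"  # relative path of the JSON file
    "  payload TEXT"            # full JSON blob, stored as text
    ")"
)

for f in ARTIFACTS_DIR.rglob("*.json"):
    # Parse and re-serialize to validate and normalize the JSON
    payload = json.dumps(json.loads(f.read_text()))
    conn.execute(
        "INSERT OR REPLACE INTO artifacts_raw (path, payload) VALUES (?, ?)",
        (str(f.relative_to(ARTIFACTS_DIR)), payload),
    )
conn.commit()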

viniciusdc commented 1 year ago

A quick follow-up on the last part of the previous answer: something I've noticed so far is that if we're looking for faster querying of the JSON blobs directly, as in Marius's notebook linked above, having a single table containing all the artifacts might not be the best layout, because it could lead to complex queries and add considerable overhead.

So I've found two directions so far:

-- One row per artifact; the JSON payload lives in a separate table
CREATE TABLE artifacts (
  id INTEGER PRIMARY KEY,
  package_name TEXT,
  channel TEXT,
  path TEXT,
  platform TEXT,
  hash TEXT,
  payload_id INTEGER,
  FOREIGN KEY (payload_id) REFERENCES payloads(id)
);

-- Raw JSON blobs, referenced from artifacts via payload_id
CREATE TABLE payloads (
  id INTEGER PRIMARY KEY,
  payload JSON
);
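To make the trade-off concrete, a minimal sketch (assuming SQLite and the two tables above; the database file name and the queried JSON field are illustrative) of how a lookup against the split layout could go, using a JOIN plus SQLite's built-in json_extract:

import sqlite3

conn = sqlite3.connect("cf-graph.sqlite")

# Find linux-64 artifacts whose JSON payload declares a given "name" field;
# the payload lives in a separate table and is reached through payload_id.
rows = conn.execute(
    """
    SELECT a.package_name, a.path
    FROM artifacts AS a
    JOIN payloads AS p ON p.id = a.payload_id
    WHERE a.platform = 'linux-64'
      AND json_extract(p.payload, '$.name') = ?
    """,
    ("numpy",),
).fetchall()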

Which approach we choose will depend on the specific needs of our application, but it's worth considering all options before making a decision.

viniciusdc commented 1 year ago

Hey there!

I was trying to figure out the best way to manage and query some import_to_pkg_maps files. I tested out two different approaches:

Here's what I found!

Note: for each query, I inspected the values of the JSON blobs looking for the element "access".

to = time.time(); p.loc[p['field_value'] == 'access']; print(f"Took {(time.time() - to) * 1000} ms")
Took 24.831056594848633 ms
index package_id package_name content_id field_name field_value
294131 5221 acces 294132 access access
294132 5221 acces 294133 access.access access
294133 5221 acces 294134 access.datasets access
294134 5221 acces 294135 access.fca access
294135 5221 acces 294136 access.helpers access
294136 5221 acces 294137 access.raam access
294137 5221 acces 294138 access.weights access
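For comparison, a sketch of how the same lookup could be written directly against the JSON blobs in SQLite using json_each, rather than the flattened pandas table above (the table and column names are assumptions for illustration, not the schema used in the test):

import sqlite3
import time

conn = sqlite3.connect("import_maps.sqlite")

t0 = time.time()
# Walk every top-level key of each JSON blob and keep rows whose value is "access";
# this is the blob-oriented counterpart of the flattened-table query above.
rows = conn.execute(
    """
    SELECT m.package_name, j.key, j.value
    FROM import_maps AS m, json_each(m.payload) AS j
    WHERE j.value = 'access'
    """
).fetchall()
print(f"Took {(time.time() - t0) * 1000:.2f} ms")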
dhirschfeld commented 1 year ago

As a duckdb fan I'm wondering if that wouldn't be a better choice.

dhirschfeld commented 1 year ago

Just popping up here to ask if it would be possible/feasible to have an insert_timestamp column for reproducibility so that you could query the database for data only entered before a specified timestamp.

In that way you could ensure the result of your query would always be the same if you specified an insert_timestamp <= '<some-date>'. If data is updated after the fact (e.g. repodata patches) it can make it difficult to reproduce previous environments / analytics results.
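For illustration, a minimal sketch of what that could look like in SQLite (the table and column names are hypothetical; the point is only the insert_timestamp <= '<some-date>' filter):

import sqlite3

conn = sqlite3.connect("cf-graph.sqlite")

# Hypothetical table: record when each row was inserted so that
# queries can be pinned to a point in time.
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS artifacts_history (
      package_name TEXT,
      path TEXT,
      insert_timestamp TEXT DEFAULT (datetime('now'))
    )
    """
)

# Reproducible query: only consider rows inserted before a given timestamp.
rows = conn.execute(
    """
    SELECT package_name, path
    FROM artifacts_history
    WHERE insert_timestamp <= '2023-06-01 00:00:00'
    """
).fetchall()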

jaimergp commented 1 year ago

Hi @dhirschfeld! We are not storing repodata stuff in this effort. AFAIK the timestamp field is rarely patched so it should work regardless. You can also get the unpatched repodata if you download repodata_from_packages.json, instead of the regular repodata.json :D
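For reference, a minimal sketch of fetching the unpatched repodata for one subdir (the URL follows the usual conda-forge channel layout; adjust the subdir as needed):

import json
import urllib.request

# Unpatched repodata for the linux-64 subdir of conda-forge;
# other subdirs follow the same pattern.
url = "https://conda.anaconda.org/conda-forge/linux-64/repodata_from_packages.json"
with urllib.request.urlopen(url) as resp:
    repodata = json.load(resp)

print(len(repodata["packages"]))  # number of .tar.bz2 packages listed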

dhirschfeld commented 1 year ago

The point is you don't know when the repodata was patched. If you run an install command before the patch you'll get a different result than after.

For reproducibility you want to be able to say "create an environment exactly the same as would have been created if I had run the command at <specific-timestamp>". AFAIK that's not possible with the current repodata/patches mechanism? However, it would be easy if the data were stored in a database with the insert timestamp recorded.

...but maybe that's not what this issue is about, if so, apologies for the noise 😬!

Edit: It's only a nice-to-have but not so important if you have a lock file for your environment.

viniciusdc commented 1 year ago

General feedback from some exploration of the data stored in feedstock-outputs: I am confident that trie and k-tree-like structures generally work well for a from-scratch implementation that relies only on those mechanisms. If we move to an SQL-based database, though, I am starting not to see the benefit of replicating that layout in the tables, as such structures usually don't carry the same efficiency over to plain relational databases (there is an excellent short post on the subject).

Anyhow, I was able to test two main approaches (I tried a lot more, but I will keep to two models here):

In general, their sizes ended up in the same ballpark:

$ du -sh /home/vcerutti/Conda-Forge/feedstock-outputs/outputs
98M     /home/vcerutti/Conda-Forge/feedstock-outputs/outputs

$ du -sh test7.py.sqlite
3,2M    test7.py.sqlite

$ du -sh example3.db
1,2M    example3.db

I still haven't explored much how we would create the APIs for each example (though it's more straightforward to do than using SQLAlchemy schemas as the entry points), so I can't say much yet about query performance, as I was focused on reducing the amount of space required to store this data.

I will follow up next week with some examples.
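For the record, a minimal sketch of what an SQLAlchemy entry point for the earlier artifacts/payloads layout might look like (the model and column names mirror the CREATE TABLE sketch above and are assumptions, not a final schema):

from sqlalchemy import Column, ForeignKey, Integer, Text, create_engine
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Payload(Base):
    __tablename__ = "payloads"
    id = Column(Integer, primary_key=True)
    payload = Column(Text)  # raw JSON blob

class Artifact(Base):
    __tablename__ = "artifacts"
    id = Column(Integer, primary_key=True)
    package_name = Column(Text)
    channel = Column(Text)
    path = Column(Text)
    platform = Column(Text)
    hash = Column(Text)
    payload_id = Column(Integer, ForeignKey("payloads.id"))
    payload = relationship(Payload)

engine = create_engine("sqlite:///cf-graph.sqlite")
Base.metadata.create_all(engine)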

beckermr commented 1 year ago

Ah, I see. Note that quetz already has a plug-in to do this. We might look at their table structure and indices.

jaimergp commented 1 year ago

Oh, nice! I see there are a few interesting resources:

jaimergp commented 1 year ago

Hi @viniciusdc, I am writing this here also as a note to myself. I just realized we had written down conda-suggest-conda-forge as a potential database table too (see references in the 1st message).

The good thing is that we can generate those from the files-to-artifact map without an issue. The logic comes from this package.
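As a rough sketch of that idea (assuming a files-to-artifacts style table with one row per file path per package; the table and column names are illustrative, and the real logic lives in the package linked above), a conda-suggest-style map boils down to executable name to package:

import sqlite3
from pathlib import PurePosixPath

conn = sqlite3.connect("cf-graph.sqlite")

# Hypothetical files table: one row per (package_name, file path in the artifact).
suggest_map = {}
for package_name, path in conn.execute(
    "SELECT package_name, path FROM artifact_files WHERE path LIKE 'bin/%'"
):
    exe = PurePosixPath(path).name
    suggest_map.setdefault(exe, set()).add(package_name)

# e.g. suggest_map.get("curl") -> {"curl"}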

jaimergp commented 5 months ago

Idea: we can serve the database from a small VM in open-gpu-server. It'd be easily redeployable. This can serve the files-to-artifacts search endpoint, I'd say?
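To make the idea concrete, a minimal sketch of what such a files-to-artifacts search endpoint could look like (assuming FastAPI and the SQLite layout sketched earlier; the endpoint path, database location, and table names are hypothetical):

import sqlite3

from fastapi import FastAPI

app = FastAPI()
DB_PATH = "cf-graph.sqlite"  # hypothetical location on the VM


@app.get("/search/files")
def search_files(path: str, limit: int = 50):
    # Look up which artifacts ship a file matching the given path fragment.
    conn = sqlite3.connect(DB_PATH)
    rows = conn.execute(
        "SELECT package_name, path FROM artifact_files WHERE path LIKE ? LIMIT ?",
        (f"%{path}%", limit),
    ).fetchall()
    conn.close()
    return [{"package_name": name, "path": p} for name, p in rows]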

beckermr commented 5 months ago

Def a good idea for testing. I am nervous about hosting services. We already have too much.

jaimergp commented 4 months ago

OK, so next steps on this almost-finished item:

Closing this one for actionability πŸš€