abitrolly / fedora-packages-static

Static version of https://pagure.io/fedora-packages-static
https://abitrolly.github.io/fedora-packages-static/
GNU General Public License v3.0

Add branch for historical data points #9

Open abitrolly opened 2 years ago

abitrolly commented 2 years ago

It would be convenient to store historical info in git as well.

Last refreshed on 2021-05-31 (62240 packages).

Right now, to get a graph of how many packages there were over time, one needs to fetch the whole repository and then process the git diff for index.html with some script. That is not convenient.

The solution is to have a branch named data with stats added after each CI build. The stats should probably go into separate files to prevent merge conflicts.
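Roughly, each CI build could end with something like this sketch (the branch name, file layout, and $PKG_COUNT variable are placeholders, nothing is decided yet):

```bash
# Sketch: append one stats file per CI build to a dedicated "data" branch.
# Branch name, file layout, and $PKG_COUNT are placeholders.
git fetch origin data
git checkout data
mkdir -p stats
# One file per build; a unique name avoids merge conflicts between builds.
echo "{\"date\": \"$(date -I)\", \"packages\": $PKG_COUNT}" > "stats/$(date -I).json"
git add stats
git commit -m "Stats for $(date -I)"
git push origin data
```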

@vsoch can you recommend a simple way/scheme for the task? I remember seeing some of your experiments with managing datasets in git.

vsoch commented 2 years ago

Not knowing the details of the project: the workflow can be fairly simple. Check out some data branch ref, generate the summary file for a point in time (this will likely require cloning the whole repository), and then add, commit, and push. As for the data type and organization of that branch, it's largely up to you and what you might want to do with it. If you want data that can render into a site, you could make a Jekyll collection and put the data in markdown front matter to render into the site. If you want to mimic a static API, you can just dump into json and have some known scheme for getting a particular page.
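For instance, a hypothetical static-API layout (none of these paths come from an actual project) could be one JSON file per timepoint plus an index listing them:

```bash
# Hypothetical layout: stats/2021-05-31.json, stats/2021-06-01.json, ...
# Rebuild stats/index.json so a consumer can discover available timepoints,
# then fetch stats/<date>.json directly.
ls stats/*.json | grep -v index | sort \
  | sed 's|stats/||; s|\.json$||' \
  | jq -R . | jq -s '{timepoints: .}' > stats/index.json
```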

Some examples I have (that might inspire) are the caliper metrics, which take the approach of organizing by UI, dumping json named by the metric, and then adding an index.json to summarize it. More complex is the RSEpedia, which generates a site with Flask and then saves the static pages (API and interfaces) there. Or any of these that use the GitHub API to do some search and then save results in some organized fashion to render from the repo.

I haven't used it, but you could also look into a tool to handle this for you like Datalad.

Hope that helps! Let me know if you have specific questions.

abitrolly commented 2 years ago

Thanks for the quick feedback. I need to run, but here are a few quick comments.

https://github.com/abitrolly/fedora-packages-static/blob/8787cbfb0dd0ed957996592465751c2bf7a110ce/.github/workflows/manual.yml#L46-L50

vsoch commented 2 years ago

Oh, that makes sense! So watchme is probably different in terms of goals: with watchme, the idea is that you can store files in git over time and then assemble all the results in one place upon extraction. E.g., git timepoint A has a json structure with value 2, timepoint B has 3, and the extracted result is [2, 3]. But it sounds like you want to be able to just get the entire summary statistics for a given timepoint. You are correct that you'll hit a point of failure if you have, say, 100 of these running at once, all trying to fetch and push; it's hugely likely that upstream will change in the process and then the push will fail. Even for watchme, trying to run extractions on a cluster was a massive failure, because git just isn't intended to work that way.
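As a concrete illustration of the extraction idea (plain git commands, not the actual watchme interface; the stats/latest.json path and .packages field are made up):

```bash
# Walk the data branch oldest-to-newest and collect one value per commit,
# assembling the time series [2, 3, ...].
for sha in $(git rev-list --reverse data -- stats/latest.json); do
  git show "$sha:stats/latest.json" | jq '.packages'
done
```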

Are you able to have some control over how many are running / pushing? If you can keep that number small, I think it's reasonable to try; you just need to come up with a directory organization that ensures things stay modular.

abitrolly commented 2 years ago

Yeah, I'm thinking about a two-step CI/CD pipeline:

[1] ---\
[2] ----[4]----[5]
[3] ---/

[1], [2], [3] write raw data to the data/ folder in parallel data-* branches, using a non-conflicting taskname-datetime-random file naming format. When they have finished, a job [4] merges these branches into the data branch and processes the raw data to update the static JSON datasets.

This way there is always a static dataset in a single JSON file that can be used without additional filesystem loops.
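A rough sketch of what job [4] could do (branch names, file layout, and the jq aggregation step are assumptions, not settled choices):

```bash
# Job [4] (sketch): merge the parallel data-* branches into the data branch
# and rebuild the aggregated dataset.
git fetch origin
git checkout data
for branch in $(git branch -r --list 'origin/data-*'); do
  git merge --no-edit "$branch"   # per-task files don't conflict by construction
done
# Collapse all raw per-task JSON files into one static dataset.
jq -s '.' data/*.json > packages-over-time.json
git add packages-over-time.json
git commit -m "Update aggregated dataset"
git push origin data
```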