abitrolly opened this issue 2 years ago
Not knowing the details of the project - the workflow can be fairly simple: check out some data branch ref, generate the summary file for a point in time (this will likely require cloning the whole repository), and then add, commit, and push. The data type and organization of that branch are largely up to you, and depend on what you want to do with it. If you want data that can render into a site, you could make a Jekyll collection and put the data in markdown front matter to render into the site. If you want to mimic a static API, you can just dump JSON and have some known scheme for getting a particular page.
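A minimal sketch of the static-API idea - all names here (`dump_static_api`, the `api/` directory, the metric names) are hypothetical, just one possible scheme:

```python
import json
from pathlib import Path

def dump_static_api(metrics, outdir="api"):
    """Write one JSON file per metric, plus an index.json listing them.

    `metrics` maps a metric name to its data. The layout
    (api/<name>.json discovered via api/index.json) is the "known
    scheme" a consumer would use to fetch a particular page.
    """
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    for name, data in metrics.items():
        (out / f"{name}.json").write_text(json.dumps(data, indent=2))
    # index.json is the entry point that lists all available pages
    (out / "index.json").write_text(json.dumps(sorted(metrics), indent=2))

dump_static_api({"packages": {"count": 42}})
```

Anything that can serve static files (e.g. GitHub Pages) can then act as the "API".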
Some examples I have (that might inspire you): the caliper metrics, which organize by UI and dump JSON named by the metric, plus an index.json to summarize it. More complex is the RSEpedia, which generates a site with Flask and then saves the static pages (API and interfaces) there. Or any of these that use the GitHub API to do some search and then save the results in some organized fashion to render from the repo:
I haven't used it, but you could also look into a tool to handle this for you like Datalad.
Hope that helps! Let me know if you have specific questions.
Thanks for the quick feedback. I need to run, so just a few quick comments.
`watchme` is the tool that I remember. Right.
The output produced by sources in this repo is already automatically uploaded to the `gh-pages` branch.
A `data` branch can be updated the same way. The only concern I have is about the directory layout structure, so that parallel steps won't conflict on commits. Maybe I am overthinking the problem while trying to eliminate all these points of failure.

Oh that makes sense! So watchme is probably different in terms of goals - with watchme the idea is that you can store files in git over time, and then assemble all the results in one place upon an extraction. E.g., git timepoint A has a json structure with value 2, git timepoint B has 3, and the extracted result will be [2,3]. But it sounds like you want to be able to just get the entire summary statistics for a given timepoint. You are correct that you'll hit a point of failure if you have, say, 100 of these running at once, all trying to fetch and push - it's hugely likely that upstream will change in the process and then the push will fail. Even for watchme, trying to run extractions on a cluster was a massive failure, because git just isn't intended to work that way.
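The extraction idea can be sketched like this (a hypothetical helper, not the actual watchme API - the real thing walks git history to collect each timepoint's file):

```python
def assemble(timepoints, key):
    """Collect one key's value across git timepoints into a series.

    `timepoints` stands in for the JSON structures stored at each
    commit, in chronological order.
    """
    return [snapshot[key] for snapshot in timepoints if key in snapshot]

# timepoint A has value 2, timepoint B has 3 -> assembled series [2, 3]
series = assemble([{"value": 2}, {"value": 3}], "value")
print(series)  # -> [2, 3]
```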
Are you able to exert some control over how many are running / pushing at once? If you think it's a reasonable number, it's worth a try - you just need to come up with a directory organization that keeps things modular.
Yea, I am thinking about a two-stage CI/CD pipeline.
```
[1]----\
[2]-----[4]-----[5]
[3]----/
```
[1], [2], [3] write raw data to the `data/` folder in parallel `data-*` branches. They use a non-conflicting `taskname-datetime-random` naming format. When they have finished, job [4] merges these branches into the `data` branch and processes the raw data to update static JSON datasets.
This way there is always a static dataset in a single JSON file that can be used without additional filesystem loops.
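Job [4] could be sketched roughly like this - assuming raw files named `taskname-datetime-random.json` under `data/`, and a single output file `dataset.json` (all names here are assumptions):

```python
import json
from pathlib import Path

def update_dataset(raw_dir="data", dataset="dataset.json"):
    """Fold every raw taskname-datetime-random.json file into one
    static JSON dataset, so consumers read a single file instead of
    looping over the filesystem."""
    records = []
    # sorted() makes the merge order deterministic across runs
    for path in sorted(Path(raw_dir).glob("*.json")):
        records.append(json.loads(path.read_text()))
    Path(dataset).write_text(json.dumps(records, indent=2))
    return records
```

Because [1]-[3] never touch the same filenames, the merge in [4] is conflict-free by construction.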
It would be convenient to store historical info in git as well.
Right now, to get a graph of how many packages were there over time, one needs to fetch the whole repository and then process `git diff` for `index.html` with some script. That is not convenient. The solution is to have a branch named `data` with stats added after each CI build. Stats should probably be in separate files to prevent merge conflicts.

@vsoch can you recommend a simple way/scheme for the task? I remember I've seen some of your experiments with managing datasets in `git`.
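One simple scheme for "separate files to prevent merge conflicts": each CI build writes its own stats file, named by build id and timestamp so no two builds ever share a filename (the function and naming convention here are just an assumption, not a recommendation from the thread):

```python
import json
import time
from pathlib import Path

def write_build_stats(stats, build_id, stats_dir="stats"):
    """Write one CI build's stats to its own file on the data branch.

    Because every build produces a unique filename, committing these
    files from parallel builds can never produce a content conflict.
    """
    out = Path(stats_dir)
    out.mkdir(parents=True, exist_ok=True)
    name = f"{build_id}-{time.strftime('%Y%m%dT%H%M%S')}.json"
    path = out / name
    path.write_text(json.dumps(stats, indent=2))
    return path
```

A later job can then glob `stats/*.json` in chronological order to rebuild the packages-over-time graph without diffing `index.html`.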