Open xaur opened 5 years ago
Have done a little research on this while working on the git/contributor analysis project. Thought I would share my general findings. Note: this is just what showed up in my Google searches; there could be more out there.
Github help page on backing up repositories, https://help.github.com/articles/backing-up-a-repository/
There’s a (paid, starting $9/mo) 3rd party service promoted by GitHub that backs up repos daily. Presumably we do not want this, and if MS ever starts forcing people towards it (e.g. by restricting access to data via the GitHub API), presumably we want to start looking for alternatives. https://github.com/marketplace/backhub
A python tool that claims to back up everything, https://github.com/josegonzalez/python-github-backup Notes: It’s about a year old, and seems fairly actively maintained. Documentation scarce.
A bash script that backs up an entire org and associated repos, https://gist.github.com/rodw/3073987 Notes: This is a couple years old, but keeps getting commented on and added to. Last comment was in Sept.
Perl script that backs up GitHub repos, http://blogs.perl.org/users/steve_bertrand/2018/02/easily-back-up-your-github-repositories-andor-issues.html Backs up repos and Issues, but not sure it gets everything else. Also, Perl...
Related. In case we ever want to switch, GitLab has an import tool that looks like it gets pretty much everything, https://docs.gitlab.com/ee/user/project/import/github.html
Thanks for sharing these @s-ben, I'll have a look on next opportunity.
Also, Perl...
lol, good one.
For now I have been saving issue pages with this quickly hacked wget script, but it is bad in so many ways: error-prone, slow, does not update incrementally, etc.
#!/usr/bin/env bash
# Fetch the rendered issue pages as HTML, waiting 1 second between requests
# and logging to a timestamped file.
wget --timestamping \
--adjust-extension \
--wait=1 \
--output-file "$(date --utc "+%Y%m%d%H%M%S").log" \
"https://github.com/xaur/decred-issues/issues/"{1..32}
wget. Use a shotgun, gonna be messy:)
Yeah, I figure we can get one of these backup scripts working without too much effort.
Experimented with backing up issue texts to Git manually:
https://github.com/xaur/decred-issues-backup/tree/issues/issues
Maintaining it manually is too much hassle, even for me. With some motivation I think I could do it, but I can't imagine too many people engaging in this joyful practice with me.
A lot of good stuff is posted in comments and must be preserved too. Technically, maintainers of the tracker can press the Edit button and copy-paste the source Markdown elsewhere, but manually processing the growing number of comments from everybody would require superpowers.
This stuff needs automation.
@s-ben is it possible to extract all issues, their comments, and all past versions of those comments via the GitHub API?
@xaur it is definitely possible. The Decred contributor tracker that @degeri and I built currently pulls a range of events from any repo we point it at and stores them all in a local database (it could be in the cloud if we wanted as well).
However, there is already this archival project that looks like it stores all our stuff (it stores everything with an open source license). RR has been querying Decred's repos for research using Google's BigQuery tool, which can do quick searches on this data. https://www.gharchive.org/
GitLab also has a fairly robust tool for stealing GitHub (MS) business by transferring everything automatically from GitHub to GitLab.
Ours is @RichardRed0x :) Sorry for pinging the wrong person.
@s-ben good to know. So if those 'events' contain the full comment text, it should be possible to pull all historic versions of all comments, save them in e.g. separate files, and programmatically commit those files to a Git repo. This way it's possible to implement an append-only repository with an archive of all messages in the issue tracker that anybody can sync with a simple git clone. I also assume it is possible to fetch events in a specified time range; this could be used to implement incremental updates of that archive repo.
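Roughly, I imagine something like the sketch below (untested, just curl + jq against the public GitHub REST API; the repo name, directory layout and file naming are only placeholders):

#!/usr/bin/env bash
# Rough sketch: dump every issue and comment of a repo to per-item JSON files.
# Assumes curl and jq are available; repo name and paths are placeholders.
# Unauthenticated requests are rate-limited; add -H "Authorization: token $GITHUB_TOKEN" if needed.
repo="xaur/decred-issues"
api="https://api.github.com/repos/$repo"
mkdir -p archive/issues archive/comments

fetch_pages() {
  # $1 = endpoint with its first query parameter, e.g. "issues?state=all"
  local page=1
  while true; do
    local resp
    resp=$(curl -s "$api/$1&per_page=100&page=$page")
    if [ "$(echo "$resp" | jq 'length')" -eq 0 ]; then break; fi
    echo "$resp" | jq -c '.[]'
    page=$((page + 1))
  done
}

# Issues, one file per issue (the issues API also returns pull requests).
fetch_pages "issues?state=all" | while read -r item; do
  echo "$item" | jq '.' > "archive/issues/$(echo "$item" | jq '.number').json"
done

# All issue comments in the repo, one file per comment.
# Add "&since=<ISO 8601 timestamp>" to fetch only comments updated since the last run.
fetch_pages "issues/comments?sort=updated" | while read -r item; do
  echo "$item" | jq '.' > "archive/comments/$(echo "$item" | jq '.id').json"
done

# Commit the snapshot (assuming archive/ was turned into a git repo with `git init` once).
git -C archive add -A
git -C archive commit -m "snapshot $(date --utc +%Y%m%d%H%M%S)"

Re-running it just overwrites the files and git keeps every previous version, so the history effectively becomes the append-only archive.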
@xaur this is definitely possible. On the GH Archive homepage they basically suggest you can do something like this:
Each archive contains JSON encoded events as reported by the GitHub API. You can download the raw data and apply own processing to it - e.g. write a custom aggregation script, import it into a database, and so on!
I think it's just a matter of how many resources this would take, and whether we want to spend them.
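For reference, pulling one hour of GH Archive data and keeping only our repo's events could be as simple as something like this (untested sketch; the hour is just an example, the URL scheme is the one documented on gharchive.org):

#!/usr/bin/env bash
# Rough sketch: download one hourly GH Archive dump and filter it for our repo.
hour="2019-03-01-15"
curl -sL "https://data.gharchive.org/$hour.json.gz" \
| gunzip \
| jq -c 'select(.repo.name == "xaur/decred-issues")' \
> "gharchive-$hour-decred-issues.json"

Looping that over a date range would give the full event history (IssuesEvent, IssueCommentEvent, etc.) without touching the GitHub API at all.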
Found a very interesting project: git-bug. It is a distributed issue tracker that stores issues (it calls them bugs) in Git objects. In the recent v0.4.0 release it added a GitHub importer so it can be used as an incremental (!) backup for issues, but not pull requests.
If they implement the other direction, an exporter to GitHub, it could serve as a full offline client for GitHub.
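If I read their README right, importing would go roughly like this (the exact commands are my assumption, the bridge is brand new):

# Inside any local clone of the repo; commands as I understand git-bug's README,
# so the exact subcommands and flags may differ.
git bug bridge configure   # interactive: pick the github bridge, target repo and an API token
git bug bridge pull        # import issues and comments as git objects
git bug ls                 # browse the imported issues offline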
Very cool. Funny you should mention this @xaur. I'm working right now on updating the contributor tracker so that it will also create full backups of commits, Issues, PRs, and comments on Issues and PRs. I wasn't setting out to do that, but in the process of updating the code to automate (non-controversial) repo-level dev stats (which we do manually for the Journal every month anyway), I realized that it was easier to store serialized JSON blobs of commits, Issues, etc. in the database than to hit the GitHub API every time.
So, we should soon be technically "off the grid" so to speak, up to this point. This git-bug tool could be fed our historical data with some massaging, or just used moving forward should we need radical decentralization more than the convenience/productivity gains from GitHub.
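To illustrate the "store the raw JSON blob" idea, something like the following sketch with the sqlite3 CLI (the database file, table, and column names are made up here, not the tracker's actual schema):

#!/usr/bin/env bash
# Hypothetical sketch: cache an issue's raw JSON in SQLite so later reads don't hit the API.
# Database file, table and column names are invented for illustration.
db="tracker.db"
sqlite3 "$db" "CREATE TABLE IF NOT EXISTS issues (number INTEGER PRIMARY KEY, json TEXT);"

num=23
json=$(curl -s "https://api.github.com/repos/xaur/decred-issues/issues/$num" | jq -c '.')
escaped=$(printf '%s' "$json" | sed "s/'/''/g")   # double single quotes for the SQL string literal
sqlite3 "$db" "INSERT OR REPLACE INTO issues (number, json) VALUES ($num, '$escaped');"

# Later, read it back locally instead of hitting the GitHub API again:
sqlite3 "$db" "SELECT json FROM issues WHERE number = $num;" | jq '.title'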
Great! Do you plan to store issue comment edits too?
Damn you, @xaur. Just when I think I've made your decentralized dreams come true, you ask for more!
I don't think the GitHub API provides edits. We'd have to track those ourselves.
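One low-tech way to do that would be to re-run a snapshot script like the one sketched above on a schedule and let git record each change to a comment's file, so the repo history doubles as the edit history. A hypothetical crontab entry (the path is made up):

# Take a snapshot every night at 03:00, assuming the earlier sketch is saved as snapshot.sh.
0 3 * * * /home/backup/snapshot.sh >> /home/backup/snapshot.log 2>&1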
Hey guys, since I was incorrectly added to this thread, could you please remove me from future emails?
Thanks!
Sorry about that other RR, removed the @ that added you. Hopefully you don't see this.
Sorry for bothering. If you still get the emails, hit the Unsubscribe button to the right (via the GitHub UI) or manage your subscriptions here: https://github.com/notifications/subscriptions
GitHub took down youtube-dl. The Git repo will likely show up elsewhere, but think of all the knowledge in issues and pull requests that was entrusted to GitHub and is now not accessible. This is why we need to back up our stuff. Especially dcrd, which has a lot of design decisions documented in great detail.
GitHub stores a ton of valuable discussions in pull requests and issues that shall not be lost.
Find a way to 1) download issues and pull requests, 2) incrementally sync them.
Try: