edgi-govdata-archiving / archivers.space

🗄 Event data management app used at DataRescues
https://www.archivers.space/
GNU Affero General Public License v3.0

Discussion: What would it be like to manage all this through GitHub repositories? #50

Open danielballan opened 7 years ago

danielballan commented 7 years ago

This is a question for long-term discussion, decoupled from the ongoing useful efforts to make the app more usable over the next 2-3 months.

The conda-forge project (which I have contributed to) manages a community of volunteers who adopt software packages they care about and collaboratively create and maintain scripts for building binaries for those packages. The scripts that they write are automatically executed using free CI services, and the resultant artifacts are uploaded to a common public site. Each software package is assigned a separate repo with a tiny subcommunity of users who follow notifications and perform maintenance.

This leads to a lot of repos, so conda-forge uses custom bots and the GH API to impose additional structure, keeping things organized and as automated as possible.
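For a flavor of what that automation can look like, here is a minimal sketch of the general approach (not conda-forge's actual bot code) using the GitHub REST API to scan an org's repos and flag ones missing a description. The org name and token are placeholders:

```python
# Sketch of bot-style housekeeping via the GitHub REST API.
# Not conda-forge's actual tooling; org name and token are placeholders.
import requests

ORG = "example-harvesting-org"  # hypothetical organization
TOKEN = "<personal-access-token>"

resp = requests.get(
    f"https://api.github.com/orgs/{ORG}/repos",
    headers={"Authorization": f"token {TOKEN}"},
    params={"per_page": 100},  # a real bot would also paginate
)
resp.raise_for_status()
for repo in resp.json():
    if not repo["description"]:
        print(f"{repo['full_name']} is missing a description")
```

A real bot would run on a schedule, walk every page of results, and open issues or apply labels rather than just printing.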

I see an analogy to our community of volunteers: we intend to adopt subdomains or sub-sections of subdomains, collaboratively write and maintain scripts that capture their data, execute those scripts on a server, and upload the results. Once the Bagging phase moves to a remote server, the task of our archivers.space app will be reduced to Research/Checking and uploading a harvesting script to a server. These sound like tasks that could be managed with GH labels, milestones, and comments, with harvesting scripts coming in through pull requests.
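To make "harvesting script" concrete, here is a rough, stdlib-only sketch of the simplest kind of script a repo might hold: a crawler that saves every page reachable under one subsection of a site. The URL is hypothetical, and a real script would need rate limiting, politeness, and error handling:

```python
# Rough sketch of a minimal harvesting script: save every page reachable
# under one hypothetical subsection. A real script would need rate
# limiting, error handling, deduplication of query strings, etc.
import re
from urllib.parse import urljoin
from urllib.request import urlopen

START = "https://www.example.gov/climate/"  # hypothetical target
seen = set()

def harvest(url):
    if url in seen or not url.startswith(START):
        return
    seen.add(url)
    html = urlopen(url).read().decode("utf-8", errors="replace")
    with open(url.replace("/", "_"), "w") as f:  # crude local filename
        f.write(html)
    for href in re.findall(r'href="([^"]+)"', html):
        harvest(urljoin(url, href.split("#")[0]))

harvest(START)
```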

Maybe conda-forge's model could work for us. What do you think? In what ways are our needs similar and different?

kmcculloch commented 7 years ago

I'm all for learning from other collaborative projects, and the model you're describing sounds cool. I'm making a "phase 2" label so we can flag open-ended discussions and issues that we might want to hold off on.

danielballan commented 7 years ago

Some feedback on Slack is telling me I should elaborate a little. Here goes.

If I want to start capturing a new subdomain or section of a website, I do this:

dcwalk commented 7 years ago

Oh interesting. Thanks for the clarification @danielballan

Thoughts/questions:

- Would these just be for harvesting 'tools'?
- I'm still wondering if we aren't underestimating the difficulty in programmatically extracting/downloading all of the 'uncrawlable' data.

danielballan commented 7 years ago

Good questions and observations.

> Would these just be for harvesting 'tools'?

Yes, I think it makes sense for each repo to contain scripts that are customized from a standard set. Some could be easy to standardize (e.g., recursively FTP this URL). Others would be highly customized to a tricky site, and that would be OK too.
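For instance, the recursive-FTP case could be standardized along these lines (a stdlib-only sketch; the host and path are made up):

```python
# Sketch of a standardized recursive-FTP harvester; host and path are
# hypothetical. Uses the "try cwd to tell directories from files" trick.
import os
from ftplib import FTP, error_perm

def mirror(ftp, remote_dir, local_dir):
    """Recursively download remote_dir into local_dir."""
    os.makedirs(local_dir, exist_ok=True)
    ftp.cwd(remote_dir)
    for name in ftp.nlst():
        try:
            ftp.cwd(name)   # succeeds only for directories
            ftp.cwd("..")
            mirror(ftp, name, os.path.join(local_dir, name))
        except error_perm:  # not a directory: download it as a file
            with open(os.path.join(local_dir, name), "wb") as f:
                ftp.retrbinary("RETR " + name, f.write)
    ftp.cwd("..")

ftp = FTP("ftp.example.gov")  # hypothetical host
ftp.login()                   # anonymous login
mirror(ftp, "/pub/data", "data")
```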

> I'm still wondering if we aren't underestimating the difficulty in programmatically extracting/downloading all of the 'uncrawlable' data.

This is a system for tracking individual scripts for individual subdomains/sections. I think the system is still useful even if each script is slightly different from every other script, and even if a couple of scripts are quite unusual. But if almost all the scripts bear almost no resemblance to one another, then the value of this system is limited. I'm not sure yet how customized the scripts need to be on average.

dcwalk commented 7 years ago

Right, totally get the advantages of a system for tracking individual scripts 👍
To your last point -- I'm not sure either, but I think there are definitely patterns; @janetriley, for instance, has been working to identify them: https://gist.github.com/janetriley/6dd97c6997edebe9c25f17a0ce8a77d4

titaniumbones commented 7 years ago

So, this is pretty neat. There are some technical bits I can't assess, but assuming you have those all in hand 😄, I have two remaining questions:

- How much of a bottleneck will admin approval be?
- What happens if we discover down the road that a script doesn't work?

I think it would be cool to try to test this out on a relatively small scale first if possible, if only because I'm worried about the latter of these two concerns.

danielballan commented 7 years ago

> How much of a bottleneck will admin approval be?

I think any technically savvy volunteer could be endowed with commit rights with little risk. We don't necessarily need to hold up the process with approval from Owner-level trusted contributors; we just need to know that two or three people have read a script and think that it works for scraping a site.

> What happens [if we discover the script doesn't work]?

Having the scripts public on GitHub makes them immediately easier to check than our current situation: scripts buried in a "tools" directory in a zip file that most people can't even access. I don't have any ideas at this point about how to manage ongoing verification; we'll have to develop a process as we go. This is one respect in which what we're doing differs from what conda-forge is doing: it's relatively easy for them to tell that their recipes are in working order.
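One low-tech starting point might be a smoke test that CI runs against each script, along these lines (a sketch; harvest.py and its --out flag are hypothetical stand-ins for whatever convention we adopt):

```python
# Sketch of a pytest smoke test for a harvesting script. 'harvest.py'
# and its '--out' flag are hypothetical stand-ins, not an existing tool.
import subprocess

def test_harvest_produces_output(tmp_path):  # tmp_path: built-in pytest fixture
    subprocess.run(
        ["python", "harvest.py", "--out", str(tmp_path)],
        check=True,    # fail the test if the script exits non-zero
        timeout=300,   # don't let a hung crawl stall CI forever
    )
    assert any(tmp_path.iterdir()), "harvest produced no files"
```

That wouldn't prove the harvested data is complete or correct, but it would at least catch scripts that break outright.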

> I think it would be cool to try to test this out on a relatively small scale first.

Definitely. In its early days, conda-forge actively avoided building up a large repertoire, focusing instead on tackling a handful of challenging examples to test their model and refine their automated tools. We should do the same.

I don't think there's any big rush on this. But if this is a direction we decide to explore further, I think it would be worth having a small meeting with the core conda-forge developers to get their input and learn from their experience. I have working relationships with three of them and could set that up.