edgi-govdata-archiving / archivers.space

🗄 Event data management app used at DataRescues
https://www.archivers.space/
GNU Affero General Public License v3.0

Discussion: What would it be like to manage all this through GitHub repositories? #50

Open danielballan opened 7 years ago

danielballan commented 7 years ago

This is a question for long-term discussion, decoupled from the ongoing useful efforts to make the app more usable over the next 2-3 months.

The conda-forge project (which I have contributed to) manages a community of volunteers who adopt software packages they care about and collaboratively create and maintain scripts for building binaries for those packages. The scripts that they write are automatically executed using free CI services, and the resultant artifacts are uploaded to a common public site. Each software package is assigned a separate repo with a tiny subcommunity of users who follow notifications and perform maintenance.

This leads to a lot of repos, so conda-forge uses custom bots and the GH API to impose additional structure, keeping things organized and as automated as possible.
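For a flavor of what that automation can look like, here is a minimal sketch of the general approach (not conda-forge's actual bot code) using the GitHub REST API to scan an org's repos and flag ones missing a description. The org name and token are placeholders:

```python
# Sketch of bot-style housekeeping via the GitHub REST API.
# Not conda-forge's actual tooling; org name and token are placeholders.
import requests

ORG = "example-harvesting-org"  # hypothetical organization
TOKEN = "<personal-access-token>"

resp = requests.get(
    f"https://api.github.com/orgs/{ORG}/repos",
    headers={"Authorization": f"token {TOKEN}"},
    params={"per_page": 100},  # a real bot would also paginate
)
resp.raise_for_status()
for repo in resp.json():
    if not repo["description"]:
        print(f"{repo['full_name']} is missing a description")
```

A real bot would run on a schedule, walk every page of results, and open issues or apply labels rather than just printing.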

I see an analogy to our community of volunteers: we intend to adopt subdomains or sub-sections of subdomains, collaboratively write and maintain scripts that capture their data, execute those scripts on a server, and upload the results. Once the Bagging phase moves to a remote server, the task of our archivers.space app will be reduced to Research/Checking and uploading a harvesting script to a server. These sound like tasks that could be managed with GH labels, milestones, and comments, with harvesting scripts coming in through pull requests.
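To make "harvesting script" concrete, here is a rough, stdlib-only sketch of the simplest kind of script a repo might hold: a crawler that saves every page reachable under one subsection of a site. The URL is hypothetical, and a real script would need rate limiting, politeness, and error handling:

```python
# Rough sketch of a minimal harvesting script: save every page reachable
# under one hypothetical subsection. A real script would need rate
# limiting, error handling, deduplication of query strings, etc.
import re
from urllib.parse import urljoin
from urllib.request import urlopen

START = "https://www.example.gov/climate/"  # hypothetical target
seen = set()

def harvest(url):
    if url in seen or not url.startswith(START):
        return
    seen.add(url)
    html = urlopen(url).read().decode("utf-8", errors="replace")
    with open(url.replace("/", "_"), "w") as f:  # crude local filename
        f.write(html)
    for href in re.findall(r'href="([^"]+)"', html):
        harvest(urljoin(url, href.split("#")[0]))

harvest(START)
```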

Maybe conda-forge's model could work for us. What do you think? In what ways are our needs similar and different?

kmcculloch commented 7 years ago

I'm all for learning from other collaborative projects, and the model you're describing sounds cool. I'm making a "phase 2" label so we can flag open-ended discussions and issues that we might want to hold off on.

danielballan commented 7 years ago

Some feedback on Slack is telling me I should elaborate a little. Here goes.

If I want to start capturing a new subdomain or section of a website, I do this:

dcwalk commented 7 years ago

Oh interesting. Thanks for the clarification @danielballan

Thoughts/questions:

- Would these just be for harvesting 'tools'?
- I'm still wondering if we aren't underestimating the difficulty in programmatically extracting/downloading all of the 'uncrawlable' data.

danielballan commented 7 years ago

Good questions and observations.

> Would these just be for harvesting 'tools'?

Yes, I think it makes sense for each repo to contain scripts that are customized from a standard set. Some could be easy to standardize (e.g., recursively FTP this URL). Others would be highly customized to a tricky site, and that would be OK too.
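For instance, the recursive-FTP case could be standardized along these lines (a stdlib-only sketch; the host and path are made up):

```python
# Sketch of a standardized recursive-FTP harvester; host and path are
# hypothetical. Uses the "try cwd to tell directories from files" trick.
import os
from ftplib import FTP, error_perm

def mirror(ftp, remote_dir, local_dir):
    """Recursively download remote_dir into local_dir."""
    os.makedirs(local_dir, exist_ok=True)
    ftp.cwd(remote_dir)
    for name in ftp.nlst():
        try:
            ftp.cwd(name)   # succeeds only for directories
            ftp.cwd("..")
            mirror(ftp, name, os.path.join(local_dir, name))
        except error_perm:  # not a directory: download it as a file
            with open(os.path.join(local_dir, name), "wb") as f:
                ftp.retrbinary("RETR " + name, f.write)
    ftp.cwd("..")

ftp = FTP("ftp.example.gov")  # hypothetical host
ftp.login()                   # anonymous login
mirror(ftp, "/pub/data", "data")
```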

> I'm still wondering if we aren't underestimating the difficulty in programmatically extracting/downloading all of the 'uncrawlable' data.

This is a system for tracking individual scripts for individual subdomains/sections. I think the system is still useful even if each script is slightly different from every other script, and even if a couple of scripts are quite unusual. But if almost all the scripts bear almost no resemblance to one another, then the value of this system is limited. I'm not sure yet how customized the scripts need to be on average.

dcwalk commented 7 years ago

Right, totally get the advantages of a system for tracking individual scripts 👍
To your last point -- I'm not sure either, but I think there are definitely patterns; @janetriley, for instance, has been working to identify them: https://gist.github.com/janetriley/6dd97c6997edebe9c25f17a0ce8a77d4

titaniumbones commented 7 years ago

So, this is pretty neat. There are some technical bits I can't assess, but assuming you have those all in hand 😄, I have two remaining questions:

- How much of a bottleneck will admin approval be?
- What happens if we discover down the road that a script doesn't work?

I think it would be cool to try to test this out on a relatively small scale first if possible, if only because I'm worried about the latter of these two concerns.

danielballan commented 7 years ago

> How much of a bottleneck will admin approval be?

I think any technically savvy volunteer could be endowed with commit rights with little risk. We don't necessarily need to hold up the process with approval from Owner-level trusted contributors; we just need to know that two or three people have read a script and think that it works for scraping a site.

> What happens [if we discover the script doesn't work]?

Having the scripts public on GitHub makes them immediately easier to check than our current situation: scripts buried in a "tools" directory in a zip file that most people can't even access. I don't have any ideas at this point about how to manage ongoing verification; we'll have to develop a process as we go. This is one respect in which what we're doing differs from what conda-forge is doing: it's relatively easy for them to tell that their recipes are in working order.
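One low-tech starting point might be a smoke test that CI runs against each script, along these lines (a sketch; harvest.py and its --out flag are hypothetical stand-ins for whatever convention we adopt):

```python
# Sketch of a pytest smoke test for a harvesting script. 'harvest.py'
# and its '--out' flag are hypothetical stand-ins, not an existing tool.
import subprocess

def test_harvest_produces_output(tmp_path):  # tmp_path: built-in pytest fixture
    subprocess.run(
        ["python", "harvest.py", "--out", str(tmp_path)],
        check=True,    # fail the test if the script exits non-zero
        timeout=300,   # don't let a hung crawl stall CI forever
    )
    assert any(tmp_path.iterdir()), "harvest produced no files"
```

That wouldn't prove the harvested data is complete or correct, but it would at least catch scripts that break outright.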

> I think it would be cool to try to test this out on a relatively small scale first.

Definitely. In its early days, conda-forge actively avoided building up a large repertoire, focusing instead on tackling a handful of challenging examples to test their model and refine their automated tools. We should do the same.

I don't think there's any big rush on this. But if this is a direction we decide to explore further, I think it would be worth having a small meeting with the core conda-forge developers to get their input and learn from their experience. I have working relationships with three of them and could set that up.