edgi-govdata-archiving / archivers.space

🗄 Event data management app used at DataRescues
https://www.archivers.space/
GNU Affero General Public License v3.0

Mechanism to allow harvesters to target sites/URLs that cater to their specialties for more efficiency #76

Open actnowchicagoarea opened 7 years ago

actnowchicagoarea commented 7 years ago

Posting this on behalf of some harvesters from my event last weekend. For a bit of context, I got feedback from folks that a video walkthrough for harvesting could be beneficial. I initially misunderstood this as a request for a video tutorial of the Archivers app. They clarified their problem to me via email, quoted below:

The challenge is in what to do once someone downloads a starter zip file. This is very different depending on the site and its interface.

I'm not sure how beneficial it would be to document the process for the site that Patrick and I harvested. I'm hesitant to do a video because I know it takes at least 4 hours to produce a decent 10-15 minute video. I'm also not sure it would give someone who hasn't scraped a site with Python enough information to do so with an arbitrary site. It might give people some useful tips, but there's a big gap between following what someone else is doing and being able to apply similar techniques to another site with a completely different interface. Plus, we would really need to record a separate video to give an example of each method of harvesting.

I think a major issue with the current workflow is finding the right sites to work on. If I'm a harvester who is good at a particular method, I'd like to find other sites on which I can put that method to work. But it might take 20 minutes of reading through researcher comments on various sites to find one. This suggests that there should be a fixed list of harvesting methods from which the researchers choose their best guess, and that the chosen method should be displayed as an additional column on the list of sites that are waiting to be harvested. As a harvester, this would let me work more efficiently, and it shouldn't add much to the researchers' workload.
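
As a rough illustration of the proposal above, here is a minimal Python sketch of a fixed list of harvesting methods, a per-site "best guess" field set by the researcher, and a filter a harvester could use. This is not the Archivers app's actual data model: the method names, the `QueuedSite` fields, the `sites_for_method` helper, and the example URLs are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class HarvestMethod(Enum):
    """Hypothetical fixed list of methods a researcher can pick from."""
    DIRECT_DOWNLOAD = "direct download"
    HTML_SCRAPE = "HTML scrape"
    API_QUERY = "API query"
    FTP_MIRROR = "FTP mirror"
    UNKNOWN = "unknown"


@dataclass
class QueuedSite:
    """A site waiting to be harvested (field names are assumptions)."""
    url: str
    researcher_notes: str
    # The researcher's best guess; defaults to UNKNOWN so the field is optional.
    suggested_method: HarvestMethod = HarvestMethod.UNKNOWN


def sites_for_method(queue: List[QueuedSite], method: HarvestMethod) -> List[QueuedSite]:
    """Return only the queued sites whose suggested method matches."""
    return [site for site in queue if site.suggested_method is method]


# Example queue with placeholder URLs.
queue = [
    QueuedSite("https://example.org/data/reports", "Paginated HTML tables", HarvestMethod.HTML_SCRAPE),
    QueuedSite("https://example.org/files", "Plain directory of CSVs", HarvestMethod.DIRECT_DOWNLOAD),
    QueuedSite("https://example.org/search", "Search form over a database", HarvestMethod.HTML_SCRAPE),
    QueuedSite("https://example.org/misc", "Not yet assessed"),
]

# A harvester who specializes in scraping sees only the matching sites,
# with the method shown alongside each URL like an extra column.
for site in sites_for_method(queue, HarvestMethod.HTML_SCRAPE):
    print(f"{site.url}\t{site.suggested_method.value}")
```

Keeping an explicit `UNKNOWN` value means researchers are never forced to guess, which fits the goal of not increasing their workload.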