edgi-govdata-archiving / archivers.space

🗄 Event data management app used at DataRescues
https://www.archivers.space/
GNU Affero General Public License v3.0

Should the App store non-data URLs? #63

Open titaniumbones opened 7 years ago

titaniumbones commented 7 years ago

We currently collect URLs for the app through the Chrome extension (https://github.com/edgi-govdata-archiving/eot-nomination-tool). We use the same tool to collect "seeds" for nomination to the Internet Archive's web crawls. Since we are planning to remove the error-prone spreadsheet stage from this process for datasets (https://github.com/edgi-govdata-archiving/archivers.space/issues/15), should we also use the app to store our seeds? If so, can we easily automate/routinize the process of handing seeds over to the IA for their crawling process?

dcwalk commented 7 years ago

Pinging @danielballan and @b5 here for input -- would be great to move forward on a new way of storing the seeds from the Chrome extension!

b5 commented 7 years ago

@danielballan has written an ingest feature for the current app that treats the spreadsheet as the source of truth; unfortunately, our recent refactoring has broken it. It'll take a day or two to get it back online, but it's on my radar.

I think this would be a very good idea moving forward in the new app. The only way to properly keep track of the URL-to-content relationship is for the new app to keep a record of every URL we've seen, and its links to and from other URLs. It would make a lot of sense to have the nomination tool talk directly to this new service. I've already written an API for this exact use case, and all we'd need to do is store any additional backing data that the Chrome extension may rely on.
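The URL-to-content tracking described above amounts to keeping a record per URL plus its inbound and outbound links. As a rough illustration only (the actual service's data model and API are not shown in this thread, and every name here is hypothetical), an in-memory version might look like:

```javascript
// Hypothetical sketch of a URL registry: one record per URL seen,
// with the links to and from other URLs tracked in both directions.
class UrlRegistry {
  constructor() {
    // url -> { url, linksTo: Set<string>, linkedFrom: Set<string> }
    this.records = new Map();
  }

  // Ensure a record exists for a URL, creating one on first sight.
  ensure(url) {
    if (!this.records.has(url)) {
      this.records.set(url, { url, linksTo: new Set(), linkedFrom: new Set() });
    }
    return this.records.get(url);
  }

  // Record that `fromUrl` links to `toUrl`, updating both records.
  addLink(fromUrl, toUrl) {
    this.ensure(fromUrl).linksTo.add(toUrl);
    this.ensure(toUrl).linkedFrom.add(fromUrl);
  }
}

const registry = new UrlRegistry();
registry.addLink('https://example.gov/data', 'https://example.gov/data/file.csv');
```

A real service would persist these records and expose them over HTTP, but the bidirectional bookkeeping is the core idea.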

Once we get the new service up and running, I think the first step would be to have the Chrome extension send to both the spreadsheet and the new service; writing that integration should take no more than a day or two if we can get a chat going with the nomination tool authors. From there we can crawl all the URLs on the spreadsheet to ingest them into the new app, and start to think about a full transition.
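The dual-write step proposed here is essentially a fan-out: each nomination goes to both destinations, and a failure at one must not silently drop the other. A minimal sketch, with the sender functions injected so the logic is testable (none of these names come from the actual extension):

```javascript
// Hypothetical fan-out for the transition period: submit one nomination
// to every destination (spreadsheet, new service, ...) and report how
// many deliveries succeeded or failed rather than failing as a whole.
async function submitNomination(nomination, senders) {
  const results = await Promise.allSettled(senders.map(send => send(nomination)));
  const failed = results.filter(r => r.status === 'rejected').length;
  return { delivered: results.length - failed, failed };
}
```

In the real extension each sender would be an HTTP POST to the spreadsheet backend or the new service's API; surfacing partial failures lets the spreadsheet remain the source of truth while the new service catches up.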

As for doing this with the current archivers app, I'm presently too loaded down fleshing out the new service, but we'll open the service up tonight, and maybe we can get a kind human from the community to contribute a PR that kicks off this experiment with our current app?