Closed — titaniumbones closed this issue 5 years ago
I am thinking of setting up a RESTful API with Flask and a Postgres database to submit the URLs from the Chrome extension, since we have painfully learned the spreadsheet is a hassle. The Chrome extension can post to the db, and in turn the Archivers app can also call the db. Let me know how this sounds; I can start scoping out a plan that can accommodate everyone.
We have also discussed sending URLs directly from the extension to the app.
@danielballan what were/are the concerns around a direct integration? Are there Archivers issues we should pay attention to?
I recall this PR RE: integration: https://github.com/b5/pipeline/pull/74
The way it works now is, an admin clicks a link in archivers.space app, the app queries our Google Spreadsheet for new rows, and it ingests them into our URL collection. Ideally the admin would check the spreadsheet before clicking the link to be sure we haven't been bombarded with 100000 junk submissions. Either way, we track which URLs come from which ingestion, so we can always roll back a particular ingestion that contained junk.
If we want to get away from Google Spreadsheets, I think it's worth considering removing the middleman altogether. Instead of building a separate Flask app to cache submissions, add a POST request target to the archivers.space app that accepts URLs in a kind of staging area. The ingestion process can remain the same, from the admin user point of view, but everything will happen inside the app: moving from a sort-of "quarantined" staging area into the main database.
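A minimal sketch of that staging-and-ingestion flow, using an in-memory sqlite3 database as a stand-in for the app's real database (the table name, column names, and function names here are hypothetical, not the actual archivers.space schema):

```python
import sqlite3
import uuid

def init_db(conn):
    # Submissions land as "staged" (the quarantine area). Ingested rows keep
    # the ingestion_id of the batch that promoted them, so a batch of junk
    # can be rolled back wholesale.
    conn.execute("""CREATE TABLE IF NOT EXISTS seeds (
        url TEXT PRIMARY KEY,          -- primary key dedupes repeat submissions
        status TEXT DEFAULT 'staged',  -- 'staged' or 'ingested'
        ingestion_id TEXT)""")

def submit(conn, url):
    # What the POST target would do: drop the URL into quarantine.
    conn.execute("INSERT OR IGNORE INTO seeds (url) VALUES (?)", (url,))

def ingest(conn):
    # Admin-triggered: promote everything staged, under one ingestion id.
    batch = uuid.uuid4().hex
    conn.execute("UPDATE seeds SET status='ingested', ingestion_id=? "
                 "WHERE status='staged'", (batch,))
    return batch

def rollback(conn, batch):
    # Remove a particular ingestion that turned out to contain junk.
    conn.execute("DELETE FROM seeds WHERE ingestion_id=?", (batch,))
```

From the admin's point of view nothing changes: clicking "ingest" still pulls new submissions in, but the quarantine, dedup, and per-batch rollback all happen inside one database instead of via a spreadsheet.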
What do you think, @sonalranjit?
@danielballan it makes sense to add a POST endpoint to the archivers app. Let me know how I can help with that; as far as I know the app is built on Meteor.
Great. Please do! Yes, it's built using Meteor and it's in a repo that we are keeping private for now. I'll get you access....
@titaniumbones -- could you update us on any conversation with Internet Archive about how to proceed?
Comment from DataRefuge slack today:
A new nomination tool will be set up to which folks can keep contributing seeds and these will continue to come to us (Internet Archive). The gist is that some institutions will ramp down crawling so we have a "bookended" EOT collection. We will keep taking nominations and crawling, though we may direct them into a broader .gov/.mil effort.
Sorry for the delay. Um. I had understood things slightly differently... I'll ask Jefferson again. my thoughts as of today (which is a Thursday! and therefore nearly an event day!):
ah, I see that comment was from Jefferson. So... this suggests we can continue to provide seeds. In which case, we should likely figure out a better way of doing so than sending-spreadsheets-to-very-busy-people-every-week.
Sorry I'm so slow to see this, @titaniumbones, @danielballan, and @sonalranjit. I don't think there's a particular need to keep a spreadsheet of seed URLs we send to IA, especially if it's a hassle. It sounds like, whether we use @sonalranjit's short-term implementation or the in-app staging area suggested by @danielballan, we will still have a DB where all of the seed URLs we've collected live permanently (and I think that makes sense, as it allows us to deduplicate at any point). Sounds like that place could just remain our permanent record of what we've collected, and then we find a convenient way to send those URLs on to IA.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.
With the End of Term project finishing up, we need to rethink the Chrome extension.
we no longer need to send "seeds" to the Internet Archive. Instead, we can just send a request to their API, asking them to take a snapshot of each URL
if we want, we can take this opportunity to move from spreadsheets to a more robust database backend. @sonalranjit has lots of ideas about this.
finally, we could figure out a way for the archivers.space app to ingest URLs directly from @sonalranjit's database. @danielballan has done some similar work here: https://github.com/b5/pipeline/pull/74 and it would be nice to co-ordinate these efforts.
one concern: @ambergman, do you feel there are any reasons for us to keep records of the URLs we send to the IA? If so, what reasons? In particular, do you really want a spreadsheet for any of it?
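On the snapshot-request point above: the Internet Archive's Save Page Now service accepts capture requests at `https://web.archive.org/save/<url>`, so the extension could trigger snapshots directly rather than submitting seeds. A hedged sketch (the helper name is ours, not an IA client library):

```python
from urllib.parse import quote

SAVE_ENDPOINT = "https://web.archive.org/save/"

def save_page_now_request(target_url):
    # Build the Save Page Now capture URL; fetching it (a plain GET) asks
    # the Wayback Machine to take a fresh snapshot of target_url.
    # Percent-encode the target but keep URL structure characters intact.
    return SAVE_ENDPOINT + quote(target_url, safe=":/?&=#")

# To actually trigger a capture for a nominated seed (network call):
#   import urllib.request
#   urllib.request.urlopen(save_page_now_request("https://www.epa.gov/"))
```

Rate limits and auth policies on that endpoint would need checking with IA before the extension fires requests automatically.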
Am hoping you all can start the ball rolling on this!
I'm hoping @dallan @