edgi-govdata-archiving / eot-nomination-tool

📚 Chrome extension to nominate government data that needs to be preserved
https://chrome.google.com/webstore/detail/nominationtool/abjpihafglmijnkkoppbookfkkanklok
GNU General Public License v3.0

Alert that the URL has already been submitted while filling in the URL field #10

Closed atesgoral closed 7 years ago

atesgoral commented 7 years ago

It would be nice to warn the submitter -- prior to submitting -- that a URL has already been submitted.

titaniumbones commented 7 years ago

Agreed. That logic is a little harder to implement b/c we would have to query the underlying spreadsheet that the form submits to. To do that, the extension would have to authenticate against Google Sheets... so the backend is pretty unsatisfactory for solving this problem, it turns out.

shabscan commented 7 years ago

Quick thought: are the people using the Chrome extension trusted to view all the URLs submitted so far? If yes, just run an async query to keep the URL map in memory so it can be matched against the current page, displaying a match even before submitting (thus saving them time!)
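
For illustration, a minimal sketch of that in popup.js (SUBMITTED_URLS_URL here is a hypothetical endpoint returning a JSON array of the URLs submitted so far, and #status is assumed to be the popup's message element):

$.getJSON(SUBMITTED_URLS_URL, function (urls) {
  var submitted = new Set(urls); // keep the URL map in memory for O(1) lookups
  chrome.tabs.query({ active: true, currentWindow: true }, function (tabs) {
    if (submitted.has(tabs[0].url)) {
      // warn before the user bothers filling in the form
      $('#status').text('This URL has already been submitted.');
    }
  });
});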

titaniumbones commented 7 years ago

Yes, I don't see why we wouldn't make that list freely accessible. Assigned to you! Feel free to unassign yourself!

titaniumbones commented 7 years ago

If you have any further thoughts on this @shabscan, let us know. It's beyond me, though, I think.

titaniumbones commented 7 years ago

Tagging @trinberg on this because I think we should start thinking about the long-term plans for these seeds. Ultimately I think we need a new backend that allows us to run more sophisticated queries on the db, e.g. giving raw numbers of seeds for all identified vulnerable programs, tying this back to some kind of visualization, etc... in any case, we could maybe start by figuring out whether it's practical to run a simple query that searches for the current URL in the spreadsheet before submitting.
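
As a rough sketch of that pre-submit lookup (assuming the sheet is published to the web; SHEET_ID, the URL column, here column B, and currentUrl are all hypothetical placeholders):

$.get('https://docs.google.com/spreadsheets/d/' + SHEET_ID + '/gviz/tq', {
  tqx: 'out:csv',
  tq: "select B where B = '" + currentUrl + "'"
}, function (csv) {
  // anything beyond the header row means the URL is already in the sheet
  if (csv.trim().split('\n').length > 1) {
    $('#status').text('This URL has already been nominated.');
  }
});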

trinberg commented 7 years ago

Yes, thanks for adding me to this - several people at Philly mentioned that this would be a great feature that would save time during an archive-a-thon, especially with the current workflow, where people are working on particular offices in groups of 5-6. Even if the app were only cross-checking the sites at one archiving event, that would be a good start, I think. I don't know much about this, but my concern is that as the list gets long, querying will take longer and the app may run slower.

Regarding longer-term plans: completely agree @titaniumbones. We should think about how to attribute seeds, and then archived pages/data, to a db structure that reflects the agency org charts (including priorities), and also think about visualizing all of this in a good way.

Will follow up on this...

sonalranjit commented 7 years ago

I agree with centralizing the records for all these seeds in a database. The extension can then be modified to post to a database and generalized for use at any event. This will allow complex queries to be built on top of the extension. From what I know, there will have to be an intermediary webapp that interfaces with the database to run its queries.
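
As a hypothetical sketch of such an intermediary (not code from this repo; an in-memory Set stands in for the real database):

const express = require('express');
const app = express();

const seeds = new Set(); // stand-in for the real database of nominated seeds

// the extension could call this before submitting a nomination
app.get('/api/seeds/check', (req, res) => {
  res.json({ submitted: seeds.has(req.query.url) });
});

// ...and this to record one
app.post('/api/seeds', express.json(), (req, res) => {
  seeds.add(req.body.url);
  res.sendStatus(201);
});

app.listen(3000);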

atesgoral commented 7 years ago

I'm very tempted to just dive in and create an API backend to handle this. But, I'd like to see if there's a cloud-based solution that we could concoct to avoid having to deploy a backend.

atesgoral commented 7 years ago

@paracycle, any thoughts?

paracycle commented 7 years ago

Thanks for the mention @atesgoral. I can recommend a few cloud-based, lightweight database services with APIs, which act like, and are managed as, (glorified) spreadsheets on the web:

  1. Fieldbook: This one is more lightweight and more Excel-like, but with good database semantics and an API with clients in several languages. The Chrome extension project based on the Fieldbook API might be of particular interest.
  2. Airtable: This one markets itself more around organizing information, so it takes a slightly different approach to maintaining a database. It also has an API and several pre-existing integrations that you can use (think sending table update notifications to a Slack channel).

Finally, I would recommend Webtask if you are interested in an easy-to-use, easy-to-deploy serverless backend/glue solution to tie everything together.
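
For illustration, a lookup against the Airtable REST API could look roughly like this (BASE_ID, the Seeds table, the URL field, AIRTABLE_API_KEY, and currentUrl are all hypothetical placeholders):

$.ajax({
  url: 'https://api.airtable.com/v0/' + BASE_ID + '/Seeds',
  headers: { Authorization: 'Bearer ' + AIRTABLE_API_KEY },
  data: { filterByFormula: "{URL} = '" + currentUrl + "'" },
  success: function (data) {
    if (data.records.length > 0) {
      $('#status').text('This URL has already been nominated.');
    }
  }
});

One caveat: the API key would have to ship inside the extension, which is one reason a thin server-side proxy like the Webtask option above could still be worth pairing with it.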

titaniumbones commented 7 years ago

thx @paracycle, I haven't checked these out yet. But as you look these over @atesgoral, if the assessment of them all seems pretty even, it might be worth knowing that EDGI is using Airtable for some other stuff.

trinberg commented 7 years ago

A few people brought this up as a great feature to have during the event today, so I wanted to see if there is still interest in pushing this forward. Is cross-checking the Google Sheet's URL column a realizable goal?

atesgoral commented 7 years ago

I could take a stab at this. @paracycle wanna collab?

titaniumbones commented 7 years ago

Everyone should recognize that WE DO NOT HAVE A SINGLE SHEET THAT STORES ALL URLs ANYMORE. So this would likely be an "in-event" tool -- if the event is well-targeted on a single agency or several agencies, this would still be extremely helpful.

@trinberg, users should still be instructed to check the IA first before submission.

mhucka commented 7 years ago

If there's no single list of all URLs, then what about at least querying the Wayback Machine's API to ask if a given URL is already in the Internet Archive? Not perfect, but perhaps better than nothing?

titaniumbones commented 7 years ago

@mhucka I think that would be awesome. Do you have time to implement it (in popup.js)? If not, I can try, but I will be slow.

Maybe we could write a response in the same element currently used to show success/failure messages?

atesgoral commented 7 years ago

@mhucka I had suggested making it even a bit more proactive in #35. Even without opening the popup, there could be a visual indicator on the toolbar when a non-archived page is visited. I don't know how robust the IA API is, though...
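
A sketch of that #35 idea as a background script (manifest v2-era APIs; assumes the Wayback availability endpoint discussed below):

chrome.tabs.onUpdated.addListener(function (tabId, changeInfo, tab) {
  // only check http(s) pages, and only once they have finished loading
  if (changeInfo.status !== 'complete' || !/^https?:/.test(tab.url)) return;
  fetch('https://archive.org/wayback/available?url=' + encodeURIComponent(tab.url))
    .then(function (res) { return res.json(); })
    .then(function (data) {
      var archived = data.archived_snapshots && data.archived_snapshots.closest;
      // badge the toolbar icon when the current page is not archived
      chrome.browserAction.setBadgeText({ tabId: tabId, text: archived ? '' : '!' });
    });
});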

titaniumbones commented 7 years ago

@atesgoral do you think that's something you can implement? it sounds great.

mhucka commented 7 years ago

@atesgoral That #35 is a great idea. Sorry I didn't see that.

@titaniumbones I don't want to make promises (because I keep failing so miserably at keeping them), but I think I can try to look into this next week.

mhucka commented 7 years ago

FYI: I just learned about a different API offered by the Wayback machine: https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server

This has a helluva lot more functionality than the Wayback Machine availability API and is probably the better thing to use for any queries to IA.
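
For reference, a minimal CDX query and the shape of its JSON output (field values here are illustrative only; the first row is a header naming the fields):

GET http://web.archive.org/cdx/search/cdx?url=whitehouse.gov&output=json&limit=-1

[["urlkey","timestamp","original","mimetype","statuscode","digest","length"],
 ["gov,whitehouse)/","20170202062552","https://www.whitehouse.gov/","text/html","200","...","..."]]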

titaniumbones commented 7 years ago

ah shit, I just wrote this up while you were typing -- in another bug no less, jeez -- anyway, here's my take for the record: the Internet Archive can tell you whether a URL is already in the archive via a simple API; http://archive.org/wayback/available?url=http://whitehouse.gov will return

{"archived_snapshots":{"closest":{"available":true,"url":"http://web.archive.org/web/20170202062552/https://www.whitehouse.gov/","timestamp":"20170202062552","status":"200"}}}

or w/ jquery:

var request = $.getJSON("http://archive.org/wayback/available?url=http://whitehouse.gov");
request.done(function (data) {
  console.log(data.archived_snapshots.closest.timestamp);
});

I would love to see this happen.

Is there anyone out there who thinks they might be able to do this by Saturday @mi-lee @sonalranjit @atesgoral @danielballan @geppy?

If you don't know what the pipeline app is, don't worry about it; we are finally moving the URL tracking out of Google Forms, and eventually this app will also move.

@atesgoral if you think your other, cooler idea can also work, please go ahead and implement that! Saturday morning is good, Friday even better; later is also fine, but then I will stop worrying about this. Thanks all

mhucka commented 7 years ago

Just for the record, the API you mentioned is what check-ia uses, but it is a suboptimal API: it's the one that's been giving us occasional false positives. Apparently the CDX API is better and is what IA people themselves use (according to Jefferson). So in terms of functionality, what you wrote above is fine, but for the implementation I recommend not using the API you mentioned there (the one that returns "archived_snapshots").

titaniumbones commented 7 years ago

So the preferred code would be something like:

$.getJSON("http://web.archive.org/cdx/search/cdx", { url: "whitehouse.gov", limit: -1, output: "json" })
  .done(function (rows) {
    // rows[0] is the header row; rows[1][1] is the timestamp (yyyyMMddhhmmss) of the newest capture
    var ts = rows[1][1];
    var captured = new Date(ts.slice(0, 4), ts.slice(4, 6) - 1, ts.slice(6, 8));
    var isArchived = (Date.now() - captured.getTime()) < 30 * 24 * 60 * 60 * 1000; // captured within the last month
  });

yes?

atesgoral commented 7 years ago

I think without an explicit "Check" button, the popup can go ahead and kick off the check in the background and only enable the Submit button after the lookup has finished/failed/timed out. Aim: reducing the number of clicks the user has to perform. I think we can be agile with this and implement it first in the popup (with the API mentioned above) and then worry about my suggestion in #35 as a nice-to-have.
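
A sketch of that flow in popup.js (currentUrl, #submit, and #status are assumed names; the timeout value is arbitrary):

$('#submit').prop('disabled', true);
$.ajax({
  url: 'http://web.archive.org/cdx/search/cdx',
  data: { url: currentUrl, limit: -1, output: 'json' },
  dataType: 'json',
  timeout: 5000 // don't hold the user up if IA is slow
}).done(function (rows) {
  // an empty array means no captures; otherwise rows[0] is the header row
  if (rows.length > 1) {
    $('#status').text('Heads up: this page is already in the Internet Archive.');
  }
}).always(function () {
  // re-enable whether the lookup finished, failed, or timed out
  $('#submit').prop('disabled', false);
});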

titaniumbones commented 7 years ago

@atesgoral any chance of that in the next, say, 29 hours?

titaniumbones commented 7 years ago

Maybe don't worry about disabling the Submit button -- the logic is a little more complex b/c some of the sites we're nominating now are "uncrawlable" -- that is, they may be archived, but because they're complex and JavaScript-heavy, the archived copy may be inadequate. So there might be motivation to save snapshots anyway.

atesgoral commented 7 years ago

@titaniumbones I can take a stab at this today.

titaniumbones commented 7 years ago

sweet.

atesgoral commented 7 years ago

I'm currently getting 503 responses from the CDX API. I either got blocked due to spamming them during development, or their infrastructure is flaky.

atesgoral commented 7 years ago

OK, the problem is gone now. But it's worrisome that I got a bunch of 503s for at least a while. BTW, the implementation is complete; I'm just embellishing it a bit. Will cut a PR soon.

atesgoral commented 7 years ago

PR #49

Couldn't help doing some non-pertinent refactoring along the way.

mhucka commented 7 years ago

Kudos to you @atesgoral!

mhucka commented 7 years ago

Regarding the 503 errors: My sense is that 503's are usually more about overload conditions on the server, so it may have been simply too many people trying to access it at the same time. Here's a web page that mentions that the response may include a Retry-After header – perhaps a future update to the nomination tool could check the header and take some action (like try again after the specified time): https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
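
A sketch of that retry logic (showResult is a stand-in for whatever displays the outcome; note that for a cross-origin request, Retry-After is only readable if IA's server exposes the header via CORS):

function showResult(rows) { /* display the outcome in the popup */ }

function checkArchive(url, attempt) {
  $.getJSON('http://web.archive.org/cdx/search/cdx', { url: url, limit: -1, output: 'json' })
    .done(showResult)
    .fail(function (xhr) {
      var retryAfter = parseInt(xhr.getResponseHeader('Retry-After'), 10);
      // on a 503, wait the suggested interval and try again (up to 3 attempts)
      if (xhr.status === 503 && retryAfter > 0 && attempt < 3) {
        setTimeout(function () { checkArchive(url, attempt + 1); }, retryAfter * 1000);
      }
    });
}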

titaniumbones commented 7 years ago

Hmm. So even if @atesgoral's implementation works, maybe we shouldn't try it out with 200 people all at once in NYC?

mhucka commented 7 years ago

Huh. Well, that's a point, if that was the cause of the 503. I guess we don't know for sure? It also depends on the server implementation and capacity. Maybe we should try to reach out to IA people and ask if they have any thoughts on this?

I'll email Jefferson Bailey right now. I'm going to be offline for a couple of hours but if there's a reply later, I'll post what I learn here.

mhucka commented 7 years ago

Update: I received replies about the 503s. Jefferson said he suspected a server glitch, and he cc'ed Mark Graham (director of the Wayback Machine), who wrote back:

Our systems SHOULD be able to take the load. Clearly from time to time that is not the case. I would not suggest we try to limit use at this time... and will look into things on our back end.

So, it sounds likely that the 503s were unusual, and we don't have to limit use.

atesgoral commented 7 years ago

Awesome! I'll re-enable the lookups.