Agreed. That logic is a little harder to implement b/c we would have to query the underlying spreadsheet that the form submits to. To do that, the extension would have to authenticate against Google Sheets... so the backend is pretty unsatisfactory for solving this problem, it turns out.
Quick thought: are the people using the Chrome extension trusted to view all the URLs submitted so far? If yes, just run an async query to keep the URL map in memory so it can be matched against the current page, displaying a match even before submitting (thus saving them time!!)
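A minimal sketch of that idea, where `fetchSubmittedUrls()` is a hypothetical helper (actually reading the sheet the form feeds would still need auth or a backend, per the comment above):

```js
// Load the already-submitted URLs once, keep them in memory, and check
// the current page against them before the user submits.
var submittedUrls = new Set();
fetchSubmittedUrls().then(function (urls) { // hypothetical: resolves to an array of URL strings
  urls.forEach(function (u) { submittedUrls.add(u); });
});

function alreadySubmitted(url) {
  return submittedUrls.has(url);
}
```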
Yes, I don't see why we wouldn't make that list freely accessible. Assigned to you! Feel free to unassign yourself!
If you have any further thoughts on this @shabscan let us know. Beyond me though I think.
Tagging @trinberg on this because I think we should start thinking about the long-term plans for these seeds. Ultimately I think we need a new backend that allows us to run more sophisticated queries on the db, e.g. getting raw numbers of seeds for all identified vulnerable programs, tying this back to some kind of visualization, etc. In any case, we could maybe start by figuring out whether it's practical to run a simple query that searches for the current URL in the spreadsheet before submitting.
Yes, thanks for adding me to this - several people at Philly mentioned that this would be a great feature that would save time during an archive-a-thon. Especially with the current workflow - people are working on particular offices in groups of 5-6. Even if the app only cross-checked the sites from one archiving event, that would be a good start, I think. I don't know much about this, but my concern is that as the list gets long, querying will take longer and the app may run slower.
Regarding longer term plans: completely agree @titaniumbones. We should think about how to attribute seeds and then archived pages/data to a db structure that reflects the agency org charts (including priorities) and also think about visualizing the same in a good way.
Will follow up on this...
I agree with centralizing the records for all these seeds in a database. The extension can then be modified to post to a database and be generalized for use at any event. This will allow complex queries to be built on top of the extension. From what I know, there will have to be an intermediary web app that interfaces with the database to run its queries.
I'm very tempted to just dive in and create an API backend to handle this. But, I'd like to see if there's a cloud-based solution that we could concoct to avoid having to deploy a backend.
@paracycle, any thoughts?
Thanks for the mention @atesgoral. I can recommend a few cloud-based, lightweight database services with APIs, which act like, and are managed as, (glorified) spreadsheets on the web:
Finally, I would recommend Webtask if you are interested in an easy-to-use, easy-to-deploy serverless backend/glue solution to tie everything up.
thx @paracycle, have not checked these out yet. But as you look these over @atesgoral, if the assessment of them all seems pretty even, it might be worth knowing that EDGI is using Airtable for some other stuff.
A few people brought this up as a great feature to have during the event today, so I wanted to see if there is still interest in pushing this forward. Is cross-checking the google sheet URL column a realizable goal?
I could take a stab at this. @paracycle wanna collab?
everyone should recognize that WE DO NOT HAVE A SINGLE SHEET THAT STORES ALL URLs ANYMORE. So this would likely be an "in-event" tool -- if the event is well targeted at a single agency or several agencies, this would still be extremely helpful.
@trinberg, users should still be instructed to check the IA first before submission.
If there's no single list of all URLs, then what about at least querying the Wayback Machine's API to ask if a given URL is already in the Internet Archive? Not perfect, but perhaps better than nothing?
@mhucka I think that would be awesome. do you have time to implement (in popup.js)? If not I can try but I will be slow.
Maybe write the response into the same element currently used to show success/failure messages?
@mhucka I had suggested to make it even a bit more proactive in #35. Even without opening the popup, there can be a visual indicator on the toolbar if a non-archived page is visited. I don't know how robust the IA API is though...
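Not tested, but roughly what I have in mind for #35, using the Manifest V2 `chrome.browserAction` badge in the background script (the availability endpoint and the "!" badge text are just placeholders):

```js
// Check each page as it finishes loading, and badge the toolbar icon
// if the Wayback Machine has no snapshot of it.
chrome.tabs.onUpdated.addListener(function (tabId, changeInfo, tab) {
  if (changeInfo.status !== "complete" || !/^https?:/.test(tab.url)) { return; }

  fetch("https://archive.org/wayback/available?url=" + encodeURIComponent(tab.url))
    .then(function (res) { return res.json(); })
    .then(function (data) {
      var archived = data.archived_snapshots && data.archived_snapshots.closest;
      chrome.browserAction.setBadgeText({ tabId: tabId, text: archived ? "" : "!" });
    });
});
```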
@atesgoral do you think that's something you can implement? it sounds great.
@atesgoral That #35 is a great idea. Sorry I didn't see that.
@titaniumbones I don't want to promise (because I keep failing so miserably in keeping to them), but I think I can try to look into this next week.
FYI: I just learned about a different API offered by the Wayback Machine: https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server
This has a helluva lot more functionality than the basic Wayback Machine availability API and is probably the better thing to use for any queries to IA.
ah shit, I just wrote this up while you were typing -- in another bug no less, jeez -- anyway here's my take for the record:
The Internet Archive can tell you whether a URL is already in the archive via a simple API, so:
http://archive.org/wayback/available?url=http://whitehouse.gov
will return
{"archived_snapshots":{"closest":{"available":true,"url":"http://web.archive.org/web/20170202062552/https://www.whitehouse.gov/","timestamp":"20170202062552","status":"200"}}}
or w/ jQuery (note that `$.getJSON` is asynchronous, so the result has to be read in a callback):

```js
$.getJSON("http://archive.org/wayback/available?url=http://whitehouse.gov")
  .done(function (data) {
    console.log(data.archived_snapshots.closest.timestamp);
  });
```
I would love to see this displayed in the `<h3 id="success|error">` elements, with two possibilities:
Is there anyone out there who thinks they might be able to do this by Saturday @mi-lee @sonalranjit @atesgoral @danielballan @geppy?
If you don't know what the pipeline app is, don't worry about it; we are finally moving the URL tracking out of Google Forms, and eventually this app will also move.
@atesgoral if you think your other, cooler idea can also work, please go ahead and implement that! Sat morning is good, Friday is even better, later is also fine but then I will stop worrying about this. Thanks all.
Just for the record, the API you mentioned is what check-ia uses, but it is a suboptimal API – this is the API that's been giving us the problem of occasional false positives. Apparently the CDX API is better and is the thing that IA people themselves use (according to Jefferson). So in terms of functionality, what you wrote above is fine, but for the implementation I recommend not using the API you mentioned there (the one that returns "archived_snapshots").
so the preferred code would be something like (cleaned up into runnable jQuery; the one-month cutoff is from the pseudocode above):

```js
$.getJSON("http://web.archive.org/cdx/search/cdx", { url: "whitehouse.gov", limit: -1, output: "json" })
  .done(function (data) {
    var ts = data[1][1]; // data[0] is the header row; ts looks like "20170202062552"
    var captured = Date.UTC(+ts.slice(0, 4), +ts.slice(4, 6) - 1, +ts.slice(6, 8));
    var archivedWithinMonth = Date.now() - captured < 30 * 24 * 60 * 60 * 1000;
  });
```
yes?
I think without an explicit "Check" button, the popup can go ahead and kick off the check in the background and only enable the Submit button after the lookup has finished/failed/timed out. Aim: reducing the number of clicks the user has to perform. I think we can be agile with this and implement it first in the popup (with the API mentioned above) and then worry about my suggestion in #35 as a nice-to-have.
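Roughly this flow, with hypothetical `#submit`/`#status` element IDs and a `currentTabUrl` that popup.js would get from `chrome.tabs.query`:

```js
// Kick off the lookup as soon as the popup opens; re-enable Submit however
// the request ends (success, failure, or timeout).
$("#submit").prop("disabled", true);

$.ajax({
  url: "http://archive.org/wayback/available",
  data: { url: currentTabUrl },
  dataType: "json",
  timeout: 5000
}).done(function (data) {
  var snap = data.archived_snapshots.closest;
  $("#status").text(snap ? "Archived: " + snap.timestamp : "Not archived yet");
}).always(function () {
  $("#submit").prop("disabled", false);
});
```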
@atesgoral any chance of that in the next, say, 29 hours?
maybe don't worry about disabling the Submit button -- the logic is a little more complex b/c some of the sites we're nominating now are "uncrawlable" -- that is, they may be archived, but because they're complex and JavaScript-heavy, the archived copy may be inadequate. So there might be motivation to save snapshots anyway.
@titaniumbones I can take a stab at this today.
sweet.
I'm currently getting 503 responses from the CDX API. I either got blocked due to spamming them during development, or their infrastructure is flaky.
OK, the problem is gone now. But it's worrisome that I got a bunch of 503s for at least a while. BTW, the implementation is complete; I'm just embellishing it a bit. Will cut a PR soon.
PR #49
Couldn't help doing some non-pertinent refactoring along the way.
Kudos to you @atesgoral!
Regarding the 503 errors: My sense is that 503s are usually more about overload conditions on the server, so it may have been simply too many people trying to access it at the same time. Here's a web page that mentions that the response may include a Retry-After header – perhaps a future update to the nomination tool could check the header and take some action (like try again after the specified time): https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
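If we go that way, here's an untested sketch (it assumes jQuery 3's Promise-compatible deferreds, and that the server actually exposes `Retry-After` to cross-origin scripts):

```js
// On a 503, wait for the Retry-After interval (default 5 s), then retry.
function cdxLookup(url, retriesLeft) {
  return $.getJSON("http://web.archive.org/cdx/search/cdx", { url: url, limit: -1, output: "json" })
    .catch(function (xhr) {
      if (xhr.status !== 503 || retriesLeft <= 0) { throw xhr; }
      var waitSec = parseInt(xhr.getResponseHeader("Retry-After"), 10) || 5;
      return new Promise(function (resolve) {
        setTimeout(function () { resolve(cdxLookup(url, retriesLeft - 1)); }, waitSec * 1000);
      });
    });
}
```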
Hmm. So even if @atesgoral's implementation works, maybe we shouldn't try it out with 200 people all at once in NYC?
Huh. Well, that's a point, if that was the cause of the 503. I guess we don't know for sure? It also depends on the server implementation and capacity. Maybe we should try to reach out to IA people and ask if they have any thoughts on this?
I'll email Jefferson Bailey right now. I'm going to be offline for a couple of hours but if there's a reply later, I'll post what I learn here.
Update: I received replies about the 503s. Jefferson said he suspected a server glitch, and he cc'ed Mark Graham (director of the Wayback Machine), who wrote back:
Our systems SHOULD be able to take the load. Clearly from time to time that is not the case. I would not suggest we try to limit use at this time... and will look into things on our back end.
So, it sounds likely that the 503s were unusual, and we don't have to limit use.
Awesome! I'll re-enable the lookups.
It would be nice to warn the submitter -- prior to submitting -- that a URL has already been submitted.