Closed SamLau95 closed 6 years ago
https://github.com/DS-100/textbook/pull/2 removes nbinteract
entirely from the Data 100 textbook. To check that this succeeded, I opened https://www.textbook.ds100.org/ch4/cleaning_1.html# and verified in the Network tab of the devtools that the page makes no requests to mybinder.org.
If there are no further requests to launch Binder images for the textbook, this suggests that some lingering bug in the Javascript implementation overloaded the Binder servers. If there are still requests for the textbook Binder image, we might have a caching issue or something else entirely. I'll wait on Yuvi to get back to us.
There are still requests being made at about the same rate as before :)
On Wed, Feb 7, 2018 at 5:25 PM, Sam Lau notifications@github.com wrote:
DS-100/textbook#2 https://github.com/DS-100/textbook/pull/2 removes nbinteract entirely from the Data 100 textbook. To verify that this succeeded, I opened https://www.textbook.ds100.org/ch4/cleaning_1.html# and verified in the Network tab of the devtools that the page makes no requests to mybinder.org.
If there are no further requests to launch Binder images for the textbook, this suggests that some lingering bug in the Javascript implementation overloaded the Binder servers. I'll wait on Yuvi to get back to us.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/SamLau95/nbinteract/issues/34#issuecomment-363971201, or mute the thread https://github.com/notifications/unsubscribe-auth/AAB23l0ATmvuSudRjoTxkafZtrutMfskks5tSkzwgaJpZM4R9p0m .
-- Yuvi Panda T http://yuvi.in/blog
@yuvipanda Wow :(. Is there anything I can do to help out? Happy to parse through traffic logs and analyze data if need be.
I think we fixed our banning code (Because the repo's normalized name is DS-100/textbook, not ds-100/textbook) so I think we're ok for now. We can look at logs later (tomorrow maybe?) and see if this is coming from one person.
On Wed, Feb 7, 2018 at 5:50 PM, Sam Lau notifications@github.com wrote:
@yuvipanda https://github.com/yuvipanda Wow :(. Is there anything I can do to help out? Happy to parse through traffic logs and analyze data if need be.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/SamLau95/nbinteract/issues/34#issuecomment-363975724, or mute the thread https://github.com/notifications/unsubscribe-auth/AAB23nUcBoePhotcHzmbhhBW0RkXSLsJks5tSlLWgaJpZM4R9p0m .
-- Yuvi Panda T http://yuvi.in/blog
Could this be something like https://github.com/jupyterhub/binderhub/issues/317?
That's the most likely culprit.
On Wed, Feb 7, 2018 at 11:09 PM, Tim Head notifications@github.com wrote:
Could this be something like jupyterhub/binderhub#317 https://github.com/jupyterhub/binderhub/issues/317?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/SamLau95/nbinteract/issues/34#issuecomment-364021989, or mute the thread https://github.com/notifications/unsubscribe-auth/AAB23gsEE8Hbb_W-SagFHSkz1fXyrcG_ks5tSp2lgaJpZM4R9p0m .
-- Yuvi Panda T http://yuvi.in/blog
That was my suspicion, too, but the code looks like it's closing the eventsource correctly. Double-checking that is a good place to start, though.
Although it looks like the code closes EventSources correctly when a server is launched, I found out that we never closed the EventSource when the server fails to launch which caused the EventSource to continue making launch requests. This is fixed in #35.
Still, I'm not entirely convinced this was the reason behind the Binder flooding since I can only recreate this behavior when I turn off my internet or when I use a banned image.
@yuvipanda , are you guys still getting flooded with requests for DS-100/textbook
? If so, I'd love to take a look at the logs or other data that might give us a better sense of what's happening.
These are the logs on the binder
pod that have DS-100
in them in the last 24hr:
choldgraf@barbecue ~ $ kubectl --namespace=prod logs --since 24h binder-6d7694565c-ljg6q | grep DS-100
[I 180209 19:15:12 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.9.6) 1.75ms
[I 180209 19:15:15 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.9.6) 1.66ms
[I 180209 19:15:18 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.9.6) 1.59ms
[I 180209 19:15:23 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.9.6) 1.71ms
@choldgraf Heya, can I request to see the result of that command again? I've added nbinteract
back to the DS-100 textbook as of yesterday morning (~30 hours ago). If the JS is still making too many requests for Binder servers we should see them in the logs for the past 24h.
The textbook analytics say that we're averaging ~20 pageviews per hour, so we should hopefully see less than ~100 requests total. As a caveat, each time someone clicks the Run Widgets button the JS makes 3 attempts to start the server before giving up so the actual number might be higher. Still, we should hopefully not see a continuous stream of requests as we did before.
Here's what I've got:
[I 180211 22:37:43 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.64ms
[I 180211 22:37:46 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.67ms
[I 180211 22:37:49 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.50ms
[I 180212 00:35:34 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.77ms
[I 180212 00:35:37 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.57ms
[I 180212 00:35:40 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.56ms
[I 180212 04:49:15 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.72ms
[I 180212 04:49:18 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.80ms
[I 180212 04:49:21 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.77ms
seems reasonable? (about to sign off for the night but can spot-check problems tomorrow)
Can you try turning it on when pointing it at staging.mybinder.org, and we can unban it there and see what happens? We should set an explicit time for this to happen (ideally not involving me! :D) so everyone is around and can turn things on / off when we have problems...
On Sun, Feb 11, 2018 at 9:02 PM, Chris Holdgraf notifications@github.com wrote:
Here's what I've got:
[I 180211 22:37:43 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.64ms [I 180211 22:37:46 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.67ms [I 180211 22:37:49 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.50ms [I 180212 00:35:34 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.77ms [I 180212 00:35:37 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.57ms [I 180212 00:35:40 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.56ms [I 180212 04:49:15 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.72ms [I 180212 04:49:18 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.80ms [I 180212 04:49:21 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.77ms
seems reasonable? (about to sign off for the night but can spot-check problems tomorrow)
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/SamLau95/nbinteract/issues/34#issuecomment-364829677, or mute the thread https://github.com/notifications/unsubscribe-auth/AAB23h7YgN2PLCJWVUmBi8MjpfwB4CtCks5tT8XvgaJpZM4R9p0m .
-- Yuvi Panda T http://yuvi.in/blog
@yuvipanda sounds like a great idea to test things out. Should I get in contact with @choldgraf to set a time?
@SamLau95 let's do this when we meet this week...it should be relatively clear if staging is 💥 when we un-ban it yeah?
@choldgraf Sounds good to me. See you Wednesday!
After pushing some fixes, we unbanned nbinteract on the staging mybinder.org for a week and didn't see Binder overload. Because of this, we've just unbanned nbinteract on production and we consider this issue resolved for now.
https://github.com/SamLau95/nbinteract/issues/33 and https://github.com/jupyterhub/binder/issues/34 report that the mybinder.org servers are overloaded with notebook servers for the nbinteract docs (https://github.com/SamLau95/nbinteract-image) and the Data 100 textbook (https://github.com/DS-100/textbook).
The morning of Feb 7, 2018, @yuvipanda wisely blocked both repos on mybinder.org and immediately saw pod usage go back to normal levels.
After Yuvi notified me of this, I made the following changes to the places where we use
nbinteract
:We only start a Binder server when a user manually clicks a Run Widgets button instead of on page load. Although this was implemented a long time ago, I mis-configured the nbinteract docs and Data 100 textbook (my bad!).
Example page with Run Widgets button: https://www.textbook.ds100.org/ch4/cleaning_1.html
To verify the implementation, I visited the page above, opened the Network tab in the dev tools, and made a hard page refresh. I checked that the page makes no requests to
mybinder.org
until the Run Widgets button is clicked.Following @minrk 's suggestion, I cache Binder server information in
localStorage
to re-use servers across page refreshes.To verify the implementation, I started a local notebook server and opened a page that used
nbinteract
locally. I checked that the page re-uses the same kernel across page refreshes and hard refreshes.After I made these changes, Yuvi unblocked us but had to immediately block us again because mybinder.org was still getting hammered with notebook requests!
My first thought was that many students were visiting the Data 100 textbook with old versions of Javascript cached by their browsers. However, it looks like neither nbinteract docs nor Data 100 textbook got enough traffic to produce the load that the Binder team reported. Here are the pageviews plotted for each hour from 12am-3pm Feb 7, 2018 (Yuvi blocked us for the second time at around 2pm):
Data 100 textbook:
nbinteract docs:
There were a max of 25 pageviews during any given hour across both of these Gitbooks, far below the numerous notebook requests that the mybinder.org servers received. The pageview data is from Google Analytics and matches the separately generated pageview data from the built-in Gitbook platform so I'm inclined to believe its faithfulness.
It looks like something else is going on but I'm not sure what it could be. Here are some possibilities:
curl
).To distinguish between 1), 2), or other possibilities we'll want to know which IPs, devices, and user agents are generating these requests.
@yuvipanda @choldgraf @minrk @betatim Do you have any thoughts or suggestions?