SamLau95 / nbinteract

Create interactive webpages from Jupyter Notebooks
http://www.samlau.me/nbinteract/
BSD 3-Clause "New" or "Revised" License
233 stars 22 forks source link

Figure out why we're overloading Binder #34

Closed SamLau95 closed 6 years ago

SamLau95 commented 6 years ago

https://github.com/SamLau95/nbinteract/issues/33 and https://github.com/jupyterhub/binder/issues/34 report that the mybinder.org servers are overloaded with notebook servers for the nbinteract docs (https://github.com/SamLau95/nbinteract-image) and the Data 100 textbook (https://github.com/DS-100/textbook).

The morning of Feb 7, 2018, @yuvipanda wisely blocked both repos on mybinder.org and immediately saw pod usage go back to normal levels.

After Yuvi notified me of this, I made the following changes to the places where we use nbinteract:

  1. We only start a Binder server when a user manually clicks a Run Widgets button instead of on page load. Although this was implemented a long time ago, I mis-configured the nbinteract docs and Data 100 textbook (my bad!).

    Example page with Run Widgets button: https://www.textbook.ds100.org/ch4/cleaning_1.html

    To verify the implementation, I visited the page above, opened the Network tab in the dev tools, and made a hard page refresh. I checked that the page makes no requests to mybinder.org until the Run Widgets button is clicked.

  2. Following @minrk 's suggestion, I cache Binder server information in localStorage to re-use servers across page refreshes.

    To verify the implementation, I started a local notebook server and opened a page that used nbinteract locally. I checked that the page re-uses the same kernel across page refreshes and hard refreshes.

After I made these changes, Yuvi unblocked us but had to immediately block us again because mybinder.org was still getting hammered with notebook requests!

My first thought was that many students were visiting the Data 100 textbook with old versions of Javascript cached by their browsers. However, it looks like neither nbinteract docs nor Data 100 textbook got enough traffic to produce the load that the Binder team reported. Here are the pageviews plotted for each hour from 12am-3pm Feb 7, 2018 (Yuvi blocked us for the second time at around 2pm):

Data 100 textbook:

screenshot 2018-02-07 14 15 51

nbinteract docs:

screenshot 2018-02-07 14 17 49

There were a max of 25 pageviews during any given hour across both of these Gitbooks, far below the numerous notebook requests that the mybinder.org servers received. The pageview data is from Google Analytics and matches the separately generated pageview data from the built-in Gitbook platform so I'm inclined to believe its faithfulness.

It looks like something else is going on but I'm not sure what it could be. Here are some possibilities:

  1. Some users have a really faulty cached version of the Javascript that just repeatedly requests kernels from Binder. We should observe a small set of IPs that are repeatedly requesting servers from a browser.
  2. A broken script is running somewhere that repeatedly requests Binder servers. We should observe a smaller set of IPs that are repeated requesting servers from a non-browser User Agent (e.g. curl).

To distinguish between 1), 2), or other possibilities we'll want to know which IPs, devices, and user agents are generating these requests.

@yuvipanda @choldgraf @minrk @betatim Do you have any thoughts or suggestions?

SamLau95 commented 6 years ago

https://github.com/DS-100/textbook/pull/2 removes nbinteract entirely from the Data 100 textbook. To check that this succeeded, I opened https://www.textbook.ds100.org/ch4/cleaning_1.html# and verified in the Network tab of the devtools that the page makes no requests to mybinder.org.

If there are no further requests to launch Binder images for the textbook, this suggests that some lingering bug in the Javascript implementation overloaded the Binder servers. If there are still requests for the textbook Binder image, we might have a caching issue or something else entirely. I'll wait on Yuvi to get back to us.

yuvipanda commented 6 years ago

There are still requests being made at about the same rate as before :)

On Wed, Feb 7, 2018 at 5:25 PM, Sam Lau notifications@github.com wrote:

DS-100/textbook#2 https://github.com/DS-100/textbook/pull/2 removes nbinteract entirely from the Data 100 textbook. To verify that this succeeded, I opened https://www.textbook.ds100.org/ch4/cleaning_1.html# and verified in the Network tab of the devtools that the page makes no requests to mybinder.org.

If there are no further requests to launch Binder images for the textbook, this suggests that some lingering bug in the Javascript implementation overloaded the Binder servers. I'll wait on Yuvi to get back to us.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/SamLau95/nbinteract/issues/34#issuecomment-363971201, or mute the thread https://github.com/notifications/unsubscribe-auth/AAB23l0ATmvuSudRjoTxkafZtrutMfskks5tSkzwgaJpZM4R9p0m .

-- Yuvi Panda T http://yuvi.in/blog

SamLau95 commented 6 years ago

@yuvipanda Wow :(. Is there anything I can do to help out? Happy to parse through traffic logs and analyze data if need be.

yuvipanda commented 6 years ago

I think we fixed our banning code (Because the repo's normalized name is DS-100/textbook, not ds-100/textbook) so I think we're ok for now. We can look at logs later (tomorrow maybe?) and see if this is coming from one person.

On Wed, Feb 7, 2018 at 5:50 PM, Sam Lau notifications@github.com wrote:

@yuvipanda https://github.com/yuvipanda Wow :(. Is there anything I can do to help out? Happy to parse through traffic logs and analyze data if need be.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/SamLau95/nbinteract/issues/34#issuecomment-363975724, or mute the thread https://github.com/notifications/unsubscribe-auth/AAB23nUcBoePhotcHzmbhhBW0RkXSLsJks5tSlLWgaJpZM4R9p0m .

-- Yuvi Panda T http://yuvi.in/blog

betatim commented 6 years ago

Could this be something like https://github.com/jupyterhub/binderhub/issues/317?

yuvipanda commented 6 years ago

That's the most likely culprit.

On Wed, Feb 7, 2018 at 11:09 PM, Tim Head notifications@github.com wrote:

Could this be something like jupyterhub/binderhub#317 https://github.com/jupyterhub/binderhub/issues/317?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/SamLau95/nbinteract/issues/34#issuecomment-364021989, or mute the thread https://github.com/notifications/unsubscribe-auth/AAB23gsEE8Hbb_W-SagFHSkz1fXyrcG_ks5tSp2lgaJpZM4R9p0m .

-- Yuvi Panda T http://yuvi.in/blog

minrk commented 6 years ago

That was my suspicion, too, but the code looks like it's closing the eventsource correctly. Double-checking that is a good place to start, though.

SamLau95 commented 6 years ago

Although it looks like the code closes EventSources correctly when a server is launched, I found out that we never closed the EventSource when the server fails to launch which caused the EventSource to continue making launch requests. This is fixed in #35.

Still, I'm not entirely convinced this was the reason behind the Binder flooding since I can only recreate this behavior when I turn off my internet or when I use a banned image.

@yuvipanda , are you guys still getting flooded with requests for DS-100/textbook? If so, I'd love to take a look at the logs or other data that might give us a better sense of what's happening.

choldgraf commented 6 years ago

These are the logs on the binder pod that have DS-100 in them in the last 24hr:

choldgraf@barbecue ~ $ kubectl --namespace=prod logs --since 24h binder-6d7694565c-ljg6q | grep DS-100
[I 180209 19:15:12 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.9.6) 1.75ms
[I 180209 19:15:15 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.9.6) 1.66ms
[I 180209 19:15:18 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.9.6) 1.59ms
[I 180209 19:15:23 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.9.6) 1.71ms
SamLau95 commented 6 years ago

@choldgraf Heya, can I request to see the result of that command again? I've added nbinteract back to the DS-100 textbook as of yesterday morning (~30 hours ago). If the JS is still making too many requests for Binder servers we should see them in the logs for the past 24h.

The textbook analytics say that we're averaging ~20 pageviews per hour, so we should hopefully see less than ~100 requests total. As a caveat, each time someone clicks the Run Widgets button the JS makes 3 attempts to start the server before giving up so the actual number might be higher. Still, we should hopefully not see a continuous stream of requests as we did before.

choldgraf commented 6 years ago

Here's what I've got:

[I 180211 22:37:43 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.64ms
[I 180211 22:37:46 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.67ms
[I 180211 22:37:49 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.50ms
[I 180212 00:35:34 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.77ms
[I 180212 00:35:37 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.57ms
[I 180212 00:35:40 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.56ms
[I 180212 04:49:15 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.72ms
[I 180212 04:49:18 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.80ms
[I 180212 04:49:21 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.77ms

seems reasonable? (about to sign off for the night but can spot-check problems tomorrow)

yuvipanda commented 6 years ago

Can you try turning it on when pointing it at staging.mybinder.org, and we can unban it there and see what happens? We should set an explicit time for this to happen (ideally not involving me! :D) so everyone is around and can turn things on / off when we have problems...

On Sun, Feb 11, 2018 at 9:02 PM, Chris Holdgraf notifications@github.com wrote:

Here's what I've got:

[I 180211 22:37:43 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.64ms [I 180211 22:37:46 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.67ms [I 180211 22:37:49 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.50ms [I 180212 00:35:34 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.77ms [I 180212 00:35:37 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.57ms [I 180212 00:35:40 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.56ms [I 180212 04:49:15 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.72ms [I 180212 04:49:18 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.80ms [I 180212 04:49:21 web:2064] 200 GET /build/gh/DS-100/textbook/master (10.12.8.6) 1.77ms

seems reasonable? (about to sign off for the night but can spot-check problems tomorrow)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/SamLau95/nbinteract/issues/34#issuecomment-364829677, or mute the thread https://github.com/notifications/unsubscribe-auth/AAB23h7YgN2PLCJWVUmBi8MjpfwB4CtCks5tT8XvgaJpZM4R9p0m .

-- Yuvi Panda T http://yuvi.in/blog

SamLau95 commented 6 years ago

@yuvipanda sounds like a great idea to test things out. Should I get in contact with @choldgraf to set a time?

choldgraf commented 6 years ago

@SamLau95 let's do this when we meet this week...it should be relatively clear if staging is 💥 when we un-ban it yeah?

SamLau95 commented 6 years ago

@choldgraf Sounds good to me. See you Wednesday!

SamLau95 commented 6 years ago

After pushing some fixes, we unbanned nbinteract on the staging mybinder.org for a week and didn't see Binder overload. Because of this, we've just unbanned nbinteract on production and we consider this issue resolved for now.