ACLay opened 5 months ago
I've just gotten off a debug Zoom call with Tino. I had him email a copy of his localStorage values to our support inbox first; then, with the localStorage values cleared, he was able to switch roles to FAP member without issue. However, attempting to view a proposal still puts it into a reload loop: the details of the proposal briefly flash onto the screen, then it goes back to a loading icon for a couple of seconds, and repeats indefinitely. It's at least not blocking UI interactivity, as the close proposal X in the top left of the page still works and the proposal list seems stable. Apparently this was an issue he was having before. The "grade proposal" link on the FAP page is having similar loading issues.
The brief attempt to load a proposal seemed to coincide with another spike in GraphQL activity, though shorter than the ones from Monday, I think because I had him back out of the page:
I also found out that he'd been sending reviews to the FAP secretary for entry to the system, so where we thought incognito mode was allowing him to enter reviews, that was not the case.
I really hope none of the issues he had on the call were related to outages from today's Oracle DB maintenance affecting the UOWS...
Local storage values from their browser immediately after signing in to UOP:
Key | Value |
---|---|
drawerOpen | 0 |
user | {"created":"2022-08-26T13:05:56.137Z","email":REDACTED,"firstname":"REDACTED","id":1053132,"lastname":"REDACTED","oidcSub":"","institutionId":1,"placeholder":false,"position":"","preferredname":"REDACTED"} |
currentRole | INSTRUMENT_SCIENTIST |
expToken | 1717669341 |
isInternalUser | true |
impersonatingUserId | undefined |
token | (base64-decoded and redacted to the block below) |
```json
{
  "user": {
    "created": "2022-08-26T13:05:56.137Z",
    "email": "REDACTED",
    "firstname": "REDACTED",
    "id": 1053132,
    "lastname": "REDACTED",
    "oidcSub": "",
    "institutionId": 1,
    "placeholder": false,
    "position": "",
    "preferredname": "REDACTED"
  },
  "roles": [
    {
      "id": 1,
      "shortCode": "user",
      "title": "User"
    },
    {
      "id": 6,
      "shortCode": "fap_reviewer",
      "title": "FAP Reviewer"
    },
    {
      "id": 7,
      "shortCode": "instrument_scientist",
      "title": "Experiment Scientist"
    }
  ],
  "currentRole": {
    "id": 7,
    "shortCode": "instrument_scientist",
    "title": "Experiment Scientist"
  },
  "isInternalUser": true,
  "externalToken": "REDACTED",
  "iat": 1717582941,
  "exp": 1717669341
}
```
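For reference, that payload can be pulled straight out of the browser during a debug session. A minimal sketch, assuming the stored `token` value is a standard three-part JWT (header.payload.signature); this is a dev-tools console helper, not code from the repo:

```typescript
// Decode the JWT held in localStorage so its payload can be inspected/redacted.
// Run from the browser dev-tools console on the UOP domain.
function decodeStoredToken(): unknown {
  const raw = localStorage.getItem('token');
  if (!raw) {
    throw new Error('No token found in localStorage');
  }
  // The payload is the middle segment, base64url-encoded.
  const payload = raw.split('.')[1];
  const json = atob(payload.replace(/-/g, '+').replace(/_/g, '/'));
  return JSON.parse(json);
}

console.log(JSON.stringify(decodeStoredToken(), null, 2));
```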
Is the problem happening with a specific proposal?
As George has mentioned, it would also be interesting to know whether the other reviewers involved in these proposals get the same problem.
Queries that showed unique usage spikes in Apollo Studio around the time of today's Zoom test with Tino (in order of total calls today):
Queries sent when clicking "view proposal" on dev:
Queries sent when clicking close on that proposal:
Queries sent when refreshing page on that proposal:
When looking at assigned proposals as a FAP reviewer, it seems that the `Grade proposal` and `View review` actions are backed by the same function/modal, so at least we just have the one bug and code path to look into.
Local environment on the `master` branch.
Clicking "view proposal" on a reviewed proposal calls:
Clicking "grade proposal" instead:
I've been looking at this with @TCMeldrum today, and we're still unable to replicate it. I'm gonna email support and Emma, asking if any other instrument scientists had issues accessing the system. At a guess, users 7503 & 8494 might have, as they haven't uploaded any of their requested reviews. However, it's possible it affects others who just ignored/worked around the issue like Tino did, by sending their reviews to the panel's secretary to upload on their behalf.
8494 is another possible candidate for the issue, as, like Tino, they're one of only 4 accounts that have been assigned reviews and are also marked as an instrument scientist on the legacy ISIS "instrument", according to this db query:
```sql
select *
from fap_reviewers fr
join instrument_has_scientists ihs
  on fr.user_id = ihs.user_id
```
The proxy monitoring job just hit the warning threshold again. The requests came from a DHCP-assigned IP at Tino's university again, and the sessionId in use when it started links back to his account.
A different set of requests is making up the bombardment this time though (1pm–2pm range in Apollo Studio):
@ACLay does this issue affect FAP reviews, or just the meetings?
It was happening when he attempted to enter reviews. Him not having submitted reviews would impact their ability to run the meetings, but I don't know if it'd affect his ability to use other parts of the site.
@ACLay I hadn't seen your response until now - since this has the potential to impact reviews, this should be addressed as part of the ongoing FAP reviews work.
@TCMeldrum @Bhaswati1148 I'll add it to the FAP reviews epic, and I think it should be prioritised in the coming weeks.
I emailed Tino asking him to try accessing the site again, and apparently it's working fine for him at the moment.
Ideally I'd have liked to know what the issue was and what fixed it, but as long as this state holds I'll take it.
Hopefully it will hold and we won't end up back at square one when he starts being assigned proposals to review...
We may have had a recurrence of this. A single IP attached to an ISP in Indonesia was responsible for ~10% of our external web proxy's traffic today over the course of 30 minutes. Unlike last round, I haven't been able to pin down a specific user.
That IP had a couple more usage spikes today, and we tracked it back to user number 1151947.
@TCMeldrum found that by forcing this check to always be true he could get similar effects locally. We've since been able to get similar-looking load spikes to occur on prod with our own accounts by setting our computers' clocks into the future, so that the frontend thinks the expiry time on even a fresh JWT is in the past.
We're planning to message the user and ask them to check their computer's clock, and if that fixes it for them we can look into a long-term fix.
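For clarity, this is the shape of the check being described; a hypothetical sketch (we haven't confirmed the exact code), assuming the frontend compares the token's `exp` claim against the local clock:

```typescript
// Hypothetical sketch of the expiry check, not the project's actual code.
// 'exp' is the JWT expiry claim in seconds since the epoch.
function isTokenExpired(exp: number): boolean {
  // Date.now() reads the *local* clock. If the user's machine is set far enough
  // into the future, even a freshly issued token fails this check, gets thrown
  // away and re-fetched, the replacement fails the same check, and the page loops.
  return exp * 1000 < Date.now();
}
```

The token above has exp − iat = 86400 s, i.e. a 24-hour lifetime, so the clock would presumably need to be more than a day fast for a brand-new token to fail immediately, which matches the "set the clock into the future" reproduction rather than ordinary drift.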
Additional digging suggests there were 3 users from Indonesia's National Research and Innovation Agency accessing our systems, presumably behind network address translation, which caused them to share an IPv4 address. The user numbers are 1153026, 1130797, and 1151947. One of them has submitted proposals, and the one we contacted seemed to have their clock set correctly.
We've not yet found another potential trigger for the DoS-esque behaviour.
Since 5pm yesterday, my local setup has been producing this sort of infinite page reloading effect without any timezone trickery. I haven't tried other browsers, but it's affecting Firefox in both a regular and a private window. @TCMeldrum has tested my current branch on his machine but isn't seeing the effect.
From what we've been able to gather from dev tools so far, the `PageLayout` component is being continually re-rendered due to changes in its hooks. The exact hooks vary from render to render, but the most common seems to be the loading state of the `useCallsData` hook oscillating between true and false. That said, `useCallsData` may just be toggling as a result of `PageLayout` being recreated, so... I dunno.
I've figured out the issue with my local setup, and why the branch was only erroring for me.
I was passing a lambda function as an argument to the `useDataApi` React hook used to query the API. That lambda was produced by a factory function, so each call to `useDataApi` saw a different instance of the function in its parameter list, causing everything that used it to re-render.
However, the factory would only return that lambda if a specific feature flag was set. If the db script to add that feature wasn't run when Tom checked out the branch (which I think requires a container restart, rather than just a checkout like the code reloading), then the feature wouldn't be set in the frontend, and it would instead use a normal, constant function to set the header.
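A stripped-down sketch of that failure mode, with illustrative names (`makeHeaderSetter`, `useDataApiSketch`) standing in for the real factory and the `useDataApi` hook; not the actual repo code:

```tsx
import { useEffect, useState } from 'react';

// Stable reference used when the feature flag is off: no loop.
const defaultHeaderSetter = (headers: Record<string, string>) => headers;

function makeHeaderSetter(featureEnabled: boolean) {
  if (!featureEnabled) {
    return defaultHeaderSetter;
  }
  // A *new* function instance is created on every render when the flag is on.
  return (headers: Record<string, string>) => ({ ...headers, 'x-feature': 'on' });
}

// Stand-in for useDataApi: refetches whenever its setHeader argument changes.
function useDataApiSketch(
  setHeader: (h: Record<string, string>) => Record<string, string>
) {
  const [loading, setLoading] = useState(true);
  useEffect(() => {
    setLoading(true);
    fetch('/graphql', { method: 'POST', headers: setHeader({}) }).finally(() =>
      setLoading(false)
    );
    // setHeader is an effect dependency: a fresh instance each render re-runs the
    // fetch, toggles `loading`, triggers a re-render, which creates yet another
    // instance... so loading oscillates between true and false indefinitely.
  }, [setHeader]);
  return loading;
}

function PageSketch({ featureEnabled }: { featureEnabled: boolean }) {
  const loading = useDataApiSketch(makeHeaderSetter(featureEnabled));
  return loading ? <span>Loading…</span> : <span>Loaded</span>;
}
```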
If this incident is related to the issues seen by those few users, maybe there's something odd in the combination of their account settings and the data they're requesting that triggers a similar hook-dependency-loop edge case?
Wrapping my function parameter in a `useCallback` to cache it stops my local setup from looping.
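A sketch of that fix, reusing the illustrative `useDataApiSketch` from the sketch above: memoising the lambda gives the hook a stable reference, so the effect only re-runs when the flag itself changes.

```tsx
import { useCallback } from 'react';

function PageSketchFixed({ featureEnabled }: { featureEnabled: boolean }) {
  // Memoised: the same function instance is reused across renders until
  // featureEnabled changes, so the effect inside the data hook stays quiet.
  const setHeader = useCallback(
    (headers: Record<string, string>) =>
      featureEnabled ? { ...headers, 'x-feature': 'on' } : headers,
    [featureEnabled]
  );
  const loading = useDataApiSketch(setHeader);
  return loading ? <span>Loading…</span> : <span>Loaded</span>;
}
```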
I've gone back through our auth service logs and cross-referenced login times for the user numbers linked to the IP with the usage spikes. It seems like 1153026 was behind the spikes on the 14th.
On the 15th, 1153026 logged in as the first spike started at 7:32, and 1130797 logged in as the 2nd started at 8:11.
So it seems none of the usage spikes are attributable to 1151947, who, as the only one we had identified from that IP at the time, was the user we contacted to check their clock.
API usage charts for the October 14th and 15th spikes:
Checking the container logs for the output of the `addClientLog` calls, they seem to mostly be UOWS errors and don't quite line up with the login times, so they might just be failing for other users due to the higher load?
In case it does turn out to be a system clock thing, I've found a very similar-looking time check in `UserContextProvider`.
What is the problem and why is it a problem
One of the ISIS FAP secretaries (user number 1053132) was unable to access the system in the FAP member role. They defaulted to instrument scientist, and attempting to switch caused them to see an empty page with the "Server rejected user credentials" error notification.
It seems that this caused their browser to attempt to reload the page's content indefinitely: we saw enormous spikes in GraphQL traffic whenever they attempted access, so much so that it caused a tenfold increase in the number of calls made to our API compared to the previous day, and triggered alerts from our traffic monitoring job about a suspected denial-of-service attack.
Due to the "server rejected user credentials" message, we determined that something must be going wrong in the `getToken` mutation, which was by far the most-called query that day. On realising just how much access to localStorage takes place in the role-switching code, we asked them to try using an incognito window, which let them log in and use the system normally. The working theory is that some old value had been persisted that caused the errors. I've asked them to get in contact for a debugging session with us before they clear out their local data for the domain, so we can try to isolate the problematic value.
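If we get that debugging session, a small console helper along these lines (not part of the codebase; the key name below is taken from the localStorage table above) would let us capture everything before removing suspect keys one at a time:

```typescript
// Dump every localStorage entry for the domain so it can be saved before we
// start deleting values. Run from the browser dev-tools console.
function dumpLocalStorage(): Record<string, string | null> {
  const entries: Record<string, string | null> = {};
  for (let i = 0; i < localStorage.length; i++) {
    const key = localStorage.key(i);
    if (key !== null) {
      entries[key] = localStorage.getItem(key);
    }
  }
  return entries;
}

console.table(dumpLocalStorage());
// Then remove one candidate at a time and retry the role switch, e.g.:
// localStorage.removeItem('currentRole');
```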
Steps to reproduce (if it's a bug).