ACLay opened 5 months ago
I've just gotten off a debug Zoom call with Tino. I had him email a copy of his localStorage values to our support inbox first; then, with the localStorage values cleared, he was able to switch roles to FAP member without issue. However, attempting to view a proposal still puts it into a reload loop: the details of the proposal briefly flash onto the screen, then it goes back to a loading icon for a couple of seconds, and repeats indefinitely. It's at least not blocking UI interactivity, as the close proposal X in the top left of the page still works and the proposal list seems stable. Apparently this was an issue he was having before. The "grade proposal" link on the FAP page is having similar loading issues.
The brief attempt to load a proposal seemed to coincide with another spike in GraphQL activity, though shorter than the ones from Monday, I think because I had him back out of the page:
I also found out that he'd been sending reviews to the FAP secretary for entry to the system, so where we thought incognito mode was allowing him to enter reviews, that was not the case.
I really hope none of the issues he had on the call were related to outages from today's Oracle DB maintenance affecting the UOWS...
Local storage values from their browser immediately after signing in to UOP:
Key | Value |
---|---|
drawerOpen | 0 |
user | {"created":"2022-08-26T13:05:56.137Z","email":REDACTED,"firstname":"REDACTED","id":1053132,"lastname":"REDACTED","oidcSub":"","institutionId":1,"placeholder":false,"position":"","preferredname":"REDACTED"} |
currentRole | INSTRUMENT_SCIENTIST |
expToken | 1717669341 |
isInternalUser | true |
impersonatingUserId | undefined |
token | (base64-decoded and redacted to the block below) |
```json
{
  "user": {
    "created": "2022-08-26T13:05:56.137Z",
    "email": "REDACTED",
    "firstname": "REDACTED",
    "id": 1053132,
    "lastname": "REDACTED",
    "oidcSub": "",
    "institutionId": 1,
    "placeholder": false,
    "position": "",
    "preferredname": "REDACTED"
  },
  "roles": [
    {
      "id": 1,
      "shortCode": "user",
      "title": "User"
    },
    {
      "id": 6,
      "shortCode": "fap_reviewer",
      "title": "FAP Reviewer"
    },
    {
      "id": 7,
      "shortCode": "instrument_scientist",
      "title": "Experiment Scientist"
    }
  ],
  "currentRole": {
    "id": 7,
    "shortCode": "instrument_scientist",
    "title": "Experiment Scientist"
  },
  "isInternalUser": true,
  "externalToken": "REDACTED",
  "iat": 1717582941,
  "exp": 1717669341
}
```
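For reference, that payload can be pulled straight out of the browser during a debug session. A minimal sketch, assuming the stored `token` value is a standard three-part JWT (header.payload.signature); this is a dev-tools console helper, not code from the repo:

```typescript
// Decode the JWT held in localStorage so its payload can be inspected/redacted.
// Run from the browser dev-tools console on the UOP domain.
function decodeStoredToken(): unknown {
  const raw = localStorage.getItem('token');
  if (!raw) {
    throw new Error('No token found in localStorage');
  }
  // The payload is the middle segment, base64url-encoded.
  const payload = raw.split('.')[1];
  const json = atob(payload.replace(/-/g, '+').replace(/_/g, '/'));
  return JSON.parse(json);
}

console.log(JSON.stringify(decodeStoredToken(), null, 2));
```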
Is the problem happening with a specific proposal?
As George has mentioned, it would also be interesting to know whether the other reviewers involved in these proposals get the same problem.
Queries that showed unique usage spikes in Apollo Studio around the time of today's Zoom test with Tino (in order of total calls today):
Queries sent when clicking "view proposal" on dev:
Queries sent when clicking close on that proposal:
Queries sent when refreshing page on that proposal:
When looking at assigned proposals as a FAP reviewer, it seems that the `Grade proposal` and `View review` actions are backed by the same function/modal, so at least we just have the one bug and code path to look into.
Local environment on the `master` branch.
Clicking "view proposal" on a reviewed proposal calls:
Clicking "grade proposal" instead:
I've been looking at this with @TCMeldrum today, and we're still unable to replicate it. I'm gonna email support and Emma, asking if any other instrument scientists had issues accessing the system. At a guess, users 7503 & 8494 might have, as they haven't uploaded any of their requested reviews. However, it's possible it affects others who just ignored/worked around the issue like Tino did, by sending their reviews to the panel's secretary to upload on their behalf.
8494 is another possible candidate for the issue, as, like Tino, they're one of only 4 accounts that have been assigned reviews and are also marked as an instrument scientist on the legacy ISIS "instrument", according to this db query:
```sql
select *
from fap_reviewers fr
join instrument_has_scientists ihs
  on fr.user_id = ihs.user_id
```
The proxy monitoring job just hit the warning threshold again. The requests came from a DHCP-assigned IP at Tino's university again, and the sessionId in use when it started links back to his account.
A different set of requests is making up the bombardment this time though (1pm–2pm range in Apollo Studio):
@ACLay does this issue affect FAP reviews, or just the meetings?
It was happening when he attempted to enter reviews. Him not having submitted reviews would impact their ability to run the meetings, but I don't know if it'd affect his ability to use other parts of the site.
@ACLay I hadn't seen your response until now - since this has the potential to impact reviews, this should be addressed as part of the ongoing FAP reviews work.
@TCMeldrum @Bhaswati1148 I'll add it to the FAP reviews epic, and I think it should be prioritised in the coming weeks.
I emailed Tino asking him to try accessing the site again, and apparently it's working fine for him at the moment.
Ideally I'd have liked to know what the issue was and what fixed it, but as long as this state holds I'll take it.
Hopefully it will hold and we won't end up back at square one when he starts being assigned proposals to review...
We may have had a recurrence of this. A single IP attached to an ISP in Indonesia was responsible for ~10% of our external web proxy's traffic today over the course of 30 minutes. Unlike last round, I haven't been able to pin down a specific user.
That IP had a couple more usage spikes today, and we tracked it back to user number 1151947.
@TCMeldrum found that by forcing this check to always be true he could get similar effects locally. We've since been able to get similar-looking load spikes to occur on prod with our own accounts by setting our computers' clocks into the future, so that the frontend thinks the expiry time on even a fresh JWT is in the past.
We're planning to message the user and ask them to check their computer's clock, and if that fixes it for them we can look into a long-term fix.
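For clarity, this is the shape of the check being described; a hypothetical sketch (we haven't confirmed the exact code), assuming the frontend compares the token's `exp` claim against the local clock:

```typescript
// Hypothetical sketch of the expiry check, not the project's actual code.
// 'exp' is the JWT expiry claim in seconds since the epoch.
function isTokenExpired(exp: number): boolean {
  // Date.now() reads the *local* clock. If the user's machine is set far enough
  // into the future, even a freshly issued token fails this check, gets thrown
  // away and re-fetched, the replacement fails the same check, and the page loops.
  return exp * 1000 < Date.now();
}
```

The token above has exp − iat = 86400 s, i.e. a 24-hour lifetime, so the clock would presumably need to be more than a day fast for a brand-new token to fail immediately, which matches the "set the clock into the future" reproduction rather than ordinary drift.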
Additional digging suggests there were 3 users from Indonesia's National Research and Innovation Agency accessing our systems, presumably behind network address translation, which caused them to share an IPv4 address. The user numbers are 1153026, 1130797, and 1151947. One of them has submitted proposals, and the one we contacted seemed to have their clock set correctly.
We've not yet found another potential trigger for the DoS-esque behaviour.
Since 5pm yesterday, my local setup has been producing this sort of infinite page reloading effect without any timezone trickery. I haven't tried other browsers, but it's affecting Firefox in both a regular and a private window. @TCMeldrum has tested my current branch on his machine but isn't seeing the effect.
From what we've been able to gather from dev tools so far, the `PageLayout` component is being continually re-rendered due to changes in its hooks. The exact hooks vary from render to render, but the most common seems to be the loading state of the `useCallsData` hook oscillating between true and false. That said, `useCallsData` may just be toggling as a result of `PageLayout` being recreated, so... I dunno.
I've figured out the issue with my local setup, and why the branch was only erroring for me.
I was passing a lambda function as an argument to the `useDataApi` React hook used to query the API. That lambda was produced by a factory function, so each call to `useDataApi` saw a different instance of the function in its parameter list, causing everything that used it to re-render.
However, the factory would only return that lambda if a specific feature flag was set. If the db script to add that feature wasn't run when Tom checked out the branch (which I think requires a container restart, rather than just a checkout like the code reloading), then the feature wouldn't be set in the frontend, and it would instead use a normal, constant function to set the header.
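A stripped-down sketch of that failure mode, with illustrative names (`makeHeaderSetter`, `useDataApiSketch`) standing in for the real factory and the `useDataApi` hook; not the actual repo code:

```tsx
import { useEffect, useState } from 'react';

// Stable reference used when the feature flag is off: no loop.
const defaultHeaderSetter = (headers: Record<string, string>) => headers;

function makeHeaderSetter(featureEnabled: boolean) {
  if (!featureEnabled) {
    return defaultHeaderSetter;
  }
  // A *new* function instance is created on every render when the flag is on.
  return (headers: Record<string, string>) => ({ ...headers, 'x-feature': 'on' });
}

// Stand-in for useDataApi: refetches whenever its setHeader argument changes.
function useDataApiSketch(
  setHeader: (h: Record<string, string>) => Record<string, string>
) {
  const [loading, setLoading] = useState(true);
  useEffect(() => {
    setLoading(true);
    fetch('/graphql', { method: 'POST', headers: setHeader({}) }).finally(() =>
      setLoading(false)
    );
    // setHeader is an effect dependency: a fresh instance each render re-runs the
    // fetch, toggles `loading`, triggers a re-render, which creates yet another
    // instance... so loading oscillates between true and false indefinitely.
  }, [setHeader]);
  return loading;
}

function PageSketch({ featureEnabled }: { featureEnabled: boolean }) {
  const loading = useDataApiSketch(makeHeaderSetter(featureEnabled));
  return loading ? <span>Loading…</span> : <span>Loaded</span>;
}
```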
If this incident is related to the issues seen by those few users, maybe there's something odd in the combination of their account settings and the data they're requesting that triggers a similar hook-dependency-loop edge case?
Wrapping my function parameter in a `useCallback` to cache it stops my local setup from looping.
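A sketch of that fix, reusing the illustrative `useDataApiSketch` from the sketch above: memoising the lambda gives the hook a stable reference, so the effect only re-runs when the flag itself changes.

```tsx
import { useCallback } from 'react';

function PageSketchFixed({ featureEnabled }: { featureEnabled: boolean }) {
  // Memoised: the same function instance is reused across renders until
  // featureEnabled changes, so the effect inside the data hook stays quiet.
  const setHeader = useCallback(
    (headers: Record<string, string>) =>
      featureEnabled ? { ...headers, 'x-feature': 'on' } : headers,
    [featureEnabled]
  );
  const loading = useDataApiSketch(setHeader);
  return loading ? <span>Loading…</span> : <span>Loaded</span>;
}
```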
I've gone back through our auth service logs and cross-referenced login times for the user numbers linked to the IP with the usage spikes. It seems like 1153026 was behind the spikes on the 14th.
On the 15th, 1153026 logged in as the first spike started at 7:32, and 1130797 logged in as the 2nd started at 8:11.
So it seems none of the usage spikes are attributable to 1151947, who, as the only one we had identified from that IP at the time, was the user we contacted to check their clock.
API usage charts for the October 14th and 15th spikes:
Checking the container logs for the output of the `addClientLog` calls, they seem to mostly be UOWS errors and don't quite line up with the login times, so they might just be failing for other users due to the higher load?
In case it does turn out to be a system clock thing, I've found a very similar-looking time check in `UserContextProvider`.
What is the problem and why is it a problem
One of the ISIS FAP secretaries (user number 1053132) was unable to access the system in the FAP member role. They defaulted to instrument scientist, and attempting to switch caused them to see an empty page with the "Server rejected user credentials" error notification.
It seems that this caused their browser to attempt to reload the page's content indefinitely: we saw enormous spikes in GraphQL traffic whenever they attempted access, so much so that it caused a tenfold increase in the number of calls made to our API compared to the previous day, and triggered alerts from our traffic monitoring job about a suspected denial-of-service attack.
Due to the "server rejected user credentials" message, we determined that something must be going wrong in the `getToken` mutation, which was by far the most-called query that day. On realising just how much access to localStorage takes place in the role-switching code, we asked them to try using an incognito window, which let them log in and use the system normally. The working theory is that some old value had been persisted that caused the errors. I've asked them to get in contact for a debugging session with us before they clear out their local data for the domain, so we can try to isolate the problematic value.
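If we get that debugging session, a small console helper along these lines (not part of the codebase; the key name below is taken from the localStorage table above) would let us capture everything before removing suspect keys one at a time:

```typescript
// Dump every localStorage entry for the domain so it can be saved before we
// start deleting values. Run from the browser dev-tools console.
function dumpLocalStorage(): Record<string, string | null> {
  const entries: Record<string, string | null> = {};
  for (let i = 0; i < localStorage.length; i++) {
    const key = localStorage.key(i);
    if (key !== null) {
      entries[key] = localStorage.getItem(key);
    }
  }
  return entries;
}

console.table(dumpLocalStorage());
// Then remove one candidate at a time and retry the role switch, e.g.:
// localStorage.removeItem('currentRole');
```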
Steps to reproduce (if it's a bug).