EionRobb / purple-teams

A MS Teams plugin for libpurple/Pidgin (3rd party client)
GNU General Public License v3.0
158 stars 11 forks source link

High CPU usage on request failures #55

Open yourealwaysbe opened 2 months ago

yourealwaysbe commented 2 months ago

Thanks for developing this plugin.

I noticed high CPU usage today (50-90%). The debug logs were showing repeated request failures. I tracked these down to failed attempts to get the buddy icon, and a failed teams_poll in teams_messages.c.

I bodgily commented out the buddy icon fetch and the call to teams_poll. This fixed the failed requests cycles and the high CPU usage.

I don't mind about the buddy icons (i'm using bitlbee). The teams_poll looks kind of important, but things seem to be working ok with out it.

I'm not sure why the requests are failing or exactly what the latter request is for, so don't know the best way to approach a fix. It would be good to avoid the CPU burn if possible though.

EionRobb commented 2 months ago

Oh! That's not good! Definitely good to not use CPU.

Was it just a temporary high-cpu usage or was it fairly consistent? Would you be able to share the debug logs?

I've been wanting to get rid of the teams_poll for a while, just needed the trouter websocket option to be more reliable first. I've probably been a bit over-cautious with not wanting to miss any messages, but then it causes issues like #51 when messages are coming in through both the poll and the socket. I'll tweak it to only poll while the websocket isn't connected.

It doesn't make sense for bitlbee to be trying to fetch buddy icons, so I'll do some cheeky UI detection to skip if on bitlbee. Ideally it'd not chew up CPU for everyone though. Is it a fairly large buddy list, or were the debug logs showing that it was stuck in a loop?

yourealwaysbe commented 2 months ago

Was it just a temporary high-cpu usage or was it fairly consistent? Would you be able to share the debug logs?

Very consistent -- across several recompiles of bitlbee and purple-teams and reboots. It would ramp up within 10s of connecting the teams account, and stop when it was disconnected. The attached log has a load of certificate trusting, but the main failure i noticed watching the logs was the failed request to francecentral.

Tbh, i find connecting to Teams to be a bit weird in general, so it's possible it's just a connection error that will pass by tomorrow. (E.g. sometimes after 5pm i can't access the Teams calendar from FireFox, but can from Chromium.) It may also be some v1/v2 mix up as i get the "your organisation has switched to v2" message every month or so and seem to go through the setup all over again. The purple-teams plugin has worked well through all of this however.

logs.log

Is it a fairly large buddy list, or were the debug logs showing that it was stuck in a loop?

For the buddy icons, this came up (or became apparent) after disabling polling. It was producing lots of requests along the lines:

Jul 02 16:28:01 mattthink bitlbee[6139]: DEBUG teams: getting new buddy icon for orgid: Jul 02 16:28:01 mattthink bitlbee[6139]: DEBUG http: Performing new request 0x62c8e5a937c0 to teams.microsoft.com. Jul 02 16:28:01 mattthink bitlbee[6139]: DEBUG http: Request 0x62c8e61a3af0 performed without success.

It's a fairly long buddy list (perhaps everyone at my work) but i tested and the same IDs would come around again eventually (i stopped tracking after 4 repeats).

EionRobb commented 2 months ago

Added the buddy icon skipping for bitlbee in https://github.com/EionRobb/purple-teams/commit/d5bf8385b8de005dffaacfb23583fae540f34f93

yourealwaysbe commented 2 months ago

Great, this seems to work. The CPU usage dropped to 5-15% in the new version. Commenting out teams_poll gets the rest of the CPU usage down to trivial.

The connection problems continue in the office as well as from home (and on a different computer). This leads to the loop trying to connect to the endpoint (failing for whatever reason). It's not clear to me how the retry is triggered in the code -- i was hoping to put a delay on the retry.

EionRobb commented 2 months ago

Yeah, still trying to work out how to slot in the "don't use http-polling if websocket is up" bit without breaking a lot of the rules in the code that try to ensure the http-polling is always running. There's a bit of spooky-action-at-a-distance kind of level of stuff happening in the retry/reconnection logic that tries to keep the connection alive (which I'm not very happy about) which is probably the bit you're running into too. The old skypeweb system (that Teams is based off) used to not cope with any kind of delay, otherwise it could miss messages, which is why that exists the way it does.

Maybe disabling the http polling completely is the only way forwards. Just a bit nervous that the websockets aren't as battle-tested as polling is, but I guess there's only one way to find out...

EionRobb commented 2 months ago

Ok.. maybe one-point-five ways

I've added in a setting "only_use_websocket" (off by default) that skips over the teams_poll() code. Then us foolhardy pioneers can test whether there's any long-term effects of websocket exposure.

yourealwaysbe commented 2 months ago

Thanks -- that's better than my current hack of commenting it out :)

I'm also curios why the connection is failing -- i have nothing against doing the polling and it's a bit disturbing it can't get through to the right URL anymore.

How were the URLs determined? Perhaps i can poke around to see if they should be slightly different on my end for some reason.

EionRobb commented 2 months ago

The URL comes from the server, and we store in messages_host after we subscribe to the event stream https://github.com/EionRobb/purple-teams/blob/master/teams_messages.c#L1946

Would be good to understand if they're different, because that kinda sounds like what was going on in https://github.com/EionRobb/purple-teams/issues/53#issuecomment-2026175543 - maybe compare vs what the browser does (can use mitmproxy or Charles to sniff the traffic and compare)

yourealwaysbe commented 1 month ago

Thanks. I added a debug to see what the response from francecentral was. Looks like it's forbidden:

DEBUG teams: executing callback for https://francecentral-prod.notifications.teams.microsoft.com/users/8:orgid.../endpoints/.../events/poll?cursor=1720427450&epfs=srt with data '{"code":"Forbidden","subcode":"None","message":"Request from this platform and client version not allowed"}'

yourealwaysbe commented 1 month ago

From snooping through the web client logs, i changed TEAMS_CLIENTINFO_VERSION to 1415/24053101421 and the polling seems to work again.

scottbender commented 1 month ago

Just popping in to say that only_use_websocket reduced my cpu from 60-70% to almost nothing...