MicrosoftDocs / msteams-docs

Source for the Microsoft Teams developer platform documentation.
https://aka.ms/teamsdev
Creative Commons Attribution 4.0 International

Random 401 messages in the replies received on inbound Webhook Teams channel notification #9698

Open nssecrier opened 10 months ago

nssecrier commented 10 months ago

Hello,

We have started to receive random 401 (Unauthorized) errors inside the responses to messages posted to inbound Webhook Teams channels. This was working fine earlier this month, but we started seeing these errors on Friday, with a significant increase yesterday (Mon 16th Oct). The behaviour has been noted in different Teams channels and in different MS organisations (our customers). I have sanitized some of the content (keys, tokens, etc.) but am happy to share the full message privately.

See below the bad message:

15:27:03.479 [http-nio-8050-exec-1] [] [] [] INFO i.v.n.g.teams.api.TeamsStrategy - Teams post notification response: <200 OK OK,Microsoft.Exchange.Security.TokenIssuer.Common.SubstrateTokenRequestException: Request db3754d9-2f88-42cf-a201-e9131f4e3251;1697470023;1.0;48af08dc-f6d2-435f-b2a7-069abd99c086;https://outlook.office.com;;[],ea80952ea47642d4aaf45457852b0f7e,48af08dc-f6d2-435f-b2a7-069abd99c086,;GlobalActor;{TokenScopeType,CloudInstance},{AccessType,HpaAsApp},{PopKey,[sanitised]to https://localhost:444/sts/token/1 on server AS4PR07MB8684 failed with status code 401(Unauthorized) and reason {Message:STI server AS4PR07MB8684 failed to process request. Error: Request [sanitised];https://outlook.office.com;;[],[sanitised],;GlobalActor;{TokenScopeType,CloudInstance},{AccessType,HpaAsApp},{PopKey,{[sanitised]\ is unauthorized, the public key warqvR45thDtXFmf0RCvmtXXdLU= is not found for ring WW, cannot validate signature}. The response headers are: Date:Mon, 16 Oct 2023 15:27:02 GMT;Server:Microsoft-HTTPAPI/2.0.,[alt-svc:h3=:443,h3-29=:443, cache-control:no-cache, content-length:2020, content-type:text/plain; charset=utf-8, date:Mon, 16 Oct 2023 15:27:03 GMT, expires:-1, ms-cv:F1nlD18PvVdkFcucJ+TzOA.1.1, pragma:no-cache, request-id:0fe55917-0f5f-57bd-6415-cb9c27e4f338, server:Microsoft-HTTPAPI/2.0, strict-transport-security:max-age=31536000; includeSubDomains, x-aspnet-version:4.0.30319, x-backendhttpstatus:200,200, x-bepartition:CLEURPRD07AMS10, x-beserver:AS4PR07MB8684, x-beservicestate:Orphaned, x-cafeserver:AS4P192CA0048.EURP192.PROD.OUTLOOK.COM, x-calculatedbetarget:AM8PR07MB8248.eurprd07.prod.outlook.com,AM8PR07MB8248.eurprd07.prod.outlook.com, x-feefzinfo:AMS, x-feproxyinfo:AS4P192CA0048, x-feserver:DUZPR01CA0010, x-firsthopcafeefz:DUB, x-nanoproxy:1,1, x-proxy-backendserverstatus:200, x-proxy-routingcorrectness:1]>

This is causing us significant issues: our customers use the MS Teams notifications, and these are not being delivered to them even though our application delivers each notification to the Microsoft infrastructure. We do not know why we are seeing these errors, and we cannot fully understand the response format. Could you please point us to something that explains what counts as 100% OK and what does not, given that the replies start with 200 OK OK?

Thank you

PS: This is what a good response from Microsoft looks like, from the same date and the same webhook channel:

12:29:00.177 [http-nio-8050-exec-1] [] [] [] INFO i.v.n.g.teams.api.TeamsStrategy - Teams post notification response: <200 OK OK,1,[alt-svc:h3=:443,h3-29=:443, content-length:1, content-type:text/plain; charset=utf-8, date:Mon, 16 Oct 2023 12:28:59 GMT, ms-cv:j4T47zVwMmSIjJZfNSrHEg.1.1, request-id:eff8848f-7035-6432-888c-965f352ac712, server:Microsoft-HTTPAPI/2.0, strict-transport-security:max-age=31536000; includeSubDomains, x-backendhttpstatus:200,200, x-bepartition:CLEURPRD07AMS10, x-calculatedbetarget:AM8PR07MB8248.eurprd07.prod.outlook.com,AM8PR07MB8248.eurprd07.prod.outlook.com, x-end2endlatencyms:1855, x-feefzinfo:AMS, x-feproxyinfo:AS4P190CA0045, x-feserver:DUZPR01CA0035, x-firsthopcafeefz:DUB, x-nanoproxy:1,1, x-proxy-backendserverstatus:200, x-proxy-routingcorrectness:1]>

mrydz commented 10 months ago

Others reporting the same issue at https://answers.microsoft.com/en-us/msteams/forum/all/teams-webhook-error-despite-200-ok/920464d2-dd9c-4fae-847a-99855f6498f2 and https://learn.microsoft.com/en-us/answers/questions/1393552/teams-webhook-suddenly-stopped-working

Meghana-MSFT commented 10 months ago

Yes, this issue has been reported in other forums as well. There is an ongoing issue in Exchange Connectors, and an incident has been raised for this error: "Web Hooks erroring randomly, unauthorized, the public key XXXXXXXXXXXXXXXXX= is not found for ring WW, cannot validate signature". We will keep you posted on updates. Thank you.

Meghana-MSFT commented 10 months ago

The issue has been mitigated. Could you please confirm whether it is working fine now?

nssecrier commented 10 months ago

> The issue has been mitigated. Could you please confirm whether it is working fine now?

Hello, the issue seems to have stopped on the 16th in our case, but the main problem remains for us. We do not know the structure of the response message, and based on the current documentation it is difficult to parse it and to decide when we should retry a message. Beyond the retry mechanism for rate limiting (429 errors), we do not know what other errors might occur. Can you please share additional details on the response structure and the best way to identify what is good and what is bad? I am sure everyone following this would find it beneficial, especially given the transient issues reported on the forums.

Thanks

Meghana-MSFT commented 10 months ago

@nssecrier - Apologies for the inconvenience caused. We have shared this feedback with the engineering team.

Meghana-MSFT commented 10 months ago

We received the update below from the engineering team:

There is history behind returning 200 for errors. Many third-party (3P) systems disable a webhook if they receive any non-200 status code, and the person who set up the webhook in the 3P system is typically different from the users who rely on the connector for messaging. That makes it hard to update the webhook URL if a new one has to be provisioned after the 3P system disabled the previous one. So O365 Connectors return 200 for errors they consider retry-able. A response can be considered successful only if its body is a number; otherwise it is a failure, even if the status is 200. The number corresponds to the number of messages sent, but in most cases it is just 1, which indicates a successful response.
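
The rule described in the engineering update could be applied on the client side roughly as follows. This is only a sketch, not an official Microsoft implementation: the names `Outcome` and `classify_webhook_response` are hypothetical, retrying on 5xx is an assumption (the thread only mentions 429 for rate limiting), and treating a body of `0` as "not delivered" is also an assumption, since the update only says the number reflects the messages sent.

```python
from enum import Enum


class Outcome(Enum):
    DELIVERED = "delivered"  # 200 and the body is a positive message count
    RETRY = "retry"          # worth sending again later
    FAILED = "failed"        # retrying is unlikely to help


def classify_webhook_response(status_code: int, body: str) -> Outcome:
    """Classify a connector webhook response (hypothetical helper).

    Per the engineering update, a 200 is a success only when the body is
    a number; a 200 carrying error text is what the connector considers
    retry-able.
    """
    text = body.strip()
    if status_code == 200:
        # Assumption: a count of 0 is not treated as a delivery.
        if text.isascii() and text.isdigit() and int(text) > 0:
            return Outcome.DELIVERED
        return Outcome.RETRY
    # 429 is the documented rate-limiting case; 5xx is an assumption.
    if status_code == 429 or 500 <= status_code < 600:
        return Outcome.RETRY
    return Outcome.FAILED


if __name__ == "__main__":
    # Bodies modelled on the two log lines quoted earlier in this thread.
    print(classify_webhook_response(200, "1"))
    print(classify_webhook_response(
        200,
        "Microsoft.Exchange.Security.TokenIssuer.Common."
        "SubstrateTokenRequestException: ... failed with status code "
        "401(Unauthorized)",
    ))
```

With this shape, the caller logs and retries (ideally with a backoff) on `Outcome.RETRY` instead of trusting the 200 status alone, which is the case that went unnoticed during the incident.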