SharePoint / sp-dev-docs

SharePoint & Viva Connections Developer Documentation
https://docs.microsoft.com/en-us/sharepoint/dev/
Creative Commons Attribution 4.0 International

Random 500 internal server errors with different APIs and operations at SharePoint Online #4924

Closed advdberg closed 4 years ago

advdberg commented 5 years ago

Category

[X] Bug [ ] Enhancement

Environment

[X] Office 365 / SharePoint Online [ ] SharePoint 2016 [ ] SharePoint 2013

Expected or Desired Behavior

Template is applied without errors (or at least with detailed errors)

Observed Behavior

We get an intermittent 500 server error when applying a template, with the following trace log:

powershell.exe Information: 0 : 2019-10-08 14:40:40.6587    [OfficeDevPnP.Core]    [0]    [Information]    Adding field (382aae1d-7054-4b5d-85ca-ead7dfdd96f0) to content type (0x01010043443884AE06CF4D9F9C2D75C47BF75A).    0ms   
powershell.exe Information: 0 : 2019-10-08 14:40:42.7683    [OfficeDevPnP.Core]    [0]    [Information]    Adding field (eea8f2b4-9508-46a0-9f42-69008561a746) to content type (0x01010043443884AE06CF4D9F9C2D75C47BF75A).    0ms   
powershell.exe Error: 0 : 2019-10-08 14:40:42.8361    [OfficeDevPnP.Core]    [0]    [Error]    ExecuteQuery threw following exception: System.Net.WebException: The remote server returned an error: (500) Internal Server Error.
   at System.Net.HttpWebRequest.EndGetResponse(IAsyncResult asyncResult)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.SharePoint.Client.SPWebRequestExecutor.<ExecuteAsync>d__0.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.SharePoint.Client.ClientRequest.<ExecuteQueryToServerAsync>d__6.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.SharePoint.Client.ClientRequest.<ExecuteQueryAsync>d__0.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.SharePoint.Client.ClientRuntimeContext.<ExecuteQueryAsync>d__0.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.SharePoint.Client.ClientContext.<ExecuteQueryAsync>d__4.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.SharePoint.Client.ClientContextExtensions.<ExecuteQueryImplementation>d__7.MoveNext().    0ms   
powershell.exe Information: 0 : 2019-10-08 14:40:42.8371    [Content Types]    [15]    [Debug]    Code execution scope ended    115033ms    57b414c0-2d01-471d-b956-b5bab9c3667f
powershell.exe Information: 0 : 2019-10-08 14:40:42.8371    [Provisioning]    [15]    [Debug]    Code execution scope ended    168283ms    57b414c0-2d01-471d-b956-b5bab9c3667f
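
For anyone who needs to pull the failing entries out of a long trace like the one above, here is a minimal sketch of a parser. It is written in Python purely for illustration (the thread's own tooling is PowerShell/.NET). The line layout is inferred only from the lines shown in this report, so the regular expression is an assumption rather than a documented PnP log format. `LINE_RE`, `parse_trace`, and `SAMPLE` are hypothetical names, and the error message in `SAMPLE` is abbreviated.

```python
import re

# Layout inferred from the trace above (an assumption, not a documented format):
# <process> <TraceLevel>: <id> : <timestamp>    [source]    [thread]    [level]    <message>
LINE_RE = re.compile(
    r"^(?P<process>\S+) (?P<trace>\w+): \d+ : "
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"\[(?P<source>[^\]]+)\]\s+\[(?P<thread>\d+)\]\s+\[(?P<level>\w+)\]\s+"
    r"(?P<message>.*)$"
)


def parse_trace(text):
    """Split a trace log into entries; stack-trace lines join the entry above them."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            entry = match.groupdict()
            entry["message"] = entry["message"].strip()
            entries.append(entry)
        elif entries and line.strip():
            # A line that does not start a new entry is treated as a continuation.
            entries[-1]["message"] += "\n" + line.strip()
    return entries


# Sample lines taken from the report above; the error message is abbreviated.
SAMPLE = "\n".join([
    "powershell.exe Information: 0 : 2019-10-08 14:40:42.7683    [OfficeDevPnP.Core]    [0]    [Information]    Adding field (eea8f2b4-9508-46a0-9f42-69008561a746) to content type (0x01010043443884AE06CF4D9F9C2D75C47BF75A).    0ms",
    "powershell.exe Error: 0 : 2019-10-08 14:40:42.8361    [OfficeDevPnP.Core]    [0]    [Error]    ExecuteQuery threw following exception: System.Net.WebException: (500)",
    "   at System.Net.HttpWebRequest.EndGetResponse(IAsyncResult asyncResult)",
    "powershell.exe Information: 0 : 2019-10-08 14:40:42.8371    [Content Types]    [15]    [Debug]    Code execution scope ended    115033ms    57b414c0-2d01-471d-b956-b5bab9c3667f",
])

if __name__ == "__main__":
    for entry in parse_trace(SAMPLE):
        if entry["level"] == "Error":
            print(entry["timestamp"], entry["message"].splitlines()[0])
```

Filtering on the bracketed level rather than on the leading `Error:` token keeps the `[Debug]` entries, which the trace listener reports as `Information`, separate from real informational messages.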

Steps to Reproduce

Execute the following PowerShell: Apply-PnPProvisioningTemplate -Path "C:\temp\SiteTemplate(1).xml" -Verbose

bryqu commented 4 years ago

Same issue for me. Frantically waiting for MS to resolve this!

willemdeboer commented 4 years ago

Same issue for me and our customers. Random 500 internal server errors are thrown during provisioning with CSOM. We create a modern team site and then start provisioning with CSOM. We don't use PnP Provisioning ourselves, although we support it as an additional step. Since November 14th, provisioning has randomly failed while adding lists, content types, or other artifacts to the site.
BTW: Our customers who still use classic teamsites don't seem to have these issues.

advdberg commented 4 years ago

Advisory is now upgraded to 'Incident' in admin center Service Health with following status report:

Current status: We've identified a potential service update which may be the source of the impact. We're reviewing the specific configuration and code changes within the update to isolate the cause of the issue.

So fingers crossed it will be fixed and rolled out before the weekend, so that we have time to fix all failed provisioning jobs. Happily we're in SP Online, so we don't have to wait for the release of a Service Pack to get it fixed ;-)

VesaJuvonen commented 4 years ago

This issue is still being actively worked on. No ETA at this point.

Thanks to everyone who has already reported the issue from their environment, as that helps to speed up the investigation process.

jpalo commented 4 years ago

Opened level A Premier ticket, not that I think they could help me specifically, but will update here if there's something new.

In addition to API, we're also seeing issues on OOB list item editing functionality, not a big surprise IMHO considering issue is somewhere deeper.

cwdata commented 4 years ago

We had these issues in our Web API using the SharePointOnline CSOM libraries. Upgrading to the latest stable version of PnP CSOM Extensions and the Microsoft CSOM library through Nuget seems to have fixed it. Maybe that'll work for others as well.

jackpoz commented 4 years ago

I'm getting the HTTP 500 using SharePointPnPCoreOnline NuGet package version 3.15.1911 on a demo tenant created from https://cdx.transform.microsoft.com/ (but it doesn't happen always)

MikkoKoskinen commented 4 years ago

We also have a Premier ticket open regarding this, and we sent some log files and a PnP template just in case. Thank you for keeping us informed of the progress, @VesaJuvonen. It helps us communicate this better to our clients and organization. Not an ideal situation, but I'm sure you will find a solution.

TOPDHI commented 4 years ago

We're on latest NuGet SharePointPnPCoreOnline ver. 3.15.1911 and Microsoft.SharePointOnline.CSOM 16.1.19404.12000 and still having random issues @cwdata

h3rd4 commented 4 years ago

We're on latest NuGet SharePointPnPCoreOnline ver. 3.15.1911 and Microsoft.SharePointOnline.CSOM 16.1.19404.12000 and still having random issues @cwdata

The issue is probably on the server side; the CSOM version doesn't matter.

jpalo commented 4 years ago

Issues are not related to package versions, no need to continue that discussion 😐

Premier ticket went as suspected, and we lowered the severity as it is already being investigated. Did send some of our error messages and they added our tenant to the list of affected tenants in the internal ticket of the issue.

sandeepvootoori commented 4 years ago

Same issue here. Is it still good idea to create a MS ticket?

SchauDK commented 4 years ago

Current status: We've isolated down the probable causes to a few changes that were recently made to the SharePoint Online service. We're continuing our investigation to confirm our findings and develop a mitigation plan.

This definitely got worse starting a few days ago, however we started seeing this more than two months ago. Hence implemented our own retry logic within the OfficeDevPnP framework (ExecuteQueryRetry).

Maybe there's more to it than "a recent change". Maybe Microsoft can elaborate on what the recent change is??

SchauDK commented 4 years ago

Issues are not related to package versions, no need to continue that discussion 😐

Premier ticket went as suspected, and we lowered the severity as it is already being investigated. Did send some of our error messages and they added our tenant to the list of affected tenants in the internal ticket of the issue.

If they want a list, I can add 100+ tenants!!!! ;-)

MikkoKoskinen commented 4 years ago

This definitely got worse starting a few days ago, however we started seeing this more than two months ago.

I agree. I've noticed the same strange behavior in provisioning during the last two months as well. I just wasn't sure where the error came from, and retrying helped. Until now, at least.

richardb52 commented 4 years ago

Current status: We've isolated down the probable causes to a few changes that were recently made to the SharePoint Online service. We're continuing our investigation to confirm our findings and develop a mitigation plan.

This definitely got worse starting a few days ago, however we started seeing this more than two months ago. Hence implemented our own retry logic within the OfficeDevPnP framework (ExecuteQueryRetry).

Maybe there's more to it than "a recent change". Maybe Microsoft can elaborate on what the recent change is??

Agreed, we first hit this issue September 23rd. Definitely occurring a lot more now though.

sandeepvootoori commented 4 years ago

This definitely got worse starting a few days ago, however we started seeing this more than two months ago. Hence implemented our own retry logic within the OfficeDevPnP framework (ExecuteQueryRetry).

I agree as well. We have been seeing this error for a couple of months. We were never able to reproduce it on retry.

sandeepvootoori commented 4 years ago

Am I the only one feeling like this is just getting worse?

DaniCorretja commented 4 years ago

Am I the only one feeling like this is just getting worse?

I feel you, bro =(

Would it help to open another ticket to Microsoft? I have the feeling they should have received hundreds because this is huge.

VesaJuvonen commented 4 years ago

Every single ticket has an impact. Please do not assume that your input or feedback would not have value for Microsoft as all of them do have direct influence on the following actions.

We are still actively working on this, but we would ask everyone affected by this issue to report it through Premier Support or the standard tenant admin support channel, as each and every submission helps in getting things resolved.

We do apologize for the inconvenience, but please do keep on reporting the issue if you are experiencing it. Thank you.

VesaJuvonen commented 4 years ago

Also - if any ISVs do have numerous tenants experiencing the issue, please do use Premier Support or tenant admin support channels to report that. Thank you.

AndyBolam commented 4 years ago

@SandeepVo @DaniCorretja We noticed that some PnP Provisioning tasks we were running locally against our tenant in September were failing, but as we were having ISP issues at the time we thought the server issues were being caused by our connection. As you both mention, running them again fixed the issue so at the time we were unable to diagnose this properly.

Around the same time in September we were updating some of our Azure Runbooks that were also using the PnP Provisioning Engine/Automation modules, and they didn't seem to be failing, but the modules and engine fell over if they were run locally. Again it was the same intermittent issue, so we added a retry to the Runbook and this seemed OK going forward.

What alerted us to this incident recently was a similar solution we'd built for a client, who suddenly was unable to create any new sites - we then figured it was the same 500 error we had briefly seen a couple of months ago, but this time it was failing on every run.

So it would seem this is something that was possibly affected a while back by other changes, but then got really bad a few days ago due to another change?

Anyway, for the first time in a few days, I just ran a successful provisioning task locally using PowerShell, all seems fine. Just need to check on the Azure Automation side for any issues. Will report back if I still see any issues (here and to support).

ghost commented 4 years ago

It looks like this affects the whole O365 platform rather than only SharePoint. Yesterday there were 2 incidents in the health center; today there is only the one for Office 365, but the problem is much wider than before.

AndyBolam commented 4 years ago

Update - unfortunately when I run the PnP Provisioning via Azure Runbook in my client's tenant this is still failing. Presume that this is an ongoing thing so will check again later today.

VesaJuvonen commented 4 years ago

A potential fix was applied a few hours ago, so we are curious to hear the status through this channel as well. Is the situation any better for your environments now?

Thank you in advance for the status updates.

SchauDK commented 4 years ago

No change here.

ghost commented 4 years ago

This morning, for the first time, a Get-PnPProvisioningTemplate and apply succeeded, but the second run still ended in an error.

VesaJuvonen commented 4 years ago

thx @SchauDK - let's follow up the situation for upcoming hours. It also might be that the fix has not yet been properly applied to your tenant, but it should be in progress. We are getting good messages from some customers, so looking positive for now.

jpalo commented 4 years ago

Seems to be working in at least one of our cases with rather heavy API usage.

ChrisOMetz commented 4 years ago

@VesaJuvonen - You mentioned that it might take some time before this fix is applied to all tenants. Is there any way to check whether the fix has already been applied to a specific tenant? It seems to work in one of my tenants but, for example, not in any dev tenants.

VesaJuvonen commented 4 years ago

@ChrisOMetz - unfortunately there's no way to check at the tenant level whether the fix has already been applied. It should be deployed/enabled worldwide within the next 4 hours, which should then remove the unnecessary 500s. You can still get exceptions such as 429s or 503s, which are throttling related, but if your code is CSOM and uses the ExecuteQueryRetry method, it will handle these situations automatically.
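
As a rough illustration of the retry approach discussed in this thread, here is a minimal, language-agnostic sketch in Python. It is not the PnP `ExecuteQueryRetry` implementation. `HttpError`, `execute_with_retry`, and the parameter names are hypothetical. Treating 500 as retryable is an assumption based on reports here that retrying helped; 429 and 503 are the throttling-related codes mentioned above. Waiting for a server-supplied delay when one is present, and backing off exponentially otherwise, is a common pattern rather than something confirmed in this thread.

```python
import time

# 429 and 503 are the throttling-related codes mentioned above; including 500
# is an assumption based on reports in this thread that retrying helped.
RETRYABLE_STATUS = {429, 500, 503}


class HttpError(Exception):
    """Hypothetical error type carrying an HTTP status and an optional wait hint."""

    def __init__(self, status, retry_after=None):
        super().__init__(f"HTTP {status}")
        self.status = status
        self.retry_after = retry_after  # seconds suggested by the server, if any


def execute_with_retry(operation, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on a retryable HttpError wait, then try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except HttpError as err:
            # Give up on non-retryable statuses or when attempts are exhausted.
            if err.status not in RETRYABLE_STATUS or attempt == max_attempts:
                raise
            if err.retry_after is not None:
                delay = err.retry_after  # prefer the server's hint
            else:
                delay = base_delay * 2 ** (attempt - 1)  # exponential backoff
            sleep(delay)
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. In practice the wrapped operation would be a single idempotent request, because retrying a step that has already partly succeeded on the server can create duplicates.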

joelfmrodrigues commented 4 years ago

Seems fine for us now 😃

MikkoKoskinen commented 4 years ago

In general, everything seems to be working for us. I still see some 500 errors, but our code was handling those situations anyway. I'll give it a couple more hours and retest.

advdberg commented 4 years ago

Started running some load tests on several 'internal' tenants, so far so good! Thanks for sharing the early status @VesaJuvonen, we will proceed with tests also on customer tenants.

AndyBolam commented 4 years ago

Am seeing

Exception System.IO.IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host

in my Azure Runbooks (which are running PnP PowerShell scripts).

Any chance this is the same issue? Should I be logging this separately to the issue already known?

VesaJuvonen commented 4 years ago

@AndyBolam - this does not look like the same issue.

sandeepvootoori commented 4 years ago

Am seeing

Exception System.IO.IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host

in my Azure Runbooks (which are running PnP PowerShell scripts).

Any chance this is the same issue? Should I be logging this separately to the issue already known?

Weird, but I have also been seeing this issue intermittently on our tenant.

VesaJuvonen commented 4 years ago

thx @SandeepVo - This issue is about the 500 exception on the CSOM APIs, so let's not combine multiple different issues in a single item. thx.

SchauDK commented 4 years ago

I guess the fix isn't globally deployed yet. I'll check again later.

PhilThome commented 4 years ago

It's better than yesterday, but we still encounter the error in about 50% of our provisioning attempts.

h3rd4 commented 4 years ago

For our dev and prod tenants it seems to work now: 3 provisioning requests within 1 hour, and all were processed successfully.

TOPDHI commented 4 years ago

In our tenant we believe that it is fixed. No 500 error since noon. Jupiiii.... Thank you @VesaJuvonen, we can celebrate the fix at ESPC 2019 in Prague in 2 weeks 👍

richardb52 commented 4 years ago

Looks to be resolved for us, thank you!

frnk01 commented 4 years ago

@VesaJuvonen Good to hear it's fixed, but I wonder how a serious issue like this can run in production for over a week without Microsoft noticing it until the community gave feedback. Major lack of monitoring on the API?

VesaJuvonen commented 4 years ago

@frnk01 - Just to be clear here: Microsoft engineering had already detected and acknowledged this issue in the background. As we started to get reports through multiple social media channels and other forums, we also wanted to have public and open communication within this issue to provide more transparency and visibility on its progress.

This transparency helps all sides of the discussion, and we also wanted to encourage people to use the normal support channels to report the issue, as that's the preferred option for any future issue as well. Obviously, work was already being done in the background to address the root cause.

VesaJuvonen commented 4 years ago

Thanks, everyone, for your input on this issue, and we do apologize for the inconvenience this may have caused you. As the root cause of this particular issue has now been addressed, we will be closing this issue. We are working internally at Microsoft to minimize the possibility of similar issues in the future.

If you have any other issues which seem similar, please do open a new issue in here and open a Premier Support case, where suitable.

SchauDK commented 4 years ago

@VesaJuvonen As this is considered solved, does that mean that we shouldn't see the error at all or should we see less? I'm asking because we're still seeing it and I'm wondering how it will look on Monday when all our customers will start hitting SharePoint again.

ghost commented 4 years ago

So it looks like it was resolved last Saturday. In the evening I ran some local scripts against different tenants and some Azure Function scripts. So it seems to be over.

SchauDK commented 4 years ago

It's not over! We have 53 occurrences within the last hour.

PhilThome commented 4 years ago

1 of 4 provisionings still showed this error this morning.