Adoxio / xRM-Portals-Community-Edition

The definitive edition of Microsoft Open Source Portals, supported by the experts in portals.
MIT License

Web Farms and Load Balanced Servers #81

Closed jliberta closed 5 years ago

jliberta commented 5 years ago

Is there a way to configure cache invalidation across servers behind a load balancer (that is, when the portal is hosted on multiple front-end servers), or is this taken care of automatically by WebNotification.axd?

amervitz commented 5 years ago

Cache invalidation requests to each server will work via WebNotification.axd if each server has its own externally accessible URL.

If deploying to Azure web apps, I believe the invalidation should be handled automatically. I haven't looked into the relevant code in this project yet but in later versions of Adxstudio Portals there was a portal bus cache invalidation feature that would detect that the code was running in an Azure web app and upon receiving web notification requests, the website would then write files to the file system in the App_Data folder to propagate cache invalidation messages to other instances of the web app.
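
A minimal sketch of the file-drop idea described above, not the actual Adxstudio or portal code: each instance writes an invalidation message as a file into a shared folder, and the other instances poll that folder and evict the matching cache entries. The folder layout, the JSON message shape, and the names `FileBus`, `publish`, and `poll` are all assumptions made for illustration.

```python
import json
import os
import tempfile
import uuid


class FileBus:
    """Toy file-based invalidation bus for one web app instance (hypothetical design)."""

    def __init__(self, shared_dir, instance_id):
        self.shared_dir = shared_dir   # stands in for a shared App_Data folder
        self.instance_id = instance_id
        self.cache = {}                # in-memory cache: key -> value
        self.seen = set()              # message files already processed
        os.makedirs(shared_dir, exist_ok=True)

    def publish(self, key):
        """Evict locally, then drop a message file for the other instances."""
        self.cache.pop(key, None)
        name = f"{uuid.uuid4().hex}.json"
        tmp_path = os.path.join(self.shared_dir, name + ".tmp")
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump({"sender": self.instance_id, "key": key}, f)
        # Rename into place so readers never see a half-written file.
        os.replace(tmp_path, os.path.join(self.shared_dir, name))

    def poll(self):
        """Apply unseen messages from other instances; return the keys they named."""
        applied = []
        for name in sorted(os.listdir(self.shared_dir)):
            if not name.endswith(".json") or name in self.seen:
                continue
            self.seen.add(name)
            with open(os.path.join(self.shared_dir, name), encoding="utf-8") as f:
                message = json.load(f)
            if message["sender"] != self.instance_id:
                self.cache.pop(message["key"], None)
                applied.append(message["key"])
        return applied


def demo():
    with tempfile.TemporaryDirectory() as shared:
        a = FileBus(shared, "instance-a")
        b = FileBus(shared, "instance-b")
        a.cache["contact:42"] = "stale"
        b.cache["contact:42"] = "stale"
        a.publish("contact:42")   # e.g. instance A received the web notification
        applied = b.poll()        # instance B picks the message up from disk
        return applied, "contact:42" in a.cache, "contact:42" in b.cache
```

In this sketch only the instance that receives the notification needs to be reachable from outside; the others learn about the change from the shared folder. A real implementation would also need to clean up old message files, which is left out here.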

In the scenario of non-Azure deployments that are load balanced but each website is not externally accessible, at one point Adxstudio Portals had the ability of specifying internally accessible URLs to the websites in a configuration file so that the invalidation messages could be propagated. I will need to do some research to figure out how that was done and see whether that feature still exists and can be used.

jliberta commented 5 years ago

Right, we have begun testing it on Azure Web Apps because the cache invalidation is done in App_Data and shared across all instances.

What you are referring to can be found here https://community.adxstudio.com/products/adxstudio-portals/documentation/developers-guide/cache/cache-invalidation-in-a-load-balanced-network-web-farm/

The problem is that I cannot find any functionality for RemoteEndpointOrganizationServiceCache or an alternative, so if it's confirmed that there is no support, we will have to continue with Web Apps.

Thanks for your response.

amervitz commented 5 years ago

The closest thing I can find is the ServiceDefinitionPortalBusProvider; however, I don't see any code referring to it, and it isn't immediately apparent to me how it would be used or whether it would work.

jliberta commented 5 years ago

I observed the same on my end: there's no documentation on how it could be used, and it does not seem to reference a service definition file that attempts to invalidate the cache of other remote endpoints. I guess it is safe to say it's not supported and Web Apps are the desired route.

amervitz commented 5 years ago

I think so, time to embrace the cloud ☁️ 😃

amervitz commented 5 years ago

This topic has been documented in this wiki page:

https://github.com/Adoxio/xRM-Portals-Community-Edition/wiki/Web-Notifications-in-a-Load-Balanced-Environment

jayrodmcneil commented 5 years ago

Bah, that's too bad. I can attest that using multiple web notification URLs does indeed work, but the trouble is that the plugin waits for a response from each of the web servers before it completes its execution.

If one of those web servers is overloaded, the web notification plugin can take a long time to execute while it waits for a response or times out. I don't know exactly how long it waits.

The issue is that the asynchronous queue gets backed up because of this during busy periods when one of the web servers chokes. Then all web notifications are pretty close to useless and all other asynchronous operations are backed up. We had an issue recently where it took about 4 hours for web notifications to catch up once traffic died down on the portal (emails weren't sent for hours, web notifications weren't sent for hours, and portal server CPUs were maxed out and flooded with web notification requests).
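
To see why one slow server can stall the whole queue, here is a back-of-the-envelope model. All numbers are made up for illustration; the plugin's real timeout and the actual processing rates are not known from this thread. It assumes a single worker that handles notifications one at a time, with each notification taking as long as the slowest server's response.

```python
def backlog_after(arrivals_per_min, seconds_per_notification, busy_minutes):
    """Notifications still queued after a busy period, for one sequential worker."""
    processed_per_min = 60.0 / seconds_per_notification
    growth_per_min = max(0.0, arrivals_per_min - processed_per_min)
    return growth_per_min * busy_minutes


def drain_minutes(backlog, arrivals_per_min, seconds_per_notification):
    """Minutes needed to clear a backlog once traffic drops to arrivals_per_min."""
    processed_per_min = 60.0 / seconds_per_notification
    spare = processed_per_min - arrivals_per_min
    if spare <= 0:
        return float("inf")  # the queue never catches up
    return backlog / spare


# Hypothetical figures: 30 notifications/min arrive while a choking server makes
# each one take 10 s (6/min processed), for 60 minutes.
backlog = backlog_after(30, 10, 60)
# Afterwards traffic falls to 2/min and each notification takes 5 s (12/min).
recovery = drain_minutes(backlog, 2, 5)
```

With these made-up inputs, an hour of overload leaves 1,440 queued notifications and takes 144 minutes to drain, which has the same shape as the multi-hour catch-up described above.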

Uh oh...

Ideally the plugin would just throw the requests and ignore the responses so as not to get blocked. Or if it could send the request to a single endpoint which distributes the notification to other servers (I guess that's what adx v7 had...)
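
A rough sketch of the first idea, sending the requests without waiting for the responses. This is not the plugin's code (its source is not available); `notify_all`, the injected `send` function, and the example URLs are hypothetical. In practice `send` would be an HTTP POST with a short timeout; here it is faked so the behaviour can be shown without a network.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def notify_all(pool, urls, payload, send):
    """Queue one send per URL and return at once; responses are never awaited."""
    for url in urls:
        # Any exception raised by send stays inside the future and is ignored.
        pool.submit(send, url, payload)
    return len(urls)


def demo():
    release = threading.Event()
    delivered = []
    lock = threading.Lock()

    def fake_send(url, payload):
        if "slow" in url:
            release.wait(timeout=5)  # simulates an overloaded server
        with lock:
            delivered.append(url)

    urls = [
        "http://web1.example/WebNotification.axd",
        "http://slow.example/WebNotification.axd",
    ]
    with ThreadPoolExecutor(max_workers=4) as pool:
        queued = notify_all(pool, urls, {"entity": "contact"}, fake_send)
        with lock:
            # notify_all has already returned, yet the slow server is still busy.
            slow_pending = urls[1] not in delivered
        release.set()  # let the slow server finish before the pool shuts down
    return queued, slow_pending, sorted(delivered)
```

The trade-off is that a failed delivery goes unnoticed, so a server that misses a notification keeps serving stale data until its cache entry expires or is invalidated some other way.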

Any ideas welcomed :)

amervitz commented 5 years ago

Microsoft controls the distribution of the plugin and we can't make any changes to it as part of this project. They don't use the plugin in online portals so it's unlikely to get any updates.

The issue you describe could also happen with a single web server, it isn't a multi-instance issue.

This could be a good inflection point for your organization to start working towards using online portals where Microsoft rebuilt the entire cache invalidation system.

jliberta commented 5 years ago

Looks like we're not the only ones having this problem. We switched to Azure Web Services and now use 6 instances of P2V2 VMs, and we still have performance issues with web notifications. The queue backup problem might also occur here if the web notification gets sent to one busy instance, which then takes a while to process it before it gets placed in the portal bus. At this point I don't even think it's beneficial to go up to the next tier, P3V2 (4 cores vs 8 cores), since it seems certain types of operations cause CPU lock-ups regardless of how many cores you have... We might need to sacrifice some data freshness on the portal in order to reduce web notifications even more :(

godind commented 5 years ago

I think it should be shared and well advertised that the XCE version is much slower than v7 and that it has performance/scalability limitations. It's hard to say whether the problem is in the new plugin or the caching engine because the plugin source code has not been released, but it is, without a doubt, significantly slower. MS internal resources have unofficially acknowledged that they killed v8 and redesigned the latest online cache update system for this very reason.

What we think is happening (with Azure Web Services (AWS) type infrastructure) is this:

  1. WebNotification processing is slow (much slower than with v7).
  2. Queuing of WebNotifications strangles the receiving instance (100% or very high constant CPU usage) for a potentially very long period of time.
  3. Constant cache updates are sent via PortalBusProvider and propagate to all front-end servers.
  4. All front-end servers fall into a continuous cache update pattern, and cache updates impact end-user request response time. Servers sometimes become unresponsive while average CPU is at 50%.
  5. Occasionally this creates access contention on the shared PortalBus App_Data folder and possible read access issues.

What we have observed so far is that adding more than 4-5 instances does not provide any benefit. WebNotification volumes simply lock up the farm. CPU power helps speed up WebNotification processing time but has limits, so it won't scale beyond a certain point. It needs to be said.

This is after months of analysis, code optimization and WebNotification reduction work. The same solution would run flawlessly on 2 instances using v7.

Any way we can pressure MS to release the plugin code or provide a simple fix (as suggested above) to the plugin?

RicLund commented 5 years ago

@godind I agree with your sentiment - if it had been clear that this version performs significantly worse than v7 at scale, I think many people would have reconsidered using it.

@amervitz I suggest this is re-opened so we can work towards any improvements possible. Also better warnings should be added to the readme stating these major issues for new adopters (the current vague "ideally don't go here" stuff isn't enough). Also since you've mentioned it twice - yes, most of us want to use cloud solutions but you need to realize there are certain industry sectors and clients which CAN'T yet consider cloud for certain solutions, or have significant necessary delays in getting there - this is just a reality, not something we can ignore, unfortunately. Few of us go on-prem by choice, there are drivers. Had we known this version would have crippling scalability issues we might have made other choices altogether, probably still not cloud for some customers though.

godind commented 5 years ago

@amervitz I think we should keep this issue open. Maybe rename it to Performance issue/limitations.

amervitz commented 5 years ago

@riclund and @godind here is a draft for a disclaimers section, let me know if you think it needs to be adjusted.


Disclaimers

This project is licensed under the MIT license, which provides access to the source code free of charge and without warranty of any kind.

Adoxio has not performed a detailed audit or testing of the source code after its release by Microsoft.

This project only contains the source code for the portal web application and its dependent class libraries. The associated solutions were not included by Microsoft as part of the one-time open source release - as such we are unable to fix issues or make changes to behavior of components contained within the solutions, including any and all components such as schema, plugins, and web resources.

Some users have observed poor application performance and scalability with this codebase compared to their prior experience with Adxstudio Portals v7. Before using this code in a production setting it would be advisable to perform adequate testing to ensure it meets your performance needs.


@riclund and @godind this issue was a question about cache invalidation that has been answered. Please create one or more issues if there are specific items that you feel need to be addressed so that we can track them accurately. I would like to lock this conversation soon because this issue is diverging into other topics.

godind commented 5 years ago

@riclund to your suggestion

Ideally the plugin would just throw the requests and ignore the responses so as not to get blocked. Or if it could send the request to a single endpoint which distributes the notification to other servers (I guess that's what adx v7 had

This is possible with Azure Web Services and a PortalBusProvider type of deployment, which does internal cache invalidation between servers. The mechanism reduces average CPU usage and improves the farm's response time, but it does not really scale better. If WebNotifications flood the front ends, usually because you have a reasonable amount of traffic that triggers CRM backend processes and raises WebNotifications, you fall into the same pattern where the farm locks up. AWS with PortalBus is better, but it does not fix the problem.

jayrodmcneil commented 5 years ago

I think that it's important to highlight that the Web Notifications are only a part of the challenge, which yes is what's being discussed here in this thread. However even without Web Notification traffic we're experiencing significant performance issues as soon as user load reaches 50-100 concurrent users per IIS server (in our limited testing anyways). This is being discussed in issue #97

Cheers, Jason

jayrodmcneil commented 5 years ago

Hi all (adding @jliberta @godind and @amervitz in particular),

So we're now experimenting with Azure Web Apps to host the XRM portion communicating with on-premise CRM (through Hybrid Connection, which is working really well actually).

I can see the App_Data folder being used to propagate changes between the Web App instances when the changes are being made through the portal, which yes, would definitely cut down on the number of entities that require web notifications in our scenario.

However, we're having issues with entities that we are genuinely updating in Dynamics and whose changes we want reflected in the portal. We've configured the Web Notification URL in Dynamics to point to the Web App URL/Webnotification.axd and are getting a 200 OK response in the plugin trace log, and we can confirm using Application Insights/Azure Web App logs that we're receiving the POST request to Webnotification.axd. Unfortunately, it never seems to actually invalidate the cache for any of the records being changed in CRM, so the updated values are never reflected in the portal.

I've monitored the App_Data folder as well, and the changes don't appear to make their way into there at all. We've got the Web.config set up pretty much exactly as it's distributed with the Master Portal project. Is there some added configuration that needs to be done to enable this? I've looked at this: https://community.adxstudio.com/products/adxstudio-portals/documentation/developers-guide/azure/cache-invalidation-using-windows-azure-inter-role/ but I don't really think it's necessarily still relevant?

A side note: I did have to move my Azure Web App down to TLS 1.0, as our Sandbox Backend Services were only sending web notification traffic using 1.0. I'm planning to fix that at the server level and move the Azure Web Apps back up to 1.2.

Any help would be super appreciated

jliberta commented 5 years ago

Hey @jayrodmcneil ,

In general, Azure Web Apps does allow you to disable various entities that trigger web notifications, because of the portal bus feature that gets enabled when your portal is deployed. That being said, if you disabled certain entities in your WebNotification solution in CRM, then CRM record updates on those entities will not trigger a web notification to the portal. If you want to see entity updates from CRM reflected in the portal, make sure those entities are moved from the left pane to the right pane in the web notification solution.

The cache invalidation in the portal bus is a nice feature but a tricky one. Entity updates made in the portal will invalidate the in-memory cache of all instances of your web app, regardless of whether that entity is enabled or disabled in the CRM web notification solution. That invalidation is not caused by a CRM update; rather, it is a message that one app instance places in the bus to force the other instances not to use their cache for the next operation on the entity in question. If you want updates in the CRM -> Portal direction, you absolutely need to have web notifications enabled on these entities.

If this doesn't work I would recommend deleting your web notification URL and creating it again and then disable/enable/publish your web notification solution again and see if it helps!

Just remember that if you enable web notification on entities in CRM, when that entity is updated in the portal you will get an increase in cache invalidation messages, ex:

The portal updates an entity, then places a cache invalidation message in the bus for all instances to clear their cache. Because the entity in CRM was updated and web notifications are enabled, a web notification message will then be sent to all instances to invalidate their cache once again. As such, given how poorly this solution performs, if the CRM updates are not crucial in the portal and can wait a few hours before the cache is refreshed and the portal is updated, I would recommend leaving them disabled.
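
The doubling described above can be illustrated with a toy count. This models only the message flow as explained in this comment, not the portal's code; the function name and the assumption that each message triggers exactly one invalidation per instance are simplifications made for illustration.

```python
def invalidations_per_portal_update(instances, web_notifications_enabled):
    """Cache invalidations across the farm caused by one entity update made in the portal."""
    # The portal bus message from the updating instance reaches every instance.
    total = instances
    if web_notifications_enabled:
        # CRM also reports the same change back, invalidating every instance again.
        total += instances
    return total


six_disabled = invalidations_per_portal_update(6, False)
six_enabled = invalidations_per_portal_update(6, True)
```

Under these assumptions a 6-instance farm sees 6 invalidations per portal-side update with web notifications disabled for the entity and 12 with them enabled.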

amervitz commented 5 years ago

@jayrodmcneil given everything you have said to validate the requests are being sent by CRM and received by the web app, the only thing I can think of is to compare the deployed web.config file to the version included in this project and ensure there are no significant changes; the only changes that would be expected are the addition of one or more connection string(s) and the addition of a machine key.

jayrodmcneil commented 5 years ago

Awesome guys, thanks for all the help and considerations! Recreating the web notification URL and republishing did the trick. My guess is that when the URL was originally enabled and the token generated, CRM must have tried to send the token to the portal server to register it and couldn't reach the portal at that time. When I later fixed connectivity, it would send notifications but not invalidate the cache because of the lack of a matching token. :)

Thanks again!