Closed IOSven closed 8 months ago
Hi, yes this is all a known issue with using Lucene based indexes in Azure and load balancing. This is the reason why I created ExamineX so that you can have centrally managed hosted indexes instead of local Lucene indexes per node.
I've talked about this in great detail in a couple of talks:
There is no silver bullet to using Lucene based indexes on Azure, especially if you are load balancing. To make Lucene indexes work in Azure at all, even without load balancing, a bunch of trickery needs to happen behind the scenes (i.e. %Temp% storage is required, etc...). When you add in load balancing it gets even trickier because there is no central index: there is an index per node and, as you say, they could get out of sync for all sorts of reasons. They only stay in sync in Umbraco based on Umbraco's cache refreshers. Now if you add slot swapping to the mix, then things probably get even more complex.
What is the answer? Well, ExamineX with Azure or Elastic search is the best answer since it solves all of these issues. However, if you choose to continue to try to use Lucene based indexes in Azure + Load Balancing then there might be some options, but they will require custom implementations. For the most part, indexes will stay in sync with the CM but due to slot swapping, here's what happens:
When you swap your staging for your live, your staging site will have a local index based on the staging information from your CM staging site since it has only been kept in sync with your staging CM. This means that the local index on this node will need to be rebuilt so that it is in sync with your live database data. Similarly, the nucache file will also only be in sync with your staging CM, not your live CM so I'm not sure how you are currently working around this?
Indexes (and nucache) will be rebuilt automatically by Umbraco based on whether it is a cold boot ... This would be the ideal way to deal with this scenario. If your staging site (which is in sync with your staging DB) becomes live, then it will not be in sync and a cold boot should be executed. I'm not sure why this isn't documented anywhere on the Umbraco docs site, but to force a cold boot you can clear out the umbraco/Data/Temp/DistCache folder. This is the folder that maintains txt files indicating which 'instruction' Id in the database the site is in sync with. If this file doesn't exist, then a cold boot will be initiated (based on this code https://github.com/umbraco/Umbraco-CMS/blob/contrib/src/Umbraco.Infrastructure/Sync/LastSyncedFileManager.cs). You can see this technique is used in some Umbraco tests themselves: https://github.com/umbraco/Umbraco-CMS/blob/ce769ffff4613904bcc8c65103166a722db388a5/tests/Umbraco.TestData/LoadTestController.cs#L325
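As a rough illustration of clearing that folder, here is a minimal sketch that uses only System.IO. The `ClearDistCache` helper and its content-root argument are assumptions made for this example and are not an Umbraco API. The relative path is the one mentioned above; if a custom local temp storage location is configured, the folder may live somewhere else.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of Umbraco): deletes every file under the DistCache
// folder so that the next startup finds no last-synced file.
// Returns the number of files that were removed.
static int ClearDistCache(string contentRoot)
{
    // Relative path taken from the discussion above; adjust it if your temp storage differs.
    string distCache = Path.Combine(contentRoot, "umbraco", "Data", "Temp", "DistCache");
    if (!Directory.Exists(distCache))
    {
        return 0; // nothing to clear
    }

    int deleted = 0;
    // GetFiles materializes the list first, so deleting while looping is safe.
    foreach (string file in Directory.GetFiles(distCache, "*", SearchOption.AllDirectories))
    {
        File.Delete(file);
        deleted++;
    }
    return deleted;
}

// Usage: run against the slot's content root before initiating the swap.
string root = Directory.GetCurrentDirectory();
Console.WriteLine($"Deleted {ClearDistCache(root)} file(s) from DistCache.");
```

The folder itself is left in place and only its files are removed, which keeps the sketch from interfering with anything that expects the directory to exist.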
Perhaps when a site is swapped it is not restarted, which would mean that a cold boot doesn't take place since there is no reboot? That is something you would need to investigate, and it would also depend on how you are doing the swap. If it is done programmatically, then you could probably force delete that folder and then do the swap.
Actually, looking at the swap docs, the source site is restarted, but a cold boot will probably not occur because it has its last synced file. If you could programmatically swap the slots, then you could first clear that folder and initiate the swap. This should cause a cold boot during its restart while it is now pointing to your production database. Alternatively, you could utilize custom warm-up https://learn.microsoft.com/en-us/azure/app-service/deploy-staging-slots?tabs=portal#Warm-up and initiate index rebuilds. FYI, this is how Umbraco rebuilds indexes on startup so you probably don't want to conflict with its own operations https://github.com/umbraco/Umbraco-CMS/blob/contrib/src/Umbraco.Infrastructure/Examine/RebuildOnStartupHandler.cs
This handler waits for one minute after the first http request is made to initiate the rebuilds (so that it doesn't interfere with site bootup/loading). If it is a cold boot, it will rebuild all of them, else only empty ones. You could potentially copy the code in this handler, remove the default one and add your own with custom logic to force cold boot re-indexing if you know the site has just been swapped.
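The decision that such a custom handler would make can be sketched on its own, independent of Umbraco. Everything here is hypothetical: `SelectIndexesToRebuild`, the `justSwapped` flag (how you detect a swap is up to you), and the document counts are made up for illustration. The sketch only mirrors the rule described above: rebuild everything on a cold boot or a known swap, otherwise only the empty indexes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical decision helper, not Umbraco's implementation.
// documentCounts maps an index name to its current document count.
static IReadOnlyList<string> SelectIndexesToRebuild(
    IReadOnlyDictionary<string, long> documentCounts,
    bool isColdBoot,
    bool justSwapped)
{
    if (isColdBoot || justSwapped)
    {
        // Cold boot or a known slot swap: every index is considered stale.
        return documentCounts.Keys.ToList();
    }

    // Normal boot: only indexes without any documents are rebuilt.
    return documentCounts.Where(kv => kv.Value == 0).Select(kv => kv.Key).ToList();
}

// Usage with made-up document counts.
var counts = new Dictionary<string, long>
{
    ["InternalIndex"] = 1200,
    ["ExternalIndex"] = 0,
};

Console.WriteLine(string.Join(", ", SelectIndexesToRebuild(counts, isColdBoot: false, justSwapped: false)));
Console.WriteLine(string.Join(", ", SelectIndexesToRebuild(counts, isColdBoot: false, justSwapped: true)));
```

Keeping the decision in a pure function like this makes it easy to unit test separately from the timing and notification plumbing of the real handler.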
These are only ideas I'm coming up with, but essentially, this is all based on Umbraco logic, not Examine.
Hi @Shazwazza ,
first of all thank you for the clear explanation! We've currently added a custom workaround where we use Hangfire to run a recurring Examine health check background task every hour. On top of that we also run a single Examine health check background task on startup with a delay of 3 minutes to give Umbraco some time to start up.
This health check is based on this piece of Umbraco code. We simply check the document count / fieldContent / isHealth.Success bool, and if there is any problem we execute the following code:
    if (_indexRebuilder.CanRebuild(indexName))
    {
        // Rebuild the index and report it in the Hangfire console.
        _indexRebuilder.RebuildIndex(indexName);
        _logger.WriteHangfireConsole(LogLevel.Information, performContext, $"The index '{indexName}' is being rebuilt in the background.");
    }
We're also looking to upgrade to Umbraco v13 and we're wondering if this bugfix would help us at all since they mention the following:
This should help with Azure App Slot Swapping indexing locking as the SiteName property can be made sticky to each slot with a different value (as we can do for the published cache already). It also can be used when debugging locally with multiple launch profiles
Thanks! Sven
@IOSven yes that change will help because of how the DistCache files are named along with the naming conventions for the index folders. This is probably why nucache works for you today with slot swaps but not Umbraco.
We've currently added a custom workaround where we use hangfire to run a recurring examine health check background task every hour. On top of that we also run a single examine health check background task on startup with a delay of 3 minutes to give Umbraco some time to startup.
Please be wary of rebuilding indexes too often. Rebuilding should only be done when necessary. There is a heavy database penalty for the queries it executes, and because those queries take a long time they can cause db lock timeouts for your editors if someone is actively trying to edit content.
We simply check the document count / fieldContent / isHealth.Success bool, and if there is any problem we execute the following code
But how does this check if the index is in sync with the CM database?
Hi @Shazwazza,
We're indeed only rebuilding our indexes if we really have to when the document count is 0.
We're running this examine health check on both our CD & CM environments. The health check doesn't currently check if the index is in sync with the CM database, but instead the automatic job only checks if their own indexes are healthy with correct document/field count etc.
Would there be a better alternative workaround maybe that you could think of?
Thanks!
It's just that your original question directly relates to your indexes getting out of sync. It sounds like the health checks you have implemented don't actually check whether they are in sync or not, only whether they are empty. Please note, Umbraco will automatically rebuild them if they are empty on startup so you shouldn't have to handle that yourself either.
To determine if your indexes are in sync would require some custom logic that doesn't currently exist. There would be a few ways to try to do that but the most ideal way would be for the node to simply query the local index for a specific record with a specific value and if it didn't match it would mean it is out of sync. How you would do that or other alternatives I'll have to leave up to you. Again, the most ideal way to deal with indexes, load balancing and azure is to use a hosted search service like Azure/Elastic search and use ExamineX, then there's nothing to worry about.
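The "query the local index for a specific record" idea could be sketched roughly as follows. All names are hypothetical: `IsIndexInSync` and both lookup delegates are stand-ins for whatever you would actually use (for example a search against the local index on one side and a query against the CM database on the other), and the sample values are invented.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical probe: compares one known record's value in the local index with
// the authoritative value from the database. Both lookups are passed in as
// delegates so the sketch does not depend on Examine or Umbraco.
static bool IsIndexInSync(
    string probeId,
    Func<string, string?> readFromLocalIndex,
    Func<string, string?> readFromDatabase)
{
    string? expected = readFromDatabase(probeId);
    string? actual = readFromLocalIndex(probeId);

    // A record missing on either side, or a differing value, counts as out of sync.
    return expected is not null && string.Equals(expected, actual, StringComparison.Ordinal);
}

// Usage with in-memory stand-ins for the database and a stale local index.
var database = new Dictionary<string, string> { ["1234"] = "2024-05-01T10:00:00Z" };
var staleIndex = new Dictionary<string, string> { ["1234"] = "2024-04-28T09:30:00Z" };

bool inSync = IsIndexInSync(
    "1234",
    id => staleIndex.TryGetValue(id, out var v) ? v : null,
    id => database.TryGetValue(id, out var v) ? v : null);

Console.WriteLine(inSync ? "Index is in sync." : "Index is out of sync; consider a rebuild.");
```

A single probe record keeps the check cheap, which matters given the cost of rebuilds mentioned earlier, but it can only detect drift that affects that record. Which record and which value make a good probe is a design choice left open here.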
Hi @Shazwazza,
We've been experimenting with ExamineX and we also bought a paid license. Everything was up and running without problems on our test/acc environments, but unfortunately we noticed performance issues on our production environment.
The implementation of ExamineX in our project is currently on hold until further investigation. We will create an issue under the ExamineX issue tracker when we have more information.
Thanks @IOSven for the info. Happy to assist on the ExamineX tracker regarding any performance investigations. The only performance concerns with ExamineX would simply be latency due to HTTP requests when searching or indexing but there is far less overhead on the local CPU than Examine since there is no underlying Lucene engine. Would be interesting to see where your bottlenecks are/were.
Which Umbraco version are you using? (Please write the exact version, example: 10.1.0)
v10.8.2
Bug summary
Hi, we keep struggling with the use of Lucene indexes on our load-balanced environment.
For each environment (test, acceptance, production) we've got 2 separate web apps, one for content delivery (= CD = FE only) and one for content management (= CM = Backoffice only).
We've added the recommended configurations for maindomlock, localTempstorageLocation & luceneDirectoryFactory as described on this page: https://docs.umbraco.com/umbraco-cms/fundamentals/setup/server-setup/load-balancing/azure-web-apps
We've also configured an explicit schedulingPublisher & subscriber as mentioned in: https://docs.umbraco.com/umbraco-cms/fundamentals/setup/server-setup/load-balancing/flexible-advanced
We noticed that the Examine indexes on our CD web apps are not in sync with the Examine indexes on our CM web app, even though we're only using the default internal & externalIndex, which we've only extended by using the TransformingIndexValues event handler.
Example:
How do we notice that our CM/CD webapp examine indexes are not in sync?
After some more investigation, we notice (or at least think) that the problem has to do with the swapping in combination with the load balancing.
We've already created an Umbraco support ticket to further escalate this issue since our client is becoming impatient as they encounter this problem every day on their production environment. Umbraco support gave us the following update:
Umbraco support:
My answer:
Umbraco support:
Unfortunately, Umbraco support cannot provide us with a specific example of this code implementation since they mention they do not have any documentation on it. Do you have an idea if this would fix our problem and, if so, how we could implement this?
Also - Currently we're not using swapping on our production environment, but this environment is also affected by the same issue.
Thanks in advance! Sven