Particular / ServiceControl

Backend for ServiceInsight and ServicePulse
https://docs.particular.net/servicecontrol/

ServiceControl cannot run on Linux Docker container #1754

Closed tcomben closed 1 year ago

tcomben commented 5 years ago

We cannot utilize the benefits of NServiceBus monitoring and debugging without being able to run ServiceControl in a Linux Docker container.

janovesk commented 5 years ago

Hi, @tcomben!

Unfortunately, some parts of the ServiceControl implementation are coupled to a Windows runtime ATM. We are very aware that this affects people's efforts to utilize containers and are moving towards relaxing these constraints in the future.

Can't give you a specific date or release, but rest assured we're on this.

danielmarbach commented 5 years ago

@tcomben We'd be curious to understand what limitations in your environment only allow you to host Linux containers. Can you elaborate a bit on your situation? This would help us a lot to better understand the decision making of companies that are doing .NET, that arguably came from Windows (where most .NET / C# developers have a lot of background as well), and that are now moving to Linux only.

HSIChaosMonkey commented 4 years ago

@danielmarbach

I also raised this concern about a month ago. We've spent quite a bit of time evaluating Windows-based containers over the course of the last 2-3 years. We've determined they are inadequate for our needs for a number of reasons. Even in a more limited role like ServiceControl/Pulse, they introduce limitations or complexity at a cluster management level (e.g. Kubernetes). As a result, the next generation of our products is being implemented using Linux containers and .NET Core exclusively. A pair of our most senior developers are investing in a POC to ensure we have an accurate assessment of these tradeoffs based on the current capabilities of both Windows and K8s. But it would be really helpful for Particular to establish/share their design intentions for allowing these services to run on Linux.

thoraj commented 4 years ago

@danielmarbach

I'll also chime in with my $0.02.

We use .net core based services deployed to a run-of-the-mill AKS cluster. The AKS cluster has a single nodepool set up to run Linux containers. This is what you get using defaults when setting up the cluster. Running .Net core in linux based containers in AKS works very well. Since this works well we do not wish to change to a windows container only cluster as it feels like leaving the "happy path".

There is a possibility to mix container types by having the AKS cluster use multiple node pools, one of which may be set to run Windows containers. So even though it is technically possible to set up multiple node pools and configure them so ServiceControl will be scheduled on a Windows node, it will increase the cluster complexity and management burden.

We're really hoping that in the future we can run the ServiceControl and all Particular components in linux clusters.

boblangley commented 4 years ago

Thank you both for the feedback and insight it brings. This information is valuable as it allows us to understand a broader context while we progress ServiceControl over the hurdles required to support Linux.

brookpatten commented 4 years ago

@danielmarbach

I'll respond as well. We're migrating from windows VMs to azure app service linux for hosting our .net core endpoints. If not for needing servicecontrol we would not need to maintain any VMs at all.

There are 3 main reasons we'd like to do away with the Windows VMs. We're trying to get out of the "OS Maintenance" business entirely and just use containers maintained by our cloud platform. We also would like to remove "Windows administration" from the list of skills needed by our team to deploy our application. Finally, there is cost: it's much more cost-effective for us to only pay for a lightweight container like Alpine to host an app rather than a whole VM.

We would very much like the ability to run servicecontrol in a linux container.

richard-bf commented 4 years ago

I have a couple of clients in the same boat, migrating to .net core => docker. Having this in a Linux container would greatly simplify things.

schehlmj commented 4 years ago

Running on linux/kubernetes would be very helpful. Is supporting running on linux on the roadmap? Is there a release/timeframe expected?

kbaley commented 4 years ago

@schehlmj It's on our radar and I've made a record of your interest to help with the prioritization. At the moment, we don't have any firm timelines but if you follow the issue, we'll update it when we have more concrete information.

WilliamBZA commented 3 years ago

We've added Dockerfiles for windows for ServiceControl. At the moment you will have to build them yourself, but we'll be pushing them to dockerhub as part of our build process in the not too distant future.

We still can't yet run them on linux, but we're actively working towards being able to.

You can find the dockerfiles in the src/docker folder, and instructions on how to build and run them in the readme.

We will update on this issue again once we change our build tools to push the images to dockerhub automatically.

thoraj commented 3 years ago

Any ETA for ServiceControl in Linux based containers?

We just launched our service, and ServicePulse is sorely missed.

WilliamBZA commented 3 years ago

Still no concrete ETA for linux containers. The only remaining blocker is our use of Raven 3.5. We are not yet ready to move to a .netcore version of Raven as we ran into performance problems while attempting the upgrade. We're re-evaluating the alternatives and a way forward. We will not stop trying to migrate ServiceControl to .netcore until we have achieved it.

schehlmj commented 3 years ago

@WilliamBZA Just checking to see if there is any ETA for Linux containers. Any update? Thanks!

WilliamBZA commented 3 years ago

No, still no ETA. The data storage is still the blocker.

jasondentler commented 3 years ago

ServiceControl is the only part of our tooling that isn't either a hosted SaaS offering or a linux container.

A Linux-compatible RavenDB has been available for quite a long while, so this capability isn't blocked by the underlying technology any longer.

WilliamBZA commented 3 years ago

Update March 2022: The blockers identified in this comment have been removed in a later version of RavenDB 5. See this comment for details.


Hi everyone, sorry for the late reply. We wanted to make sure we didn't miss anything in our explanation of why we have not updated ServiceControl to use RavenDB 5, which would allow ServiceControl to run on Linux. We hope that by explaining the difficulties we experienced, and the alternatives we're pursuing, you'll understand why we can't update ServiceControl to use RavenDB 5.

TL;DR

When the document expiry process is running, the performance of RavenDB 5 does not meet the requirements of ServiceControl.

We tried to do it

We spent several weeks updating ServiceControl to use RavenDB 5, reworking the underlying storage engine in the process. We started to see performance problems as soon as we had the changes in place and fully tested.

Some of the performance problems are described next.

Document expiry significantly impacts ingestion performance

We used the built-in document expiry feature that ships with RavenDB 5 for two reasons:

  1. We don't own the RavenDB process and can no longer easily plug in our own retention policy assembly.
  2. Attachment deduplication allows us to clean up message body attachments when cleaning up processed messages by default.

We found that when the document expiry process runs, it locks the database and prevents all other transactions from completing. This often leads to bulk insert operations timing out after 30 seconds. Unfortunately, this problem seems to get steadily worse. The document expiry process causes large fluctuations in audit ingestion performance and, eventually, ServiceControl crashes.

The following diagrams show audit ingestion performance over time. In the middle, we hit the 24 hour mark, which is our audit retention period. At that point, the first documents start to expire.

This diagram shows 15 minute moving average throughput. You can see how spikes become smaller, and the gaps between them larger, as more and more expired documents are removed:

[image: 15-minute moving average of audit ingestion throughput over time]

This diagram shows that batch size becomes unstable when the document expiry process is active:

[image: audit ingestion batch size over time]

Note that, even before the document expiry process ran, ServiceControl was able to maintain an ingestion rate of 100 messages/second. This is the bare minimum we consider acceptable.
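For readers who want to reproduce a chart like the first one from their own ingestion counters, here is a minimal sketch of a 15-minute trailing moving average. It is not how ServiceControl or the test harness computed the diagrams; the function name `moving_average_throughput` and the `(timestamp, count)` sample shape are assumptions made for illustration.

```python
from collections import deque


def moving_average_throughput(samples, window_seconds=15 * 60):
    """Yield (timestamp, messages_per_second) over a trailing window.

    `samples` is an iterable of (timestamp_seconds, message_count) pairs in
    time order, where each count covers the interval ending at the timestamp.
    This shape is an assumption for the sketch, not a ServiceControl format.
    """
    window = deque()
    total = 0
    for ts, count in samples:
        window.append((ts, count))
        total += count
        # Evict samples that fall outside the trailing window.
        while window and window[0][0] <= ts - window_seconds:
            _, old_count = window.popleft()
            total -= old_count
        # Until a full window has elapsed this under-reports the rate.
        yield ts, total / window_seconds
```

Feeding it one sample per minute of 6,000 messages each corresponds to a steady 100 messages/second once the window has filled; gaps caused by stalled bulk inserts would show up as dips in the yielded rate.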

Runaway index journal significantly impacts startup performance

The MessagesViewIndex write-ahead journal grows continuously while the system is ingesting audit messages.

When the database is initially loaded after a restart, the entire write-ahead journal is consumed to rebuild state. Until that is done, the database is locked, effectively preventing ServiceControl from ingesting new messages or responding to queries. During this time, RAM usage steadily increases in steps, drops to zero, and then steadily increases in steps again. This can happen multiple times before the database becomes available and it's not clear how long the process will take.

In one test we saw the write-ahead journal reach 37 GB. Raven DB took one hour to consume it after a restart.

This problem is exacerbated by enabling full-text indexing on message bodies, which causes the write-ahead journal to increase in size much faster.

We've raised this with Hibernating Rhinos and they are working on a fix that will prevent Raven DB from re-using a journal file that is "too big". At this time, we are not certain that this will stop the journal file from growing continuously.

We tested a pre-release build and the journal was significantly smaller. Given the same inputs, the size decreased from 37 GB to 1.6 GB. It still took ServiceControl DB 3.5 minutes to become responsive after a Raven DB restart.
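As a rough back-of-the-envelope check on the numbers above, the sketch below compares the two runs. It assumes decimal units (1 GB = 1000 MB) and treats the 3.5 minutes until the database became responsive as if it were all journal replay, which may not be exactly true; the helper name `replay_rate_mb_per_s` is made up for this example.

```python
def replay_rate_mb_per_s(journal_gb, minutes):
    """Average journal consumption rate, assuming 1 GB = 1000 MB."""
    return journal_gb * 1000 / (minutes * 60)


# Figures quoted in the comment above.
before_rate = replay_rate_mb_per_s(37, 60)    # 37 GB journal, one hour
after_rate = replay_rate_mb_per_s(1.6, 3.5)   # 1.6 GB journal, 3.5 minutes

size_ratio = 37 / 1.6    # how much smaller the journal became
time_ratio = 60 / 3.5    # how much faster startup became

print(f"before: {before_rate:.1f} MB/s, after: {after_rate:.1f} MB/s")
print(f"journal shrank {size_ratio:.1f}x, startup sped up {time_ratio:.1f}x")
```

Under those assumptions the startup time shrank by a somewhat smaller factor than the journal size, which suggests startup is still dominated by journal replay rather than by some fixed cost.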

Bulk insert operations are not auto-flushed

For performance reasons, all inserts are made using a RavenDB bulk insert operation. In discussion with Hibernating Rhinos, we learned that using a long-lived bulk insert operation is better than many short lived ones. When we tried that, we discovered that a bulk insert operation does not flush a record to the server until the next record is added or until the bulk insert operation is closed. That means it's not safe to release messages in a batch until the bulk insert operation has been completely disposed.

We did a proof-of-concept where we changed the behavior to push messages to the bulk insert operation immediately, as soon as they arrive (using a channel to avoid multi-threaded calls), and to flush the bulk insert operation when there are no longer any incoming messages. The proof-of-concept did not show any differences in behavior at a throughput of 100 messages/s.
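The shape of that proof-of-concept can be sketched roughly as follows. This is an illustration only, not the ServiceControl code and not the RavenDB client API: `FakeBulkInsert`, `ingest`, and the `None` shutdown sentinel are hypothetical, and an `asyncio.Queue` stands in for the channel.

```python
import asyncio


class FakeBulkInsert:
    """Hypothetical stand-in for a long-lived bulk insert operation."""

    def __init__(self):
        self.buffered = []   # stored but not yet flushed to the "server"
        self.flushed = []    # safely persisted

    async def store(self, doc):
        self.buffered.append(doc)

    async def flush(self):
        self.flushed.extend(self.buffered)
        self.buffered.clear()


async def ingest(channel, bulk, acked):
    """Single consumer: store each message immediately, flush when idle."""
    pending = []  # stored but not yet safe to acknowledge
    while True:
        msg = await channel.get()
        if msg is None:  # hypothetical shutdown sentinel
            break
        await bulk.store(msg)
        pending.append(msg)
        if channel.empty():
            # No more incoming messages right now: flush, then release.
            await bulk.flush()
            acked.extend(pending)
            pending.clear()
    # Flush whatever is left before shutting down.
    await bulk.flush()
    acked.extend(pending)


async def main():
    channel = asyncio.Queue()
    bulk, acked = FakeBulkInsert(), []
    consumer = asyncio.create_task(ingest(channel, bulk, acked))
    for i in range(5):
        await channel.put(f"audit-{i}")
    await channel.put(None)
    await consumer
    return bulk, acked
```

Running `asyncio.run(main())` returns the fake bulk insert and the list of acknowledged messages. The point of the design is that a message is only acknowledged after a flush that covers it, and the single consumer reading from the channel avoids concurrent calls into the bulk insert operation.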

What does this mean for you?

After realizing RavenDB 5 is not suitable for ServiceControl, we started work on using other storage solutions. Our first goal is to enable Azure users to use CosmosDB. This will, for example, allow you to run ServiceControl as either a Windows or Linux Docker image in Azure, using CosmosDB for storage instead of RavenDB. After this first milestone is complete, we'll begin work on other storage solutions.

This is important to us. ServiceControl is a key component in the Particular Service Platform and we want you to be successful with it. We understand the frustration of having to spin up a Windows VM or Docker container solely for the purpose of running ServiceControl when everything else runs in Linux or a PaaS service. We look forward to the day when these restrictions are lifted. Until then, we'll keep working toward that goal and ensure you're informed about our progress.

How can you track our progress?

Two ways:

  1. Subscribe to notifications on the GitHub issue.
  2. Email us at support@particular.net requesting to be notified about updates.

We'll provide regular updates on both channels.

jasondentler commented 3 years ago

Thanks for the detailed explanation of your struggles with RavenDb 5.

If CosmosDB was an option for me, I could spin up the existing Windows containers anyway. Unless I've missed something, I don't see how this move reaches anyone new. Why not invest in a database that's already containerized (for the self-hosted K8s) and offered as a service by several cloud providers?

WilliamBZA commented 3 years ago

CosmosDb is the first step, it's not the end. We will likely end with a number of different storage offerings to match different environments and configurations.

As for reaching anyone new: This unlocks containers (both windows and linux) for all Azure users. At the moment, these users are completely blocked by the lack of durable storage in ACS, and a need for Server 2019 as base images. Meaning at this point, Azure users need to have an actual VM to use ServiceControl.

Varorbc commented 2 years ago

@WilliamBZA Just checking to see if there is any ETA for Linux containers. Any update? Thanks!

thoraj commented 2 years ago

Our service/stack must be deployable without access to Azure. For this reason we use RabbitMQ for transport and PostgreSql for persistence.

Is there any chance of ever getting ServiceControl using PostgreSql? Is something like this being investigated or planned? Or will k8s deployments always be required to use CosmosDb?

DavidBoike commented 2 years ago

@Varorbc unfortunately no, we don't have an ETA.

@thoraj It's possible we might do that at some point, but as @WilliamBZA stated above, currently the focus is on customers in Azure.

Varorbc commented 2 years ago

@DavidBoike where can I get ServiceControl for CosmosDB?

kbaley commented 2 years ago

@Varorbc It's not available yet. We're still reviewing options for cloud customers and CosmosDB is definitely on the list. We'll update this issue when we have something concrete.

NArnott commented 2 years ago

Just wanted to add our desire for a non-azure linux container option. We're fully in AWS, and won't have CosmosDB as an option. Otherwise, we're currently set up with a linux-only K8s cluster and would really like to deploy this inside the cluster.

cquirosj commented 2 years ago

@NArnott thanks for your input. CosmosDB is only the first goal; more storage solutions will follow.

jbakholt commented 2 years ago

This gets my vote too. Running Service Control in a Linux Container with external persistence would check a lot of boxes.

Would Elastic Search be a good fit for Service Control Persistence?

mauroservienti commented 2 years ago

We’d like to provide an update on our path towards supporting Linux containers for ServiceControl.

We recently completed a set of spikes to test ServiceControl audit ingestion using the latest version of RavenDB 5 and SQL Azure.

Audit ingestion was selected for the spikes because it is the ServiceControl feature that puts the biggest pressure on the storage engine.

We discovered that the latest version of RavenDB 5 removes the blockers identified when we originally tested it.

Using SQL Azure appears to be feasible although storing message bodies in the database may be expensive.

In summary, we’re confident we’ve found at least one path to running ServiceControl in Linux containers but we still need to spike more storage options before deciding which to select.

schehlmj commented 2 years ago

For our planning, is there a milestone that this is expected to be in?

mikeminutillo commented 2 years ago

@schehlmj we don't have anything to announce at the moment, but stay subscribed to this issue and you will be informed when there is progress

moanrose commented 2 years ago

Yes! The people want Service Control on linux containers!

I would vote for a storage option that is cloud agnostic and containerized (RavenDB)

Keep up the good work!

JanosNollFD commented 2 years ago

I am also definitely looking forward to having Service Control in containers, whenever it is out then we are ditching the Windows VMs for sure!

kbaley commented 2 years ago

Hi all

We're upgrading the database to RavenDB 5 as outlined in an earlier comment. This work is starting shortly. With that major roadblock out of the way, we'll be in a better position to get ServiceControl capable of running on a Linux container so everyone can ditch their Windows VMs.

andrewgeller commented 2 years ago

@kbaley Can you provide any updates on progress?

Varorbc commented 2 years ago

I found that Sql Server and Raven 5 storage are available, but there is still no work to upgrade the target framework.

timbussmann commented 2 years ago

@Varorbc the work on upgrading to Raven 5 is still ongoing and hasn't shipped at this point.

We're actively working towards removing blockers to run ServiceControl on Linux, it's something we're definitely very eager to support. We appreciate your patience and fully understand (and absolutely share) your excitement about running ServiceControl on Linux/Linux containers.

mauroservienti commented 1 year ago

Status Update

We recently merged https://github.com/Particular/ServiceControl/pull/3118. That was the last step to add support for RavenDB 5 for audit instances. RavenDB 5 support will be soon released in 4.26. Documentation is all in place, and we are now smoke-testing the bits before the final release.

We have documentation covering the zero-downtime upgrade process.

What's next?

The next step is to provide the same RavenDB 5 support for primary instances (the ones storing failed messages). That will unlock upgrading the target framework, Asp.NET, and SignalR to versions enabling Linux container support.

mauroservienti commented 1 year ago

Status update

We just released ServiceControl 4.26.0. See the announcement here.

tedvanderveen commented 1 year ago

@mauroservienti can you share any ETA for cross-platform support / .Net 6 / Linux Docker?

boblangley commented 1 year ago

@tedvanderveen Unfortunately we cannot share an ETA. We will update this case as we complete the next steps toward cross-platform support.

RoderickIveans commented 1 year ago

It is sad that more than 3 years later, this issue still hasn't been solved

gl1tch commented 1 year ago

@boblangley if you can't share an ETA, at least what are the next steps to getting towards service control running on linux

kbaley commented 1 year ago

The plan remains as outlined here:

The next step is to provide the same RavenDB 5 support for primary instances (the ones storing failed messages). That will unlock upgrading the target framework, Asp.NET, and SignalR to versions enabling Linux container support.

The work to migrate primary instances to RavenDB 5 is about to get underway.

We are all very eager to have this work done. There's a good chance the internal collective sigh of relief that comes when we close this issue will affect weather patterns in certain parts of the world. And as you can see from the issue history, we are trying to be diligent in updating this issue for our progress. For me personally, it's not a good feeling to see people frustrated at our progress but all I can offer at this point is the knowledge that most of us want this done and are doing what we can to help everyone remove that one last VM from their environments.

Don't stop leaving comments, everyone. We'll keep chiming in when we have news to report on this.

mauroservienti commented 1 year ago

Status update

We recently released ServiceControl 4.31, which is another step toward containers support:

bh3605 commented 1 year ago

It's been two years since I've read this chain. We have ServiceControl running on a Windows VM. It's been a pain in the butt. We hope to run ServiceControl in a Linux Docker container. In order to run on Linux you need to upgrade RavenDB to 5. You reorganized your Docker deployments so they weren't so large. Now you're going to work on supporting RavenDB 5, which will allow us to run ServiceControl on Linux, allowing a whole bunch of us to throw out our Windows VMs.

What's left after adding RavenDB 5 support?

mikeminutillo commented 1 year ago

A collection of smaller refactorings to run it on .NET 6/8. Switching the version of SignalR and ASP.NET. Removing references to the registry and event log. RavenDB 5 support is the largest amount of work remaining and once we get that done we'll have a better view of what's left.

thoraj commented 1 year ago

It's been 18 months since my last input here.

But just to be clear; we are still waiting for this to be supported. The pain of not having ServiceControl in Linux containers is getting increasingly worse.

So naturally we will be very happy customers when this finally arrives 👍

tedvanderveen commented 1 year ago

Supporting Linux Containers for ServiceControl may take a dent out of the Particular Cloud offering business, as it makes things much easier to run things yourself in your own Cloud. But I assume that would have nothing to do with the implementation taking so long.

bh3605 commented 1 year ago

@tedvanderveen Where on their website do they talk about offering a cloud solution? https://particular.net/

mikeminutillo commented 1 year ago

Hi everyone,

You've been telling us you want to run ServiceControl in Linux containers, and while we’ve been working on that for quite some time, we haven't communicated our progress and status the way we should have.

All that changes now. We will do better.

Here’s a bit of the back-story: as it became clear that Microsoft wouldn’t sort out the Windows Container story as we expected, we started the migration efforts to a RavenDB version that could run on Linux a few years back. Unfortunately, this wasn’t successful since we encountered some blocking performance issues with the database and it also highlighted that ServiceControl had to be refactored to separate audit storage from the other storage needs for the migration to be successful. We worked with the RavenDB team to sort out the performance issues and by the time they were resolved, we also had been able to introduce the separate Audit instances for ServiceControl.

Here’s where we are now: as you may have seen, ServiceControl Audit instances now support RavenDB 5, enabling them to run on Linux. We’re currently working on updating the storage for ServiceControl Error instances to RavenDB 5, and we’ll soon finish replacing SignalR as the eventing mechanism between ServiceControl and ServicePulse.

Here's how to track our progress going forward: we want you to have a clear and clean place to track our progress where you will be able to subscribe to notifications, which will only happen when actual progress has been made. To provide that, we've created a new "locked" roadmap issue which describes the work done and the work remaining, and we are closing this issue so that it's clear which is the one place people need to go to stay informed. We will leave a comment on the roadmap issue whenever progress is made, so be sure to subscribe to it. Please use the Particular Discussion Group for any questions, comments, or concerns.

When the work described in the roadmap issue is complete, we will have full support for Linux containers. Based on the remaining work we are 80% confident that we will be able to provide Linux container images for all types of ServiceControl instances in 7 to 10 months, which works out to somewhere in the first half of 2024.

As we hope you can see, we are close and this topic will remain our primary focus until it’s done.

From all of us here,
in Particular

bh3605 commented 1 year ago

Thank you very much! Your comment is great! It's comforting to know you realized this journey needed a clear roadmap for us to track, that you are serious about this, and that you are taking the right steps to see it through.