dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License

Orleans best practices for production usage #6178

Closed pherbel closed 2 years ago

pherbel commented 4 years ago

I would like to collect best practices for running Orleans in production. The Orleans documentation and samples cover some of these topics, but it would be great to have all the information in one place. I hope Microsoft's internal teams and the community can together create recommendations for running Orleans well in production.

I would love to create documentation for this.

Please cc all the parties who can help us and have production experience.

Some topics to start with; I'm sure others are missing:

Silo vs. client specific topics

Orleans streams topics

Standalone deployments

Cluster orchestration deployments

Cloud provider specific things

pherbel commented 4 years ago

@ReubenBond, @sergeybykov Could you please help bring in people who have production experience? Thanks

jkonecki commented 4 years ago

Happy to help - I've been running Orleans in production for over a year now.

rvplauborg commented 4 years ago

It would be great to have an example of state migration in a real setup with a given storage provider, if anyone has the time and experience :)

pherbel commented 4 years ago

@jkonecki That's great! Could you please describe your setup by going through the topics above? Did you have any production issues you had to solve, or any custom solutions? Thanks

pherbel commented 4 years ago

@rvplauborg Thanks. State migration is a very important topic, so I added it to the list above.

daniellm commented 4 years ago

Count me in. If you can, be more specific in your questions. We've been running micro-services based on Orleans for a few years now.

zh6335901 commented 4 years ago

Can anyone share help or experience with distributed tracing? I want to trace my Orleans application with OpenTracing/OpenTelemetry, but I don't know how to start. Thanks a lot :)

pherbel commented 4 years ago

@daniellm Great! Thanks! We appreciate all your help. First of all, I would like to use this issue to collect production knowledge and experience, so that the community can use it as a main source on how to run Orleans in production.

So I hope you can describe your system along the topics above. I will do the same for our system, and maybe it can serve as a sample. Any suggestions are very welcome.

pherbel commented 4 years ago

@zh6335901 Yes, this is one of the topics we want to cover. As far as I know, @Ulriksen has quite a good solution for distributed tracing with Application Insights in their production system.

@Ulriksen Could you please help us? It would also be great to see your system described here in more detail (what you presented in your NDC talk). It is especially interesting because of the K8s hosting and distributed tracing topics.

pherbel commented 4 years ago

Here is our production system, with some thoughts.

Desc: Location intelligence platform that processes GPS coordinates and makes reactive decisions such as road network matching, geofencing, etc.

Version: 2.x

Platform: Azure - Service Fabric Cluster

Architecture: It is actually very simple. We have 3 SF services in an SF application.

(HTTPS)-->WebAPI -->(Orleans TCP)-->OrleansHost-->(HTTP)-->LocationService

WebAPI: HTTP REST endpoint (https 443), internet public

OrleansHost: Standard Orleans TCP endpoints, SF vnet private

LocationService: HTTP REST endpoint (http, 80), SF vnet private

Hosting: We use Service Fabric for cluster management.

It works great, and I think hosting on a cluster orchestrator should be one of the production recommendations. It has real benefits for Orleans hosting.

We decided to go with several smaller servers rather than a few big ones, because SF gives you better reliability options when you use more servers. Orleans also has very good performance characteristics, so latency stays good.

We decided to host every service on every SF node, and this setup works well for us.

Consideration: With this setup, co-hosting (WebAPI and SiloHost in the same process) could be a better option if the HTTP service does nothing special and does not serve files.

CI/CD: We use Azure DevOps, and it works very well with SF. Deployments are rolled out one-by-one.

I think one-by-one deployment should be a production recommendation, combined with graceful shutdown practices. I'm wondering how others are doing this.
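The control flow of a one-by-one rollout with graceful shutdown could be sketched roughly as follows. This is a language-neutral Python illustration, not Service Fabric or Orleans code: in the setup described here, Service Fabric and Azure DevOps perform these steps. The `drain`, `upgrade`, and `is_healthy` callables are hypothetical stand-ins for "let the silo shut down gracefully", "deploy the new version", and "check the node before moving on".

```python
from typing import Callable, Iterable, List


def rolling_upgrade(
    nodes: Iterable[str],
    drain: Callable[[str], None],
    upgrade: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Upgrade nodes one at a time and halt at the first unhealthy node.

    Returns the nodes that were upgraded successfully.
    """
    upgraded = []
    for node in nodes:
        drain(node)    # graceful shutdown: let in-flight work finish first
        upgrade(node)  # swap in the new version and restart the node
        if not is_healthy(node):
            # Stop here so the remaining nodes keep running the old version.
            raise RuntimeError(f"node {node} is unhealthy after upgrade; rollout halted")
        upgraded.append(node)
    return upgraded
```

Because only one node is down at a time, the rest of the cluster keeps serving traffic, and a bad build is caught after the first node instead of after all of them.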

Clustering: We use Azure Storage clustering. We use a dedicated "system" storage account for Orleans clustering, reminders, etc.

I think Azure Storage clustering could also be a production recommendation; as far as I know, it is used by multiple production systems.

Monitoring & Logging & Tracing: We use Azure Application Insights (AI) and Orleans Dashboard.

I think AI could be a production recommendation.

Considerations: It would be great to have native AI support for Orleans, as there is for ASP.NET: Live Metrics, Application Map with grains, etc., and somehow the Orleans Dashboard functionality on the AI dashboard. Orleans Dashboard is great too and we love it, but a single solution would be best. We are also not happy with the co-hosted Dashboard. We also have the option to write log messages to files on the SF nodes for debugging, but it is switched off in production by default and used very carefully.

We use ASP.NET metrics and also have some custom metrics from Orleans grains. These are also used for thresholds and alerts.

More in the next comment

pherbel commented 4 years ago

Next part

State & Storage provider: We use Azure Blob Storage for grain state.

We made the mistake of using Orleans serialization for grain state. It is not flexible, it is not human readable, and it leaves no option for external migration. I think JSON serialization should be the default and the production recommendation with Blob and Table storage persistence.

We did a state migration at some point, and it was really painful. We deployed a new version with specific migration code in the grains, and we had a migration manager that triggered the migration in batches. After the migration completed, we removed that code and deployed the new logic.
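A batch migration of this kind could be sketched roughly as follows. This is a language-neutral Python illustration, not Orleans code, and it assumes the state is stored as JSON (which is what makes migrating it outside the grain possible at all). The `version` field, the `position` to `latitude`/`longitude` change, and the in-memory `store` dictionary standing in for blob storage are all hypothetical.

```python
import json

CURRENT_VERSION = 2


def migrate_state(state: dict) -> dict:
    """Return a copy of one state document upgraded to CURRENT_VERSION."""
    upgraded = dict(state)
    if upgraded.get("version", 1) < 2:
        # Hypothetical v1 -> v2 change: split "position" into two fields.
        lat, lon = upgraded.pop("position")
        upgraded["latitude"], upgraded["longitude"] = lat, lon
        upgraded["version"] = 2
    return upgraded


def run_migration(store: dict, batch_size: int) -> int:
    """Migrate every outdated entry in `store`, batch by batch.

    `store` maps a grain key to its serialized JSON state.
    Returns the number of entries that were rewritten.
    """
    keys = sorted(store)
    migrated = 0
    for start in range(0, len(keys), batch_size):
        for key in keys[start:start + batch_size]:
            state = json.loads(store[key])
            if state.get("version", 1) < CURRENT_VERSION:
                store[key] = json.dumps(migrate_state(state))
                migrated += 1
        # A real migration manager would pause between batches, watch
        # cluster load, and record progress so the run can be resumed.
    return migrated
```

Checking the version before rewriting makes the run idempotent, so a migration that is interrupted halfway can simply be started again.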

It would be good to find out the best approach for migration. I would also like to find a way to support state archiving.

Authentication, Authorization: All AuthN and AuthZ is handled on the WebAPI surface with ASP.NET. We don't delegate security checks to Orleans grains, but we do some checks there for data consistency.

Security: We use HTTPS on the public WebAPI endpoints and use SF certificate management. All private communication runs inside the SF private VNET. The Orleans communication itself is not secured (2.x).

I think that if the cluster hosts other applications, or the network is not fully private to Orleans communication, then secure communication is a must-have and should be a production recommendation (TLS, supported from Orleans 3.0). Otherwise, all Orleans communication should stay private on a secure VNET.

I wonder what other people think about Orleans security.

Serialization: We use Orleans default serialization.

Throttling: We have custom Orleans-based throttling, which works great. We are planning to open source it in some form.
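The comment does not say how this throttling works, so the following is only a generic illustration of one common technique, a token bucket, and not the author's implementation. It is a language-neutral Python sketch; the `TokenBucket` class and its parameters are hypothetical. In an Orleans design, one such bucket could plausibly live in a grain per throttled key.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so tests can control time
        self.tokens = capacity      # start with a full bucket
        self.last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; return False to signal throttling."""
        now = self.clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A caller would check `try_acquire()` before doing the work and reject or delay the request when it returns `False`.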

Orleans streams: We started the system implementation with Orleans EventHub and Memory streams, but after some tests and a design change we removed them.

I can go deeper into some topics if anyone needs it. I will give some details about challenges and lessons learned in another comment.

daniellm commented 4 years ago

Here's our setup:

Description: Platform for storing and managing user accounts on behalf of large websites, handling various login and authentication methods, and more.

Orleans, .NET, OS: We're transitioning from Orleans 1.3 & .NET 4.5.1 on Windows to Orleans 2.x & .NET 4.7.2 on Windows. We're preparing the next transition to Orleans 3.x with .NET Core on Linux.

Platform: Using providers such as AWS and Alibaba for IaaS but not depending on any PaaS/SaaS.

Architecture: We use micro-services, most of which employ Orleans, each in its own independent cluster. Developers are able to build and deploy their services independently. Each micro-service exposes an interface as a nuget package that its clients consume and use to issue RPC calls. Micro-services discover each other's nodes using Consul. We developed a home-brewed framework for microservices that handles discovery, RPC, response caching, distributed tracing, metrics, health checks, configurations, and more. It's called "microdot" and is open-source.

Orleans-based microservices define a stateless, re-entrant worker grain that implements the service interface. Our framework translates incoming JSON-over-HTTP requests into calls to these grains. Service grains typically call internal stateful grains to perform business logic.

We have an API Gateway that routes traffic to our services, after performing authentication, authorization, throttling, geo-blocking, and more. Microservices expose metadata about their endpoints which the API gateway uses to match incoming requests to the respective service, without having a strong dependency on their interfaces.

Hosting: We use in-house orchestration tools, built on Nomad and Consul, to deploy and run our services, though we're evaluating a switch to Docker and Kubernetes.

Monitoring, Logging, Tracing: We log all calls between our microservices and between grains inside microservices, along with tracing information, and ship them to Elasticsearch in Logstash format. We use Kibana to inspect how an API call flowed across services. We use Kibana and Grafana to create dashboards and define alerts. We sometimes use the Orleans dashboard and Metrics.Net to inspect a single node.

State persistence: We use a home-brewed big-data database that runs on top of HBase and Elasticsearch and adds features such as transactions, replication, and fail-over clusters. Grains save their state using JSON over REST. Each service has its own namespace in the DB. Schema changes are handled mostly by keeping the state models forward and backward compatible, which is easy to achieve thanks to the flexible JSON serialization. Schema changes usually happen over two deployment cycles, or by using more exotic Json.NET features such as JsonExtensionData, which makes it possible to roll back a service while retaining newer data in a weakly-typed dictionary.
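The idea behind forward- and backward-compatible state can be sketched roughly as follows. This is a language-neutral Python imitation of what Json.NET's JsonExtensionData attribute does in .NET, not the team's actual code. The `AccountState` class and its field names, including the `mfa_enabled` field used in the usage note, are hypothetical.

```python
import json
from dataclasses import dataclass, field


@dataclass
class AccountState:
    user_id: str
    email: str = ""  # default keeps documents written by older versions readable
    extra: dict = field(default_factory=dict)  # fields this version does not know

    @classmethod
    def from_json(cls, raw: str) -> "AccountState":
        data = json.loads(raw)
        known = {"user_id", "email"}
        return cls(
            user_id=data["user_id"],
            email=data.get("email", ""),
            # Keep unknown fields instead of dropping them on load.
            extra={k: v for k, v in data.items() if k not in known},
        )

    def to_json(self) -> str:
        # Unknown fields are written back untouched, so a rolled-back
        # service does not erase data saved by a newer version.
        return json.dumps({**self.extra, "user_id": self.user_id, "email": self.email})
```

If a newer deployment has already saved `{"user_id": "u1", "email": "a@example.com", "mfa_enabled": true}`, an older version that loads and re-saves this state keeps `mfa_enabled` intact even though it has no field for it.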

pherbel commented 4 years ago

@daniellm Thanks! That looks like quite a complex framework.

Do you use Consul for Orleans clustering? Why are you evaluating a switch to K8s?

Which production patterns and practices would you recommend to others?

ifle commented 4 years ago

@pherbel @daniellm Thanks, very interesting

> Orleans streams: We started the system implementation with Orleans EventHub and Memory streams, but after some tests and a design change we removed them.
>
> I can go deeper into some topics if anyone needs it. I will give some details about challenges and lessons learned in another comment.

@pherbel Can you please give more details about the challenges and lessons learned?

ifle commented 4 years ago

> We decided to go with several smaller servers rather than the big ones

@pherbel Can you please give more details about your use of SF and the size of the nodes? Thanks

KarenTazayan commented 4 years ago

Hello @pherbel,

For Authentication, Authorization we are using this library. It isn't battle-tested yet, but I think it is ready to use in production with IdentityServer4. I'm working to perform more integration tests.

Ulriksen commented 4 years ago

I have documented our deployment to Kubernetes here: https://blog.ulriksen.net/deploying-orleans-to-kubernetes/ TL;DR: continuous deployment, Azure Kubernetes Service, rolling deploys.

EugeneKrapivin commented 4 years ago

@pherbel I see that @daniellm probably missed your questions, so I'll try to answer them (we work together). We aren't using Consul for Orleans clustering; we are using ZooKeeper (though we might rethink this decision in the future). As for moving to K8s, there are two reasons. The first is financial: Linux servers are cheaper than Windows, as you may know. The second is that it would allow us more robust deployment, scaling, and orchestration schemes than are currently possible.

pherbel commented 4 years ago

@EugeneKrapivin I see, Thanks

@Ulriksen Thanks for sharing!

@KarenTazayan Looks interesting. I will take a look at it. Thanks

pherbel commented 4 years ago

@ifle Service Fabric has quite detailed documentation on capacity planning and reliability levels.

https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-best-practices-capacity-scaling

https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-capacity

In a nutshell: in our solution we used 7 VMs (2 CPU cores, 7 GB RAM). We chose the smaller VM size because it offers better reliability options on both the SF side and the Orleans side.

But you have to test your application's CPU and memory characteristics and choose VM types accordingly. Orleans can handle quite a significant load with low latency on a few VMs, so you have to be careful if you want to be cost-effective.

turowicz commented 3 years ago

We need a good Prometheus package for Orleans.

turowicz commented 3 years ago

@pherbel we also need stream providers covered.

ghost commented 2 years ago

We are marking this issue as stale due to the lack of activity in the past six months. If there is no further activity within two weeks, this issue will be closed. You can always create a new issue based on the guidelines provided in our pinned announcement.

ghost commented 2 years ago

This issue has been marked stale for the past 30 days and is being closed due to lack of activity.