alfetahe / process-hub

Distributed processes manager and global process registry
GNU General Public License v3.0

Variations on the global theme #4

Open MarthinL opened 2 weeks ago

MarthinL commented 2 weeks ago

Hi,

Based on my fairly exotic use-case someone referred me to your project as being of possible interest.

Going by your descriptions, it would seem that in your frame of reference the term global roughly equates to cluster: the collection of linked Erlang VM instances running on (hopefully) separate (LAN-connected) hosts that are allowed to call on each other because they share the same cookie value.

My use-case differs from yours (I think) in that my global is the multitude of geographically dispersed clusters of nodes my app runs on, orchestrated by generic Kubernetes. The purpose is to give the app the requisite capacity by spreading the network access, business logic and database access around, as close to the end-user as possible.

I've designed my application's database from the ground up to be regionally distributed. I handle the required replication and updates at the application level, meaning each cluster runs a local HA database with some additional read replicas, but the different databases do not talk to each other directly, only via the app.

Having the application contact its peers via each other's public IPs routed to an HTTP API is possible, provided I can properly secure the connections with mutual TLS. But I'm looking at potentially better options whereby the inter-cluster calls happen at the Erlang/Elixir level rather than via HTTP. I'm really looking for the simplest abstraction possible. I read the region id for each record a user retrieves directly from the database, so I know which region to request it from, and I'd like to make it a single call to some smart agent that fetches the data I need from its master region.

When there's just one region per continent, network latency and bandwidth already clash with the LAN-style assumptions made by Erlang VM clustering, which by default is fully meshed. When those continents each break down into several hundred regions, the full-mesh approach collapses under the weight of its own overheads. Technically, though, any region can request data from any other region, which calls for full-mesh functionality even if the implementation isn't literally a full mesh.

I'd hope to implement a different topology, such as hub-and-spoke, or a sparse mesh based on what requests are actually flowing between regions in real time, managed much like one would manage connection pools or a cache, but with all that complexity contained behind a simple inter-cluster/region request for data or updates.

I'm reasonably certain that my use-case and the approach I'm favouring at the moment falls outside the current scope and vision of your project. Would you have any interest in expanding your vision and project in that direction? Would you let me help you implement it or help me understand how you've put your project together so I can build what I need as a fork or a PR on top of what you've done? Or is it just too far outside your scope to consider?

-- Thanks for your time -- Marthin Laubscher

P.S. Together or separate, it's very likely whoever ventures in the direction I'm suggesting would seriously consider using Partisan in the process.

sleipnir commented 2 weeks ago

Hello @MarthinL, I saw the description of your problem and I couldn't help but comment just out of curiosity that I implemented a topology very similar to your description in the Spawn project.

Basically, what we do is define a grouping, which we call an ActorSystem, and within this grouping there can be N connected Erlang nodes. Communication within the cluster uses Erlang Dist normally. But when you want to communicate with a different grouping/actor system, we forward the message via a NATS broker to that grouping. The receiving cluster in turn forwards the message via Erlang Dist within its control region.

We do this in a way that is transparent to the user calling the Actor/Process. The caller doesn't need to know whether the message was forwarded via Erlang Dist, NATS or any other way.
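The routing decision described above can be sketched as a single dispatch function: calls within the caller's own ActorSystem go over Erlang Dist, everything else is handed to a broker publisher. All module, function and registry names here are illustrative assumptions, not Spawn's actual API.

```elixir
defmodule ActorRouter do
  @moduledoc """
  Hypothetical sketch: `local_system` is the caller's ActorSystem id,
  `registry` maps actor names to `{system, node}` tuples, and `publish`
  is a broker-send callback (e.g. wrapping a NATS client).
  """
  def route(actor, msg, local_system, registry, publish) do
    case Map.fetch(registry, actor) do
      {:ok, {^local_system, node}} ->
        # Same grouping: plain Erlang Dist message passing.
        send({actor, node}, msg)
        :dist

      {:ok, {other_system, _node}} ->
        # Different grouping: forward via the broker (e.g. one subject per
        # ActorSystem); the receiving cluster re-dispatches internally.
        publish.("actors.#{other_system}", {actor, msg})
        :broker

      :error ->
        {:error, :unknown_actor}
    end
  end
end
```

The point is that the caller only ever sees `route/5`; whether the hop was Dist or broker is an implementation detail.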

We use Horde to control the cluster, but we only use the distributed registry and not the supervisor, as we have some affinity rules that need to be respected internally. So I'm looking at this excellent ProcessHub library to see if it could replace our internal Horde-based cluster implementation.

Maybe this could help in some way. I would also be happy if you took a look at our Spawn project and see if it would help you with anything. Or you may have some insight from these ideas to see how to proceed with your problem.

alfetahe commented 2 weeks ago

Hello!

Thank you for your interest!

Currently, I believe the library won't perform well in scenarios where nodes are geolocated around the world. There are some issues that need to be addressed, such as configurable timeouts or smart timeouts (which adjust themselves based on the current situation), and the system should handle situations better when messages are lost in communication.

As you mentioned, Erlang Distribution by default uses full mesh topology, which is another hurdle to overcome with very large clusters, especially if nodes are geolocated around the world. To address such problems, I was thinking along similar lines as @sleipnir. We could have smaller "hubs" in each geolocation, each with its own unique hub_id, meaning it only controls processes within that hub and not other hubs. All hubs could then be connected using hidden nodes between them to create a communication channel between each regional hub. This way, we could avoid fully meshed connections, though it might introduce other issues.
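The regional-hubs idea above can be sketched as a pure topology function: every node knows its hub_id, intra-hub nodes form a normal mesh, and only designated "bridge" nodes hold connections to the other hubs (the hidden-ness itself would come from starting those VMs with the `-hidden` flag). All names here are assumptions for illustration, not ProcessHub's API.

```elixir
defmodule HubTopology do
  @doc """
  Returns the nodes this node should connect to, given its `hub_id`,
  a `hubs` map of `%{hub_id => [node]}`, and whether it acts as a bridge.
  """
  def peers(hub_id, self_node, hubs, bridge?) do
    local = hubs |> Map.fetch!(hub_id) |> List.delete(self_node)

    remote =
      if bridge? do
        # One inter-hub connection per foreign hub, to that hub's first node.
        for {id, [head | _]} <- hubs, id != hub_id, do: head
      else
        []
      end

    %{mesh: local, hidden: remote}
  end
end
```

A real implementation would also need failover for the bridge role; this only shows how the full mesh is avoided.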

This is just me thinking out loud :) I am open to other ways of addressing these problems as well. I'm interested in trying out ProcessHub-supported process management on Kubernetes-hosted clusters too if that makes any difference.

What ideas do you have regarding ProcessHub anyways? I can explain the code itself for you, but please keep in mind that it may still change a lot over time.

Even if your ideas are outside the scope of the ProcessHub library, I would still be interested in leaving room for integrations with other possible libraries.

MarthinL commented 2 weeks ago

Ah yes, I think a deep and meaningful discussion will benefit us all though this (issue thread on Github) might not be ideal. Nevertheless, it's what we have right now so here goes...

Since reaching out I've kept looking at various tools and libraries, what each brings to the party, and what it would take to align them with my application. The frustrating part is that every library and initiative seems to be on its own mission to gain popularity by offering feature-rich solutions to the widest possible audience. Even if each library remains relatively straightforward, the combinations of libraries one might need are not aligned, so they end up addressing similar issues in different, often incompatible ways.

In the broadest of terms, I'm leaning away from anything that introduces complexities of its own and towards things that isolate and encapsulate real-world complications in a simple way. I don't really know what exact real-world problem your library simplifies, for which actual users and use-cases.

It is my personal experience and viewpoint that the complexity distributed systems are famous for is the result of choosing the wrong tools for the job: not only the wrong platform, language and frameworks, but also attempts to use distributed systems to solve centralised or monolithic problems. Yes, distributed systems can be powerful, but unless the problem you're designing for is fundamentally a clearly defined distributed problem, trying to build a distributed solution leads to a major disconnect between the problem space and the solution space.

Let me use my use-case as an example, with the disclaimer that I have no idea how many others, if any, face the same use-case, nor can I afford to care. I have a problem to solve with not even a fraction of the usual resources that others solving similar problems have.

I'm building an application for the world's people to use. I'm starting out with four donated servers from 2012, two fibre links and a really powerful concept I've been refining and redefining over three decades.

Logically my application sets up a single global database of related content and gives every user an ever-changing and highly personalised view of that content.

Physically it's a completely different matter, in the following ways:

1) The world's people are geographically spread out over (almost) the entire globe, but 2) the closer the server is to the user (in network terms) the better the user experience, 3) the better the user experience the better my chances of success, and 4) the regulatory conditions pertaining to where their data may or must be stored vary as well.

The amount of stress a single user can place on the system as a whole can be limited by having a user's LiveView session served from the same cluster, and even the same instance, unless something like a failure or outage prevents it.

Even more significant is the realisation that the stored data is regionalised as well. A select portion of a region's data will end up getting referenced in other regions, but I can neither afford nor allow all data to be replicated to all regions.

I have therefore chosen to treat maintaining a cost-effective alignment between where my application runs, how it stores and retrieves data, and the reality of how its users and their data are distributed as an application-level concern. I.e. I consider distribution a core concern which I don't outsource to something like a distributed database (e.g. YugabyteDB) or a cloud provider (e.g. AWS). This is where I suspect most would believe their use-cases and situations differ from mine, as I often see application designers shying away from doing anything they consider complex if they can make it someone else's problem.

My issue with handing off core elements of your application to third parties is that it usually means that while you thought you were solving one problem, you now have two additional problems on top of the original: the complexity of adopting a foreign toolset, and keeping track of how your core requirements map onto the facilities of this external tool. In my experience this invariably leads to a net increase in the complexity you're dealing with rather than a net reduction.

Even then, there are two ways of reducing complexity. You can ignore complexity (i.e. make simplifying assumptions, use some 80/20 rule, limit the list of things to consider to some number, or play any of the many similar games), or you can dig in properly and wrap your head around all of the complications until you find a way to encapsulate the complexity behind a clean and simple abstraction. I obviously favour the latter, but it's really hard work and rarely actually achieved in projects. The trouble with the former approach is that invariably the things that get ignored or de-prioritised are exactly the least understood areas, which is all too often exactly where the actual trouble is going to come from. It's a lot like physics textbook problems telling you to assume a frictionless environment when in reality the most pressing problems stem from friction and its variability.

What does all or even any of this mean to us? My best answer would be "it depends". I'm compelled by circumstance to focus on solving a problem that has, in general terms, already been through the simplification process. My distributed system isn't complex at all but almost trivially simple, and I'm loath to make it complex again by looking for answers in all the wrong places.

My initial assumption that I would be able to avoid geographic distribution in practice for long enough for the disterl approach to adapt to WAN realities was wrong. Either disterl is never going to suit intercontinental WAN environments, or I got to where it's needed too quickly. Either way, I cannot use what comes in the box as-is, and I need to make a plan.

My application is designed so clusters are free to request data from any other cluster, i.e. logically a fully meshed topology between clusters, which is challenging to achieve from physical, practical and cost perspectives. That translates into the need for a way to still allow my clusters to request data from any other cluster as required, knowing they will get the results in the most expedient way possible at that moment. The code that requests data from other regions should not be clouded with issues of security, network latency, stability or capacity, or with which of the other regions' nodes are up or down. The remote request module should encapsulate all those complications and more, and simply deliver the results.
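The abstraction asked for above could be as small as one callback: fetch records for a set of ids from whichever region masters them, with transport, routing and retries hidden behind it. This is a minimal sketch with illustrative names (not an existing API), including a trivial in-memory backend of the kind one might use for tests.

```elixir
defmodule RegionFetch do
  @moduledoc "Hypothetical behaviour hiding all inter-region transport concerns."
  @callback fetch(region_id :: term, ids :: [term]) ::
              {:ok, [map]} | {:error, term}
end

defmodule RegionFetch.Local do
  @moduledoc "Trivial backend: serves requests from an in-memory store."
  @behaviour RegionFetch

  @impl true
  def fetch(region_id, ids) do
    store = :persistent_term.get({__MODULE__, region_id}, %{})
    # Keep the caller's id order; silently drop ids the region doesn't hold.
    {:ok, for(id <- ids, rec = store[id], do: rec)}
  end

  def seed(region_id, records_by_id),
    do: :persistent_term.put({__MODULE__, region_id}, records_by_id)
end
```

A disterl, mTLS or relayed backend would implement the same `fetch/2` without the calling code changing.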

I say "and more" because the distributed system at any stage of deployment and maintenance, as well as the network links that interconnect regions, all have resource limitations resulting from what resources have been configured, what is currently in service, and how much capacity each involved resource has left at that moment.

I can see that in toolsets such as service mesh implementations the notion of service discovery gets a lot of air time, but I don't consider it of much significance to me. I know in advance exactly what services are available in any region where my application is deployed, and I have a 100% reliable way of identifying which region I must request data from when I don't hold it myself. What I do need is to get a current copy of the region directory/registry/database to every region so each can decode a region id into the connection details for the cluster serving that region.

I'm making the assumption that when a request for data is put together it uses the region id as the address. Resolving the id to the relevant network address, pid, name or whatever else is required is done inside the library, making the library the primary user and therefore probably the owner of the region data. If it's an external library, this requires some form of data exchange between the application, where regions are defined, and the library, where regions are resolved into network addresses based on additional data such as network link capacities, latencies and region queue lengths.
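The in-library resolution step described above could be a thin directory the application seeds from its region config and the library consults at call time. A sketch under those assumptions (names illustrative):

```elixir
defmodule RegionDirectory do
  @moduledoc "Sketch: app feeds in region definitions, library resolves ids."
  @table :region_directory

  def init(regions) do
    # `regions` is a list of {region_id, contact} tuples from the app's config.
    if :ets.whereis(@table) == :undefined do
      :ets.new(@table, [:named_table, :set, :public, read_concurrency: true])
    end

    :ets.insert(@table, regions)
    :ok
  end

  def resolve(region_id) do
    case :ets.lookup(@table, region_id) do
      [{^region_id, contact}] -> {:ok, contact}
      [] -> {:error, :unknown_region}
    end
  end
end
```

The transient data (latencies, queue lengths) would live alongside the static contact details and be refreshed by the registry's pub/sub feed.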

The region registry, with all its persistent and transient data, is required to behave very differently from the main regionalised database. Specifically, the requirements of the region registry are much more in line with classical database replication principles. It does not need to be implemented as such, but it may be useful to recognise the requirements as more mundane and possibly use off-the-shelf tools for that purpose. The region registry's data and functionality can be considered analogous to a control plane in service mesh terminology, while the requests and updates passed between regions are analogous to its data plane.

When we look inside the library's white box (input: a set of requested ids, each carrying a region id; output: either once-off retrieved records, or cacheable records with updates if they change; control: the list of participating regions and their contact details), I've indeed had some ideas about what I'd like to see happen.

First of all, I've noticed a common feature amongst works in this domain: the conflation of a node or cluster being out of service with the node being unreachable from one or more other nodes because of a network issue. Once again, I don't consider that invalid at all; in fact, for most general-purpose load-balancing situations there may be no point in distinguishing between the two conditions. It's just not exactly aligned with my reality. My regions are not dispensable replicas where, if one cannot be reached, it's just as good or even better to get the required service from another region. Each region is master to a partition of the global database. If it's unavailable for any reason, its data cannot (usually) be sourced elsewhere. But each region is deployed as a cluster, which makes it unlikely for the whole region to be down. The links into that region, however, since they traverse the internet, can be all kinds of slow or down without any notice.

Once again there are ways to mitigate most of the risks using redundancy, but having explicit redundant links everywhere not only creates an often unwarranted cost but also sets up a major planning and operational workload, which either works brilliantly because it's conscientiously run by highly skilled and motivated engineers, or fails at the worst possible moments, because who can afford such a dedicated team of engineers and somehow keep them entertained/engaged enough, right?

Rather than making the mesh network a statically designed and maintained engineering concern, I vote to piggyback on the Internet, which is already engineered and maintained as economically as can be, as a point of departure. There will always be exceptions which one can handle, but for the most part I am happy with my clusters across the world connecting via the internet. I won't want to stop there, but it's a good start.

The next step would be to monitor traffic characteristics between clusters, with the objective of having data at hand in each region giving an expected turn-around time for every cluster calling another cluster directly. It's not only the network latency that affects turn-around time, but also how long the request will be queued before the serving and intermediate nodes can get around to processing it.

Because setting up and tearing down a connection takes time, it would be a major contributor to end-to-end latency if we used a fresh connection each time we have a message to pass, in order to save on the overhead of having all clusters permanently connected and exchanging heartbeats as per disterl practice. We know our infrastructure cannot afford a permanent mesh topology, and our application cannot afford the overhead of making and breaking connections for every message. The solution is kind of obvious once proper separation of concerns has been done: the number of active inter-cluster connections we're happy to keep alive can be configured, and after that we manage those open pipes like one would manage a cache.

If we can quantify how long it takes to establish a new connection, use it and tear it down again, if we know the most recently observed latency of each direct path, and if we have a way to know whether the target cluster, and all other clusters that might relay the message, are likely to process it without delay, then we can work out the cost of different paths and choose the least costly one. If that means bouncing off multiple servers around the world to get to a node that's close but impacted by some network outage on the direct path, we still get to the node we need to get to, albeit a little slower than usual; all's well, and we resume the normal route once the outage has been resolved.
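The "manage open pipes like a cache" idea above, reduced to its pure logic: keep at most `max` live connections and evict the least recently used when a new target needs a slot. Connection setup and teardown are left abstract; names are illustrative.

```elixir
defmodule ConnCache do
  @moduledoc "LRU sketch for a bounded set of inter-cluster connections."
  defstruct max: 4, order: []

  # `order` holds targets most-recently-used first.
  def touch(%__MODULE__{max: max, order: order} = cache, target) do
    order = [target | List.delete(order, target)]

    case Enum.split(order, max) do
      {keep, []} -> {%{cache | order: keep}, :no_evict}
      {keep, evicted} -> {%{cache | order: keep}, {:evict, evicted}}
    end
  end
end
```

The `{:evict, targets}` result is where a real implementation would tear the surplus connections down.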

There are numerous network-level protocols for doing something similar, and you're well within your rights to wonder why I wouldn't jump at the opportunity to make least-cost routing the network's problem. It's a multi-part answer. Firstly, it's only when you engage the upper echelons of networking equipment that you have any chance of tight enough integration between the routing and application layers of the networking stack to even start a discussion about making the network sufficiently aware of the application's specific opportunities and concerns. At the application layer (written in a suitable manner) it is almost trivial to determine a current queue length and/or track the most recent end-to-end latency measured in a conversation between two endpoints of your own application. The moment you try to do something like that across layer boundaries, especially using protocols and interfaces designed to be as broadly applicable as possible, you're in for far more complexity than you can hope to avoid by using those so-called tried-and-tested techniques.

None of the work that needs to be done, and none of the decisions that need to be made, is particularly challenging. What the big network vendors are getting paid for isn't those small calculations and decisions, but setting up the standards-based ecosystem that exposes the capacity to make those decisions in the most general terms possible so everyone has equal access to them. We're free of those burdens, so we can take the tiny bit of "cleverness" in least-cost routing and attach our own massively simplified opportunistic data to it with a minimum of fuss. There's even a choice of libraries in Erlang to do the actual shortest-path calculation through a given graph for us.
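To make the "small little calculations" concrete, here is a tiny Dijkstra over a measured-cost graph; in a real deployment the weights would be the latency/queue figures each region gossips, and one of the existing Erlang graph libraries could be used instead. Purely illustrative.

```elixir
defmodule LeastCost do
  @doc "Cheapest path from `src` to `dst` in `graph` (%{node => %{neighbour => cost}})."
  def path(graph, src, dst), do: search(graph, [{0, src, [src]}], MapSet.new(), dst)

  defp search(_graph, [], _seen, _dst), do: :unreachable

  defp search(graph, queue, seen, dst) do
    # Naive priority queue: sort pending {cost, node, reversed_path} entries.
    [{cost, node, path} | rest] = Enum.sort(queue)

    cond do
      node == dst ->
        {cost, Enum.reverse(path)}

      MapSet.member?(seen, node) ->
        search(graph, rest, seen, dst)

      true ->
        next =
          for {n, w} <- Map.get(graph, node, %{}),
              not MapSet.member?(seen, n),
              do: {cost + w, n, [n | path]}

        search(graph, rest ++ next, MapSet.put(seen, node), dst)
    end
  end
end
```

With measured latencies as weights, a relayed route automatically wins whenever the direct path is degraded.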

So the idea would be to derive from the data, either by explicit nomination or by letting the data speak for itself, a hub-and-spoke "backbone" for the control plane. The hubs would likely end up spread around the continents, and for larger continents possibly spread across the geographical area. It's not a given that a hub would be colocated with the biggest or busiest cluster it serves, but it's likely. The best hubs would have minimal latency to the most connected spokes and the spare computing and network capacity to relay ad-hoc connections. Once the hubs and their spokes are established, they'd use the control plane to establish a Pub/Sub mechanism for the region registry.

It might help at this point to know that regions in my application are themselves structured hierarchically. They're also designed to be divided and redivided to cover smaller and smaller geographical areas with enough traffic to warrant their own clusters. I said earlier that the region database could be fodder for a traditionally replicated database, but the most refined design would have the control plane follow the regional hierarchy to incorporate new regions as they get rolled out.

The region database has two parts: a) a largely static section containing identity and contact data for the region, which originates from the configuration files of the region as it is commissioned, and b) the possibly fast-changing measurements made from the perspective of each region: its own queue length, the last known latency to each node it was able to measure, and (unless it gets conflated into the latency figure) which nodes it has open (cached) connections to. Ideally the tools used to implement the region registry would seize the opportunity created by the fact that, while the static identity data has a logical "root" region to serve as its data master, each region has its own partition of dynamic data it uniquely masters: The World According to Garp, or in this case the measurements according to each region. By using some Pub/Sub model it should be easy to arrange for:

a) New regions to register at the root region (using config data) as a child of another region (also from config data) and with a specific region id (also from config data). The root region would validate the registration (prepared for by the rollout project, to avoid mistakes and duplicates) and then publish the details of the new region to all subscribers.

b) All regions to do their own measurements and publish changes.

c) On a slow update cycle (i.e. nothing remotely as frequent as disterl heartbeats) the regions would fill in their local databases with "most recently observed" cost driver values for all regions.
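The (a)/(b)/(c) flow above can be sketched as a pure fold over registry events: registrations come via the root region, measurements are published by each region, and every region applies the events to its local copy. Event shapes and names are my own illustration.

```elixir
defmodule RegionRegistry do
  @moduledoc "Sketch: each region folds published events into its local copy."

  def apply_event(registry, {:register, id, parent, contact}) do
    # (a) Root-validated registration: static identity and contact data.
    Map.put(registry, id, %{parent: parent, contact: contact, metrics: %{}})
  end

  def apply_event(registry, {:measure, id, metrics}) do
    # (b)/(c) Measurements mastered by the region itself; ignore updates
    # for regions we have not yet seen registered.
    case registry do
      %{^id => region} -> Map.put(registry, id, %{region | metrics: metrics})
      _ -> registry
    end
  end
end
```

Because each region masters only its own `:measure` events, the slow update cycle in (c) never conflicts between publishers.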

A data-plane refinement of my own design might end up having value for this control plane as well. I initially got quite excited when I noticed that Partisan uses gossip, because I designed a gossiping network (a loose translation from my native Afrikaans) in the '80s as a varsity project aimed at using two-way radios, and I was keen to see what the old term was being used for in the modern context. Through the documentation I located the epidemic broadcast trees academic paper the gossip implementation is based on, and that's where I noticed that the way I had intended to resurrect the gossip idea in my application's context and what the gossip library actually does are quite different. No big surprise there, but it does provide some context.

The main point on which we differ is that epidemic broadcast trees pick a configurable number of neighbours to gossip with at random and then ignore any potential duplicates. My approach also picks a configurable number of targets to send to, but not at random: it picks, say, the n nearest neighbours and the m furthest (yet reachable) neighbours. It also uses a specialised library I've written as a PostgreSQL extension (written in C) that implements a type, an aggregate and casts. Using this allows me to know which regions have subscribed to what and, far more importantly, to keep track at every hop of which regions still need to be notified. That way I can minimise both traffic and latency, maximise resilience in the face of transient outages, and spread the load across the impacted nodes but not the others. Every node gets a predetermined number (usually 2) of copies of each notice. This is advanced stuff that may or may not make sense for the small data volumes in the control plane. In the data plane it is critical, and an integral part of why I am managing data distribution at the application layer.
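The non-random fan-out described above (n nearest plus m furthest reachable neighbours by last observed latency, instead of random peers) reduces to a small selection function. A pure sketch with illustrative names:

```elixir
defmodule GossipTargets do
  @doc """
  Picks the `n` nearest and `m` furthest neighbours from `latencies`,
  a %{region => last_observed_ms} map; unreachable regions are simply absent.
  """
  def pick(latencies, n, m) do
    sorted =
      latencies
      |> Enum.sort_by(fn {_region, ms} -> ms end)
      |> Enum.map(&elem(&1, 0))

    nearest = Enum.take(sorted, n)

    furthest =
      sorted
      |> Enum.reverse()
      |> Enum.take(m)
      |> Enum.reject(&(&1 in nearest))

    nearest ++ furthest
  end
end
```

The duplicate-suppression and "which regions still need to be notified" bookkeeping would sit on top of this selection.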

Where does all that get us? I'm not quite sure. I just dumped my mind in the hope of triggering something. I think it's increasingly safe to assume this is heading in the direction of a new purpose-built library that reuses some of the work you or others have done before, or possibly not even that. It's risky to assume anything about your level of interest in getting involved at all, but if you are considering it, let me add this. I don't know who your existing project has as anchor tenants, or even whether you have the benefit of real-life use-cases for what your library seeks to offer. At the very least I am offering you a genuine real-life use-case for a problem I'm irrevocably committed to completing as soon as possible, and where you'd find no lack of certainty about what needs to be done, why, and within what constraints. I reiterate my inability to predict how many others have the exact same use-case I am presenting, but at the same time I can say without much fear of contradiction that once this library is in place, given its clarity, elegance, simplicity and power, it is very likely to inspire many others caught in the fog of harnessing the potential of distributed systems to find their feet and start getting somewhere.

I'm going to have to do a version of it anyway, for myself. If you're on board we can do it together, and the result will be much better documented and easier for others to derive value from than the version that would otherwise be hidden inside my application source. It's your choice to be involved or not, and you're completely welcome to choose either way. I just need to know fairly soon so I can plan my end of things accordingly.

MarthinL commented 2 weeks ago

One more thing. I'm aiming to launch my app soon, but the initial version will run the first few regions in the same LAN environment, which keeps disterl on the table for the initial release. My plan is to introduce the abstraction right away, with a trivial backend that maps straight onto disterl primitives. That would give me, or us, a stable definition to work against when we start taking on things like mTLS, hidden node hierarchies, pooling, relays, least-cost routing etc.
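That "trivial backend that maps straight onto disterl primitives" could look like this: a transport behaviour the application codes against today, with the disterl adapter (here via `:erpc`) as its first implementation. Names are illustrative; the mTLS/relay/least-cost backends would implement the same callback later.

```elixir
defmodule RegionTransport do
  @moduledoc "Hypothetical transport behaviour for inter-region calls."
  @callback call(node :: node, mod :: module, fun :: atom, args :: list) ::
              {:ok, term} | {:error, term}
end

defmodule RegionTransport.Disterl do
  @moduledoc "Trivial backend mapping straight onto disterl primitives."
  @behaviour RegionTransport

  @impl true
  def call(node, mod, fun, args) do
    # :erpc executes locally when `node` is the local node, so this adapter
    # also works single-node during development.
    {:ok, :erpc.call(node, mod, fun, args, 5_000)}
  rescue
    e -> {:error, e}
  end
end
```

Swapping the adapter then never touches the calling code, which is exactly the stable definition described above.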

sleipnir commented 2 weeks ago

@MarthinL Excellent explanation of your use case and excellent analysis of your problem.

I was just wondering whether you agreed or disagreed with what we had suggested. Because I saw many similarities in what we said, without the same technical rigor (I don't think that was our intention anyway) and what you ended up describing.

For example, we said that a lot of it was about establishing limits on what was regional and what wasn't, and that if this was identifiable to the user then it would be simple to delegate the forwarding of the request to a non-mesh message-passing mechanism, so that it could be handled by the other region. At this point @alfetahe and I gave different implementation proposals: a pubsub or a hidden-nodes mechanism. Personally, I believe hidden nodes would not work transparently over WAN, while pubsub mechanisms would be easier to implement over WAN.

Anyway, excellent discussion, and if you are going to build something, let me know, as this is an area of great interest to me.

MarthinL commented 2 weeks ago

@sleipnir Yeah,

I was just wondering whether you agreed or disagreed with what we had suggested. Because I saw many similarities in what we said, without the same technical rigor (I don't think that was our intention anyway) and what you ended up describing.

My intention wasn't to agree or disagree either, not yet anyway. My intent was and remains as stated to start a deep and meaningful conversation about how to frame or reframe the problem and as a result the solution. Only once we're on the exact same (and most likely new) page about how whatever solution we end up with fits into the context of a real-world distributed application can we start discussing technical details such as the validity and suitability of various options.

To contextualise the above in terms of your example: you said a lot of it was about establishing limits on what was regional and what wasn't. That's neither untrue nor helpful. A designer or team considering building a distributed solution makes most of their mistakes in exactly that area, and a lot of the reasons why have to do with distributed tools and frameworks aiming to be as generally applicable as they can and, in the process, offering no guidance either. The position I'm taking contradicts that approach: the way the solution delineates distribution boundaries should be front and centre in the problem domain, or else it will end in tears.

But then I go quite a bit further, saying that from at least one real-life application's perspective, that crucial real-world distribution delineator is the geographic distribution of people around the globe, impacting the network latency between their browsers/devices and the application's closest point of presence. I'd never claim any other application would have cause to solve the exact overall problem my app is aimed at solving, but I do think that, firstly, once it has been shown as a living example how a real-life factor such as where users find themselves in the world can be drawn all the way from reality into exactly what drives how the application distributes itself, it will be a lot easier for others to treat their real-life distribution issues the same way. Secondly, I also think that a well-defined solution for distributing an application in line with how its users are distributed geographically might well lead to a fair few application teams recognising that they are facing the exact same real-life distribution issue, to the extent that they don't even have to bring their own delineation factor but can use the library as-is.

At this point @alfetahe and I gave different implementation proposals, a pubsub or hidden nodes mechanism. Personally, I believe that hidden nodes would not work transparently over WAN, while pubsub mechanisms would be easier to implement over WAN.

We're not even properly brainstorming options yet, but if we were, your differing suggestions would both have to be considered along with others, each hopefully offering a way towards the overall goal. We'd do as we were taught, and formulate the best solution by drawing on all those ideas. But first we have to have clarity on what we're aiming to do.

I agree that hidden nodes aren't suitable for a WAN. In fact, as far as I can see, the entire disterl communication layer is limited to a LAN environment at best: distribution traffic is unencrypted by default, and the usual node-discovery mechanisms assume a multicast-friendly network. It is almost a given that to cross cluster (LAN) boundaries we will require a different transport. There are some established patterns for that in the form of Partisan and Submariner/Lighthouse, even in Istio and Linkerd, but as I've said, each of those comes at a price (complexity, conflicting concepts, misaligned purposes, baggage built up in support of native environments with far less inherent support for distributed processing than Erlang/Elixir) that outweighs its value, so we might end up taking a simplified version of what they do and making it our own.
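For what it's worth, stock disterl can be pushed a little past its defaults before a new transport becomes unavoidable: a node started as hidden does not have its connections propagated into the full mesh, and distribution traffic can be wrapped in TLS via Erlang's `inet_tls_dist`. A minimal sketch follows; the node names, hosts, and file paths are illustrative assumptions, not anything from an existing deployment:

```elixir
# Sketch only: a hidden, TLS-distributed gateway node. Assumes shared
# cookies and mutually reachable hosts; all names/paths are illustrative.
#
# Start the node hidden, so its WAN links are not gossiped into a full
# mesh, and with TLS distribution instead of plain disterl:
#
#   elixir --name gw@eu1.example.com \
#          --erl "-hidden -proto_dist inet_tls -ssl_dist_optfile /etc/app/ssl_dist.conf" \
#          -S mix run --no-halt
#
# From that node, connect explicitly to one peer region; being hidden,
# neither side shares its neighbour list with the other:
true = Node.connect(:"gw@us1.example.com")

# Hidden peers are not in Node.list/0, only in Node.list(:hidden):
[] = Node.list()
[:"gw@us1.example.com"] = Node.list(:hidden)

# Cross-region calls can then be explicit rather than mesh-dependent:
:erpc.call(:"gw@us1.example.com", Kernel, :node, [])
```

This doesn't remove the heartbeat/net-tick overhead on each link, so it only delays the scaling problem; it just shows that "no full mesh" and "encrypted" are achievable without leaving disterl entirely.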

Thank you for participating in the conversation.

I'm especially open to suggestions as to a more suitable forum for this discussion than comments on an issue reported on a GitHub project. I remain hopeful that, without me insisting on it, starting a fresh project focussed on this will emerge as everyone's preferred way forward. If I end up doing it alone I won't make it a separate project on GitHub, but if we're all going to work on it together, the proven way is a project on GitHub. The only real question is whether this effort has enough in common with ProcessHub to qualify as either a sub-project or a new branch of the project, or whether it would be better as a standalone project.

Comments and feedback please.

alfetahe commented 2 weeks ago

@MarthinL , first of all, great analysis on your part!

It seems that you are thinking more broadly, whereas ProcessHub is primarily focused on the distribution of processes. Your use case likely requires the distribution of various types of data, not just processes.

Currently, I don't have any major plans for extra functionality—not because I don't want to, but because the existing functionality already satisfies my project needs. However, I would love to see the library evolve into something even bigger.

While I am interested in extending the library for a wider range of distributed use cases, my current priority is to enhance the stability of the existing functionality. However, you are welcome to submit any pull requests or create issues on GitHub if you would like to add new features to the ProcessHub library.

I will also update the guides and provide a more detailed explanation of the overall architecture in the near future.

Additionally, I recommend looking into riak-core, as it may be useful for your needs.

MarthinL commented 1 week ago

@alfetahe, thanks for your kindness.

I've seen Riak and the misalignment was fairly obvious to me, but I'll have another look at the core part on its own to see if there's some alignment there. The primary misalignment isn't what you'd expect, i.e. that it's KV/NoSQL based or that it exposes my app to a whole new class of risks regarding data storage, backups and maintenance operations at scale. No, the main thing I found last time round was right there in their own summary of what Riak aims for (and I quote):

> Riak KV is a distributed NoSQL database designed to deliver maximum data availability by distributing data across multiple servers. As long as your Riak KV client can reach one Riak server, it should be able to write data.

which means their fundamental drivers are availability and load balancing across nodes and/or clusters that logically hold the same data and offer the same capabilities. Those fundamentals might suit some, even many, I'm sure, but they don't suit me. The big point of departure is that my regions are not each other's equals: each holds a unique subset of the total database. Similar concepts can be achieved with advanced configuration involving xCluster in YugabyteDB, but that sits right at the limits of what it can do, and the baggage picked up by adopting that entire ecosystem to support a single application adds up to a net loss.

> Currently, I don't have any major plans for extra functionality—not because I don't want to, but because the existing functionality already satisfies my project needs. However, I would love to see the library evolve into something even bigger.

I wouldn't mind learning more about your project and the needs ProcessHub solves for it. I've been fascinated of late by the variance in how distributed-system approaches get applied in practice. I don't often see instances where people have had positive results from attempts to spread large linear workloads over multiple machines, and fewer still who have had success over multiple WAN-connected clusters. To date, the only successes of the latter kind I've seen evidence of have been where the distribution is already inherent in the workload itself. My case is like that: the workload of serving a web-based user is relatively small; the problem is that there are so many of them needing service at the same time, and their geographic spread means no place on earth is equally, or even suitably, close to all users.

Anyway, I think what I'll do is formulate the facility I need as a separate concern, so that the interface between the Phoenix/Ecto/LiveView application and the inter-cluster communication mechanism can be clearly defined. It's good to maintain clear lines between separate concerns, so that's not wasted effort. Then I'll take that spec and write a trivial implementation of it using the relevant parts of disterl and the clustering support libcluster provides, i.e. an initial version which defines the facility, how the application specifies what the nodes are, and what the cost and limitations of each direct "link" between nodes are.
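To make the shape of that spec concrete, the application-facing declaration might look something like the config fragment below. Every module name, key, and value here is a hypothetical illustration of the idea, not an existing API:

```elixir
# Hypothetical region/link specification for the inter-cluster facility.
# All names (RegionMesh, the keys, the hosts) are illustrative only.
config :my_app, RegionMesh,
  local_region: :eu_west_1,
  regions: [
    eu_west_1: [node: :"gw@eu1.example.com"],
    us_east_1: [node: :"gw@us1.example.com"],
    ap_south_1: [node: :"gw@ap1.example.com"]
  ],
  # Per-link cost and limits, so routing can stay a logical full mesh
  # without every region physically connecting to every other:
  links: [
    {:eu_west_1, :us_east_1, latency_ms: 80, max_inflight: 64},
    {:eu_west_1, :ap_south_1, latency_ms: 140, max_inflight: 32}
  ]
```

The point of declaring costs explicitly is that the trivial disterl-backed first implementation and any later WAN transport can honour the same contract, keeping the application code unchanged.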

I'll then put that up as a separate repository on GitHub and invite people like yourself to consider bringing the insights and experience from the work you've done into that project, or not. If nobody finds cause to join in and contribute, so be it; the project will be either stillborn or dormant until some time in the future when I'm done with my solo implementation and can publish it as something that could add value to others.

How does that sound?

alfetahe commented 1 week ago

> I'll then put that as a separate repository on GitHub and invite people like yourself to consider bringing your insights and experiences from the work you've done into that project, or not. If nobody finds cause to join in and contribute, so be it, then the project will be either still-born or dormant until some time in the future when I am done with my solo implementation and can publish it as something that could add value to others.

That sounds good!