Open sivakumargollu opened 7 years ago
ActiveGrid Application design to attain scalability.
In the ActiveGrid application, the site-service, scaling-service and workflow services share access to in-memory state. The present development model addresses neither this shared-memory issue nor the fencing problem (node-failure management).
Feasible approaches to solve the shared-memory issue.
Each service maintains its own in-memory data for faster access. For instance, the workflow service holds the currently executing workflows in a map data structure. Each entry in the map is a tuple of a workflow-id and its present execution context. The status of a workflow changes as its execution proceeds. As the application is rewritten in Scala, this status is being moved to the Neo4j database, so any change in status will be represented by a flag in the database instead of an in-memory value.
The above describes the existing ActiveGrid development model with respect to the workflow service. With single-node execution there seems to be no consistency issue with this model. But if the ActiveGrid application is deployed on a cluster across multiple nodes, non-transactional access to data in Neo4j might lead to redundant execution of the same task. For example, consistency in the execution of a workflow is not guaranteed, i.e. the status of a workflow running on one server is not necessarily known to the other servers, so a later request to execute the same workflow may start it again.
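The redundant execution described above is a check-then-act race: two servers read the status before either one writes it. The sketch below illustrates this with an in-memory stand-in for the shared status data. It does not use Neo4j, and `StatusStore`, `racy_starts` and `atomic_starts` are hypothetical names introduced only for illustration. The lock plays the role a database transaction would play.

```python
import threading


class StatusStore:
    """In-memory stand-in for the shared status data (hypothetical, not Neo4j)."""

    def __init__(self):
        self._status = {}
        self._lock = threading.Lock()

    def read(self, workflow_id):
        return self._status.get(workflow_id)

    def write(self, workflow_id, status):
        self._status[workflow_id] = status

    def claim(self, workflow_id):
        """Atomically mark a workflow RUNNING; True only for the first caller."""
        with self._lock:  # read and write happen as one unit, like a transaction
            if self._status.get(workflow_id) == "RUNNING":
                return False
            self._status[workflow_id] = "RUNNING"
            return True


def racy_starts(store, workflow_id):
    """Two servers both read before either writes, so both start the workflow."""
    seen_by_a = store.read(workflow_id)
    seen_by_b = store.read(workflow_id)
    starts = 0
    for seen in (seen_by_a, seen_by_b):
        if seen != "RUNNING":
            store.write(workflow_id, "RUNNING")
            starts += 1
    return starts


def atomic_starts(store, workflow_id, servers=8):
    """Several server threads race on claim(); count how many actually start."""
    results = []

    def server():
        if store.claim(workflow_id):
            results.append(True)

    threads = [threading.Thread(target=server) for _ in range(servers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(results)
```

With the non-atomic interleaving both servers start the workflow, while the atomic claim lets exactly one through regardless of how many servers compete.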
This problem can be solved with Neo4j transactions, under the following assumptions.
OR
In this approach, each service is treated as an independent service. Every independent service (at least those that need special care) must be hosted on its own machine. A proxy or load-balancing server intercepts all incoming requests and forwards them to the respective services.
Each service is independent of the remaining services. There should be a central database for access to the common set of data.
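As a rough illustration of the proxy idea, the sketch below maps a request path to the machine hosting the matching service. The path prefixes, host names and the `Proxy` class are all assumptions made for this example; a real deployment would use an actual reverse proxy or load balancer.

```python
# Hypothetical mapping of path prefix -> machine hosting that service.
SERVICE_HOSTS = {
    "/site": "site-host:9000",
    "/scaling": "scaling-host:9000",
    "/workflow": "workflow-host:9000",
}


class Proxy:
    """Forwards each incoming request path to the host of its service."""

    def __init__(self, service_hosts):
        self._service_hosts = dict(service_hosts)

    def route(self, path):
        # Match on whole path segments so "/sitemap" is not sent to "/site".
        for prefix, host in self._service_hosts.items():
            if path == prefix or path.startswith(prefix + "/"):
                return host
        raise LookupError("no service registered for path: " + path)


proxy = Proxy(SERVICE_HOSTS)
```

Because every request for a given service lands on one machine, that service's in-memory state is never shared across nodes; only the central database is common.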
Drawbacks.
Akka-cluster approach.
//Edit
This design addresses the following issues.
1. Node failure management. There are multiple use-cases where node failure has to be addressed.
A. To keep critical service requests such as auto-scaling, workflow-execution and site-creation processing, the request status will be maintained in the Neo4j database. Each request to these services will be assigned a fixed time interval. If the server fails before the interval elapses, the next request to the same service will change the status after checking the timestamp of the last request. That request will wait out the remaining time if required.
B. If the Neo4j database itself fails, the cluster should respond to all incoming requests with a failure status without proceeding further.
C. Any in-progress request must be rolled back or modified according to how far its execution has progressed.
Node failure must be notified to the remaining participants of the cluster. If we maintain the Neo4j cluster with a master-slave architecture, then when the main Neo4j server shuts down unexpectedly, one of the slaves will become master and operation execution will proceed.
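The timestamp check in point A can be read as a lease: a request owns the work for a fixed interval, and a later request may take over only once that interval has elapsed. Below is a minimal sketch of that reading, using an in-memory table as a stand-in for the status kept in Neo4j. The names `RequestRecord`, `RequestTable`, the status strings and the 30-second default are assumptions for illustration, not part of the existing design.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    owner: str            # server currently processing the request
    started_at: float     # timestamp of the last request
    lease_seconds: float  # fixed interval assigned to the request
    status: str = "IN_PROGRESS"


class RequestTable:
    """In-memory stand-in for request status stored in the database."""

    def __init__(self):
        self._records = {}

    def try_acquire(self, key, server, now, lease_seconds=30.0):
        """Return (outcome, seconds_to_wait) for a new request on `key`."""
        rec = self._records.get(key)
        if rec is None or rec.status != "IN_PROGRESS":
            self._records[key] = RequestRecord(server, now, lease_seconds)
            return ("ACQUIRED", 0.0)
        remaining = rec.started_at + rec.lease_seconds - now
        if remaining > 0:
            # The previous owner may still be alive: wait out the interval.
            return ("WAIT", remaining)
        # Interval elapsed without completion: assume the owner failed.
        self._records[key] = RequestRecord(server, now, lease_seconds)
        return ("TAKEN_OVER", 0.0)

    def complete(self, key, server):
        """Mark the request done, but only if `server` still owns it."""
        rec = self._records.get(key)
        if rec is not None and rec.owner == server and rec.status == "IN_PROGRESS":
            rec.status = "DONE"
            return True
        return False
```

The ownership check in `complete` is what provides fencing in this sketch: a server that was presumed dead and comes back late cannot overwrite the status set by the server that took over.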
2. Operation status issue.
If a half-completed request leads to an inconsistent state at AWS, i.e. a failure while executing commands, deploying the application, or executing terminal scripts, then it must be processed again by one of the servers in the cluster, restarting the request from the beginning.
3. Sharing data between multiple nodes. We will use CRDTs (conflict-free replicated data types), a concept available in Akka Cluster, to avoid the data-sharing issue.
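To show why a CRDT avoids the data-sharing issue, here is a small last-writer-wins map written from scratch. It does not use Akka, and `LWWMap` here is a hypothetical illustration class, assuming each write carries a timestamp and that ties are broken by node id. The point is that replicas can merge in any order, any number of times, and still converge.

```python
class LWWMap:
    """Last-writer-wins map: a simple state-based CRDT (illustrative only)."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.entries = {}  # key -> (timestamp, node_id, value)

    def put(self, key, value, timestamp):
        self._apply(key, (timestamp, self.node_id, value))

    def _apply(self, key, candidate):
        current = self.entries.get(key)
        # Newer timestamp wins; equal timestamps are ordered by node id.
        if current is None or candidate[:2] > current[:2]:
            self.entries[key] = candidate

    def merge(self, other):
        """Fold another replica's state into this one."""
        for key, entry in other.entries.items():
            self._apply(key, entry)

    def get(self, key):
        entry = self.entries.get(key)
        return None if entry is None else entry[2]


# Two nodes update workflow status independently, then exchange state.
node_a = LWWMap("node-a")
node_b = LWWMap("node-b")
node_a.put("wf-1", "RUNNING", timestamp=1)
node_b.put("wf-1", "COMPLETED", timestamp=2)
node_b.put("wf-2", "RUNNING", timestamp=1)
node_a.merge(node_b)
node_b.merge(node_a)
```

After the exchange both replicas hold the same entries without any coordination. The trade-off is that last-writer-wins silently discards the older of two concurrent writes, so it suits status flags better than data where every update must be preserved.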
Node failure management.
With distributed data and load-balancing in cluster.
This issue is meant to examine these design problems (shared memory and node-failure fencing) and to settle on a model that overcomes the existing flaws.