Fixed with PR #194.
@jgirase validate the fix
@vpatil-tibco We tried with more than one replica for the RMS deployment; following are the observations which need attention:
1. When Replica=2, there are 2 pods running for RMS and one service, which is expected. However, when we log in to WebStudio, the user gets logged out automatically, with the error "Invalid API Key [workspace]" in one of the RMS pod logs. We tried logging in again and could see odd behavior in the browser; it seems only one pod is accepting the requests. In this case both pods are running independently without connecting to each other.
2. With Replica=2, the two RMS pods do not form a cluster, as discovery is not configured in our setup. To achieve this, we will need additional configuration, which will vary per cluster provider (AS2/Ignite/FTL). This will again need changes in the existing templates. So I would suggest we take up this enhancement in a future release if there is no immediate customer requirement. For now we can remove the replicas field from values.yaml, and this can be addressed later (a minimal sketch of the current behavior follows below).
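For reference, a minimal sketch of how observation 1 was exercised and confirmed; the release, chart, and service names are placeholders, and only the replicas field itself is the one discussed above:
# Hypothetical release/chart names; "replicas" is the values.yaml field mentioned above.
helm install <release> <rms-chart> --set replicas=2
# Both RMS pods land behind the single Service as endpoints, which is why requests
# alternate between two independent (non-clustered) instances.
kubectl get endpoints <rms-service>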
Kindly provide your suggestions.
Yes, @jgirase, making the RMS deployment work with multiple instances would require additional services for the discovery mechanism per cluster provider (AS2/Ignite/FTL), similar to how it's done for the BE application deployment.
Based on the earlier notes from @vpatil-tibco, my understanding is that a single RMS instance is sufficient for current needs. For now, we can go with a single RMS instance (i.e. no option to increase the replica count via values.yaml) and enhance it if required in the future.
Even before that, there is some issue with the current deployment topology, so we are changing the RMS deployment to leverage the existing BE charts. Hopefully that should cover the replica bit as well.
This feature is available in PR #213.
@rameshpolishetti @vpatil-tibco @vilasshelar
We have tried a couple of use cases of RMS with more than one replica, with different cluster configurations: Ignite + Ignite + SN and Ignite + Ignite + Mysql. However, we observed a few inconsistent issues, as mentioned below:
1. For the first-time deployment, the RMS pods get restarted, but they do come up after some time.
2. For Ignite + Ignite + SN, there are a couple of exceptions on both pods as the RMS agents crash. We are not sure about the exact cause, but when we increased the persistent volume storage size from 0.5Gi to 8Gi, the RMS pods started successfully. Here too we need to debug the root cause of the crash and, if required, update the storage size (see the sketch at the end of this comment).
3. After logging in to WebStudio, there is an NPE which is somewhat inconsistent, but other operations seem to work fine:
2021 Jul 01 13:22:47.406 GMT Z ignitecassandrav4-rmsagent-0 ERROR [$default.be.mt$.Worker.9] - [driver.http] Exception while invoking rule function
java.lang.NullPointerException
    at com.tibco.be.rms.functions.AuthorizationHelper.getACLManagerByProject(SourceFile:375)
    at com.tibco.be.rms.functions.AuthorizationHelper.ensureAccess(SourceFile:185)
    at be.gen.WebStudio.Core.RuleFunctions.Actions.WS_RF_FetchManagedProjectsList$oversizeName.WS_RF_FetchManagedProjectsList(WS_RF_FetchManagedProjectsList$oversizeName.java:71)
    at be.gen.WebStudio.Core.RuleFunctions.Controller.WS_RF_ActionFactory$oversizeName.WS_RF_ActionFactory(WS_RF_ActionFactory$oversizeName.java:219)
    at be.gen.WebStudio.Core.RuleFunctions.Controller.nullWS_RF_FrontControllerObject$.WS_RF_FrontController(nullWS_RF_FrontControllerObject$.java:66)
    at be.gen.WebStudio.Core.RuleFunctions.Controller.WS_RF_FrontController.invoke(WS_RF_FrontController.java:11)
    at com.tibco.cep.runtime.session.impl.RuleSessionImpl$1.doTxnWork(RuleSessionImpl.java:772)
    at com.tibco.cep.kernel.core.rete.BeTransaction.run(SourceFile:156)
    at com.tibco.cep.kernel.core.rete.BeTransaction.execute(SourceFile:101)
    at com.tibco.cep.runtime.session.impl.RuleSessionImpl.invokeFunction(RuleSessionImpl.java:783)
    at com.tibco.cep.runtime.session.impl.RuleSessionImpl.invokeFunction(RuleSessionImpl.java:747)
    at com.tibco.cep.runtime.session.impl.RuleSessionImpl.invokeFunction(RuleSessionImpl.java:743)
    at com.tibco.cep.driver.http.server.impl.servlet.PageFlowServlet$RuleFunctionExecTask.run(SourceFile:371)
    at com.tibco.cep.runtime.session.BEManagedThread.execute(BEManagedThread.java:457)
    at com.tibco.cep.runtime.session.BEManagedThread.run_from_queue(BEManagedThread.java:397)
    at com.tibco.cep.runtime.session.BEManagedThread.run(BEManagedThread.java:294)
4. For Ignite + Ignite + SN, we noticed inconsistency of artifacts: an RTI sometimes gets hot deployed, otherwise it says the artifact does not exist. It looks like data is not getting persisted correctly on the PV, which causes the other exceptions.
Find the attached logs and deployment scripts for one of the scenarios for reference: RMS_Ignite_SN.zip
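Regarding item 2, a hedged sketch of how the storage size change can be applied and verified; the persistence.storage value key and release/chart names are assumptions, only the 0.5Gi -> 8Gi change comes from the observation above:
# Assumed value key; only the sizes come from the test described above.
helm install <release> <rms-chart> --set persistence.storage=8Gi
# Confirm the capacity actually bound to the claims used by the RMS pods.
kubectl get pvc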
@jgirase - Can you share a run with debug logs?
Hi Vikram,
We ran tests with RMS again with the Ignite + Ignite + SN configuration to reproduce the NPE. However, I got stuck with an inconsistent-data issue across the two RMS pods, for which we have collected debug logs. Find the attached logs for reference. We are still working on the NPE issue and will update you with debug logs.
Note: Before deploying RMS/CCA, everything was deleted (PV/PVC and the data on EFS) to have a clean environment.
I don't see the null pointer issue you mentioned above in these debug logs. Also, what are the data inconsistencies you are seeing? Can you give the use-case comparison? Also, were these use cases validated on-prem before, with multiple RMS instances running, and have they been working as expected there?
Tried running RMS in Ignite+Ignite+Cassandra mode with replica count 2 and was able to reproduce the NPE issue. Steps performed are as follows:
Deployed FTL+Ignite+None with RMS Replica=2 using helm chart and faced issues while hot deploying the created artifacts.
Steps performed:
Observation:
An UndeclaredThrowable exception occurs on the WebStudio UI and the following exception occurs in the CCA Inference log on deploying an RTI:
2021 Jul 06 10:56:59.093 GMT Z cc1-inf-0 ERROR [RMI TCP Connection(6)-192.168.35.127] - [runtime.service] Error invoking MBean operation
java.lang.Exception: Rule template deploy source dir [/opt/tibco/be/6.1/rms/shared/CreditCardApplication] is not a directory
RTIs are present only in one RMS pod (ftlrms-rmsagent-1) and absent in the other pod (ftlrms-rmsagent-0):
root@ftlrms-rmsagent-1:/opt/tibco/be/6.1/examples/standard/WebStudio/CreditCardApplication/Rule_Templates# ls
Applicant_PreScreen.ruletemplate Offers.ruletemplateinstance PreScreenTemplateView.ruletemplateview Screen.ruletemplateinstance SpecialOffers.ruletemplate
root@ftlrms-rmsagent-0:/opt/tibco/be/6.1/examples/standard/WebStudio/CreditCardApplication/Rule_Templates# ls
Applicant_PreScreen.ruletemplate PreScreenTemplateView.ruletemplateview SpecialOffers.ruletemplate
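As a quick way to spot such divergence between replicas (pod names and path taken from the listings above; the diff-based check itself is just a suggestion, not part of the chart):
# Compare the Rule_Templates listings of both replicas; any difference means the
# artifact did not make it to the shared/persisted location.
RT_DIR=/opt/tibco/be/6.1/examples/standard/WebStudio/CreditCardApplication/Rule_Templates
diff <(kubectl exec ftlrms-rmsagent-0 -- ls "$RT_DIR") \
     <(kubectl exec ftlrms-rmsagent-1 -- ls "$RT_DIR")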
Attaching RMS debug logs for reference. RMS_FTLIgniteNone_Issue.zip
Hi Vikram,
Debug logs for the NPE issue have been attached by Priyanka; please refer to the same.
Regarding the data synchronisation issue across RMS pods, you can look at Aditty's comments. It is the same issue observed with the Ignite/FTL/AS2 cluster providers. When a newly created RTI is committed and approved, it is present in only one pod's webstudio folder. It is not even available on the rms-webstudio PV, so subsequent requests like deploy fail if the request gets directed to the pod where that artifact is not present.
For on-prem testing, we have tested with more than one RMS server on the same machine, so the data present in the webstudio or shared folder is the same for multiple RMS servers. We access those RMS instances on different ports in different browsers. Here on cloud, multiple RMS instances are accessed through a single service/single URL.
However, we have already started setting up RMS servers across different machines, using shared/mounted directories for the webstudio/shared folders, for on-prem testing. We will update you with the findings.
Kindly find the attached logs with AS2 cluster provider. RMS_AS_SN.zip
So it seems like there are some synchronization issues around the PVs that are mounted across both pods: not all changes are available on both. This could also be the reason for the null pointer. The log below indicates that:
2021 Jul 06 06:43:05.640 GMT Z ignitecassdebug-rmsinf-1 DEBUG [main] - [RuleFunctions.Startup.WS_RF_ValidateACLConfig] [WS-Inference-class] Number of managed projects for url[/opt/tibco/be/6.1/examples/standard/WebStudio] - 0.0
There are no projects found in this path, so no ACL data can be loaded and hence the above null pointer.
We will run more tests to identify the synchronization issues around the PV mounts. But meanwhile, please confirm that all cases for single-instance RMS are working as expected.
RMS folder synchronization issues are observed when the mount path contains symbolic links. For instance, the RMS shared mount path "/opt/tibco/be/latest/rms/shared" contains "latest", which is a symlink.
It has been fixed by using the real folder path (without the symlink) for mounting the RMS folders (e.g. /opt/tibco/be/6.1/rms/shared). This requires capturing the BE short version via values.yaml.
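To illustrate (the pod name is a placeholder): resolving the symlinked path shows the real, versioned directory that should be used as the mount path.
# "latest" is a symlink, so mounting the PV at the symlinked path breaks folder sync;
# readlink resolves it to the real path that the chart should mount instead.
kubectl exec <rms-pod> -- readlink -f /opt/tibco/be/latest/rms/shared
# expected: /opt/tibco/be/6.1/rms/shared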
@jgirase please take the latest changes to validate this.
Pulled the latest changes from the same branch, feature-helm-deploy-rms. I could still see the NPE on one of the RMS pods in the case of replica=2, with the AS2 as well as the Ignite cluster provider. The persistence option is configured as Shared Nothing in both cases. I could not proceed with the remaining WebStudio operations, as a 'null' error comes up continuously on the UI and the project list disappears intermittently.
Note: Used dynamic provisioning for PV/PVC
Find the debug logs for both tests. RMS_Ignite_SN_NPE_debug_logs.zip RMS_AS2_SN_NPE_Debug_logs.zip
Here the issue is that the PV mount takes some time to initialize; as a result, when the RMS engine comes up, the ACL initialization does not happen due to missing contents in the folder.
2021 Jul 06 06:43:05.640 GMT Z ignitecassdebug-rmsinf-1 DEBUG [main] - [RuleFunctions.Startup.WS_RF_ValidateACLConfig] [WS-Inference-class] Number of managed projects for url[/opt/tibco/be/6.1/examples/standard/WebStudio] - 0.0
We will need to check if some configuration can be tweaked so that the PV is initialized right away when the engine starts.
Worst case, we will need to do some kind of code patch to do a continuous lookup of the folder contents.
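A rough sketch of the kind of wait that could be added (for example as an init container command) before the engine starts; this only illustrates the idea, it is not the actual patch:
# Block startup until the mounted WebStudio folder is populated, so the ACL
# initialization does not run against an empty directory.
WS_DIR=/opt/tibco/be/6.1/examples/standard/WebStudio
until [ -d "$WS_DIR" ] && [ -n "$(ls -A "$WS_DIR" 2>/dev/null)" ]; do
  echo "waiting for $WS_DIR to be populated..."
  sleep 5
done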
@jgirase - Focus on single instance RMS, if anything is pending. We want to do a release EOW.
Kindly refer to the steps below for further debugging of the NPE:
1. Install a helm release for the RMS and CCA application with dynamic provisioning and the following configuration: Cluster: AS2, Cache: AS2, Persistence: SharedNothing, Provisioning: Dynamic, Replica Count: RMS->2, CCA Cache->1, CCA Inference->1, rmsWebstudio: true
2. Wait till both RMS pods are started successfully
3. Copy the WebStudio folder to any one RMS pod from the local machine (see the sketch after these steps)
4. Log in to WebStudio
5. Check out a project
6. Check the logs for both RMS pods; the NPE will be observed on one of the pods.
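For step 3, a minimal sketch of copying the folder into one of the pods (the pod name and local source path are placeholders; the destination path matches the one seen in the logs above):
# Copy the local WebStudio folder into one RMS pod so the managed projects exist on the mounted volume.
kubectl cp ./WebStudio <rms-pod-0>:/opt/tibco/be/6.1/examples/standard/WebStudio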
Data seems to be synchronised now between both pods, as a directory/file created inside one pod reflects in the other pod immediately with the latest changes, which was not the case earlier.
For a single RMS instance we have verified a few configurations and so far no issues have been observed. We will continue to validate the remaining RMS configurations.
@vilasshelar @jgirase @AdittyThakare @priyanka-nawal I have gone through all the attached logs and I can see that the RMS security folder is not mounted. Could you verify by running the same test for all scenarios with rmsSecurity: true and rmsWebstudio: true? By default the security folder should be mounted across multiple replicas.
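For reference, a hedged sketch of re-running with both folders mounted; the exact value paths in the chart may differ (e.g. they may live under persistence.), and only the two flag names come from the comment above:
# Assumed flag locations; enable mounting of both the webstudio and security folders.
helm upgrade <release> <rms-chart> --set rmsWebstudio=true --set rmsSecurity=true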
@vpatil-tibco @LakshmiMekala With the latest code, tried RMS with multiple replicas for the following 3 configurations: a) AS2+AS2+None, b) FTL+Ignite+Mysql, c) AS2+AS2+SN
Steps are as follows:
Observations:
From the above tests, it looks like the NPE issues occur when the UI is accessed without port forwarding.
Attaching logs for all 3 flows. AS2_+None.zip AS2+SN.zip FTL+Mysql.zip
@AdittyThakare To get rid of the NPE issue: once you deploy RMS with multiple replicas and the RMS pods get into a running state, copy the artifacts of the webstudio and security folders, then restart the RMS pods and try to log in and perform actions.
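A minimal sketch of that workaround (the pod/statefulset names, local source paths, and the security mount path are placeholders; the WebStudio destination path is the one used earlier in this thread):
# Pre-populate the mounted folders on one running replica...
kubectl cp ./WebStudio <rms-pod>:/opt/tibco/be/6.1/examples/standard/WebStudio
kubectl cp ./security <rms-pod>:<rms-security-mount-path>
# ...then restart so every replica starts with the project/ACL data already present.
kubectl rollout restart statefulset <rms-statefulset>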
Executed the following flows again with RMS replica=2, and the NPE was not encountered this time: a. AS2+AS2+SN -> Dynamic provisioning, b. FTL+Ignite+Mysql -> Static provisioning, c. Ignite+Ignite+None -> Dynamic provisioning
Steps followed for Static and Dynamic provisioning were as follows: Dynamic Provisioning
Static Provisioning
Attaching logs for reference: MultipleReplica_19thJuly.zip
@LakshmiMekala Kindly document the NPE issue and the steps required to get rid of it in the Wiki, for customers' reference.
@AdittyThakare thank you for validating and sharing the summary. So the fix - "pre-populating rms-webstudio and mounting the rms-security folder" - is working fine.
As per the summary, removed persistence.rmsSecurity from values.yaml via PR #226, since a single flag is enough to control whether to mount the rms-webstudio & rms-security folders or not. (Note: Re-validation is not required.)
@jgirase @AdittyThakare please close the issue if you are done with validating all scenarios.
Closing it as we are done with mentioned use cases.
Currently, with the RMS helm deployment, there is no configuration for replicas. We can add it for high availability of the RMS server, i.e. a cluster of RMS pods.