Fixed with PR #194.
@jgirase validate the fix
@vpatil-tibco We tried with more than one replica for the RMS deployment; following are the observations which need attention:
1. When Replica=2, there are 2 pods running for RMS and one service, which is expected. However, when we log in to WebStudio, the user gets logged out automatically, with the error "Invalid API Key [workspace]" in one of the RMS pod logs. We tried logging in again and could see odd behavior in the browser; it seems only one pod is accepting the requests. In this case both pods are running independently without connecting to each other.
2. With Replica=2, the two RMS pods do not form a cluster, as discovery is not configured in our setup. To achieve this, we will need additional configuration, which will vary per cluster provider (AS2/Ignite/FTL). This will again need changes in the existing templates. So I would suggest we take up this enhancement in a future release if there is no immediate customer requirement. For now we can remove the replicas field from values.yaml, and this can be addressed later (a minimal sketch of the current behavior follows below).
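For reference, a minimal sketch of how observation 1 was exercised and confirmed; the release, chart, and service names are placeholders, and only the replicas field itself is the one discussed above:
# Hypothetical release/chart names; "replicas" is the values.yaml field mentioned above.
helm install <release> <rms-chart> --set replicas=2
# Both RMS pods land behind the single Service as endpoints, which is why requests
# alternate between two independent (non-clustered) instances.
kubectl get endpoints <rms-service>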
Kindly provide your suggestions.
Yes, @jgirase, making the RMS deployment work with multiple instances would require additional services for the discovery mechanism per cluster provider (AS2/Ignite/FTL), similar to how it's done for the BE application deployment.
Based on the earlier notes from @vpatil-tibco, my understanding is that a single RMS instance is sufficient for current needs. For now, we can go with a single RMS instance (i.e. no option to increase the replica count via values.yaml) and enhance it if required in the future.
Even before that, there is some issue with the current deployment topology, so we are changing the RMS deployment to leverage the existing BE charts. Hopefully that should cover the replica bit as well.
This feature is available in PR #213.
@rameshpolishetti @vpatil-tibco @vilasshelar
We have tried a couple of use cases of RMS with more than one replica, with different cluster configurations: Ignite + Ignite + SN and Ignite + Ignite + Mysql. However, we observed a few inconsistent issues, as mentioned below:
1. For the first-time deployment, the RMS pods get restarted, but they do come up after some time.
2. For Ignite + Ignite + SN, there are a couple of exceptions on both pods as the RMS agents crash. We are not sure about the exact cause, but when we increased the persistent volume storage size from 0.5Gi to 8Gi, the RMS pods started successfully. Here too we need to debug the root cause of the crash and, if required, update the storage size (see the sketch at the end of this comment).
3. After logging in to WebStudio, there is an NPE which is somewhat inconsistent, but other operations seem to work fine:
2021 Jul 01 13:22:47.406 GMT Z ignitecassandrav4-rmsagent-0 ERROR [$default.be.mt$.Worker.9] - [driver.http] Exception while invoking rule function
java.lang.NullPointerException
    at com.tibco.be.rms.functions.AuthorizationHelper.getACLManagerByProject(SourceFile:375)
    at com.tibco.be.rms.functions.AuthorizationHelper.ensureAccess(SourceFile:185)
    at be.gen.WebStudio.Core.RuleFunctions.Actions.WS_RF_FetchManagedProjectsList$oversizeName.WS_RF_FetchManagedProjectsList(WS_RF_FetchManagedProjectsList$oversizeName.java:71)
    at be.gen.WebStudio.Core.RuleFunctions.Controller.WS_RF_ActionFactory$oversizeName.WS_RF_ActionFactory(WS_RF_ActionFactory$oversizeName.java:219)
    at be.gen.WebStudio.Core.RuleFunctions.Controller.nullWS_RF_FrontControllerObject$.WS_RF_FrontController(nullWS_RF_FrontControllerObject$.java:66)
    at be.gen.WebStudio.Core.RuleFunctions.Controller.WS_RF_FrontController.invoke(WS_RF_FrontController.java:11)
    at com.tibco.cep.runtime.session.impl.RuleSessionImpl$1.doTxnWork(RuleSessionImpl.java:772)
    at com.tibco.cep.kernel.core.rete.BeTransaction.run(SourceFile:156)
    at com.tibco.cep.kernel.core.rete.BeTransaction.execute(SourceFile:101)
    at com.tibco.cep.runtime.session.impl.RuleSessionImpl.invokeFunction(RuleSessionImpl.java:783)
    at com.tibco.cep.runtime.session.impl.RuleSessionImpl.invokeFunction(RuleSessionImpl.java:747)
    at com.tibco.cep.runtime.session.impl.RuleSessionImpl.invokeFunction(RuleSessionImpl.java:743)
    at com.tibco.cep.driver.http.server.impl.servlet.PageFlowServlet$RuleFunctionExecTask.run(SourceFile:371)
    at com.tibco.cep.runtime.session.BEManagedThread.execute(BEManagedThread.java:457)
    at com.tibco.cep.runtime.session.BEManagedThread.run_from_queue(BEManagedThread.java:397)
    at com.tibco.cep.runtime.session.BEManagedThread.run(BEManagedThread.java:294)
4. For Ignite + Ignite + SN, we noticed inconsistency of artifacts: an RTI sometimes gets hot deployed, otherwise it says the artifact does not exist. It looks like data is not getting persisted correctly on the PV, which causes the other exceptions.
Find the attached logs and deployment scripts for one of the scenarios for reference: RMS_Ignite_SN.zip
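Regarding item 2, a hedged sketch of how the storage size change can be applied and verified; the persistence.storage value key and release/chart names are assumptions, only the 0.5Gi -> 8Gi change comes from the observation above:
# Assumed value key; only the sizes come from the test described above.
helm install <release> <rms-chart> --set persistence.storage=8Gi
# Confirm the capacity actually bound to the claims used by the RMS pods.
kubectl get pvc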
@jgirase - Can you share a run with debug logs?
Hi Vikram,
We ran tests with RMS again with the Ignite + Ignite + SN configuration to reproduce the NPE. However, I got stuck with an inconsistent-data issue across the two RMS pods, for which we have collected debug logs. Find the attached logs for reference. We are still working on the NPE issue and will update you with debug logs.
Note: Before deploying RMS/CCA, everything was deleted (PV/PVC and the data on EFS) to have a clean environment.
I don't see the null pointer issue you mentioned above in these debug logs. Also, what are the data inconsistencies you are seeing? Can you give the use-case comparison? Also, were these use cases validated on-prem before, with multiple RMS instances running, and have they been working as expected there?
Tried running RMS in Ignite+Ignite+Cassandra mode with replica count 2 and was able to reproduce the NPE issue. Steps performed are as follows:
Deployed FTL+Ignite+None with RMS Replica=2 using helm chart and faced issues while hot deploying the created artifacts.
Steps performed:
Observation:
An UndeclaredThrowable exception occurs on the WebStudio UI and the following exception occurs in the CCA Inference log on deploying an RTI:
2021 Jul 06 10:56:59.093 GMT Z cc1-inf-0 ERROR [RMI TCP Connection(6)-192.168.35.127] - [runtime.service] Error invoking MBean operation
java.lang.Exception: Rule template deploy source dir [/opt/tibco/be/6.1/rms/shared/CreditCardApplication] is not a directory
RTIs are present only in one RMS pod (ftlrms-rmsagent-1) and absent in the other pod (ftlrms-rmsagent-0):
root@ftlrms-rmsagent-1:/opt/tibco/be/6.1/examples/standard/WebStudio/CreditCardApplication/Rule_Templates# ls
Applicant_PreScreen.ruletemplate Offers.ruletemplateinstance PreScreenTemplateView.ruletemplateview Screen.ruletemplateinstance SpecialOffers.ruletemplate
root@ftlrms-rmsagent-0:/opt/tibco/be/6.1/examples/standard/WebStudio/CreditCardApplication/Rule_Templates# ls
Applicant_PreScreen.ruletemplate PreScreenTemplateView.ruletemplateview SpecialOffers.ruletemplate
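As a quick way to spot such divergence between replicas (pod names and path taken from the listings above; the diff-based check itself is just a suggestion, not part of the chart):
# Compare the Rule_Templates listings of both replicas; any difference means the
# artifact did not make it to the shared/persisted location.
RT_DIR=/opt/tibco/be/6.1/examples/standard/WebStudio/CreditCardApplication/Rule_Templates
diff <(kubectl exec ftlrms-rmsagent-0 -- ls "$RT_DIR") \
     <(kubectl exec ftlrms-rmsagent-1 -- ls "$RT_DIR")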
Attaching RMS debug logs for reference. RMS_FTLIgniteNone_Issue.zip
Hi Vikram,
Debug logs for the NPE issue have been attached by Priyanka; please refer to the same.
Regarding the data synchronisation issue across RMS pods, you can look at Aditty's comments. It is the same issue observed with the Ignite/FTL/AS2 cluster providers. When a newly created RTI is committed and approved, it is present in only one pod's webstudio folder. It is not even available on the rms-webstudio PV, so subsequent requests like deploy fail if the request gets directed to the pod where that artifact is not present.
For on-prem testing, we have tested with more than one RMS server on the same machine, so the data present in the webstudio or shared folder is the same for multiple RMS servers. We access those RMS instances on different ports in different browsers. Here on cloud, multiple RMS instances are accessed through a single service/single URL.
However, we have already started setting up RMS servers across different machines, using shared/mounted directories for the webstudio/shared folders, for on-prem testing. We will update you with the findings.
Kindly find the attached logs with AS2 cluster provider. RMS_AS_SN.zip
So it seems like there are some synchronization issues around the PVs that are mounted across both pods: not all changes are available on both. This could also be the reason for the null pointer. The log below indicates that:
2021 Jul 06 06:43:05.640 GMT Z ignitecassdebug-rmsinf-1 DEBUG [main] - [RuleFunctions.Startup.WS_RF_ValidateACLConfig] [WS-Inference-class] Number of managed projects for url[/opt/tibco/be/6.1/examples/standard/WebStudio] - 0.0
There are no projects found in this path, so no ACL data can be loaded and hence the above null pointer.
We will run more tests to identify the synchronization issues around the PV mounts. But meanwhile, please confirm that all cases for single-instance RMS are working as expected.
RMS folder synchronization issues are observed when the mount path contains symbolic links. For instance, the RMS shared mount path "/opt/tibco/be/latest/rms/shared" contains "latest", which is a symlink.
It has been fixed by using the real folder path (without the symlink) for mounting the RMS folders (e.g. /opt/tibco/be/6.1/rms/shared). This requires capturing the BE short version via values.yaml.
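To illustrate (the pod name is a placeholder): resolving the symlinked path shows the real, versioned directory that should be used as the mount path.
# "latest" is a symlink, so mounting the PV at the symlinked path breaks folder sync;
# readlink resolves it to the real path that the chart should mount instead.
kubectl exec <rms-pod> -- readlink -f /opt/tibco/be/latest/rms/shared
# expected: /opt/tibco/be/6.1/rms/shared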
@jgirase please take the latest changes to validate this.
Pulled the latest changes from the same branch, feature-helm-deploy-rms. I could still see the NPE on one of the RMS pods in the case of replica=2, with the AS2 as well as the Ignite cluster provider. The persistence option is configured as Shared Nothing in both cases. I could not proceed with the remaining WebStudio operations, as a 'null' error comes up continuously on the UI and the project list disappears intermittently.
Note: Used dynamic provisioning for PV/PVC
Find the debug logs for both tests. RMS_Ignite_SN_NPE_debug_logs.zip RMS_AS2_SN_NPE_Debug_logs.zip
Here the issue is that the PV mount takes some time to initialize; as a result, when the RMS engine comes up, the ACL initialization does not happen due to missing contents in the folder.
2021 Jul 06 06:43:05.640 GMT Z ignitecassdebug-rmsinf-1 DEBUG [main] - [RuleFunctions.Startup.WS_RF_ValidateACLConfig] [WS-Inference-class] Number of managed projects for url[/opt/tibco/be/6.1/examples/standard/WebStudio] - 0.0
We will need to check if some configuration can be tweaked so that the PV is initialized right away when the engine starts.
Worst case, we will need to do some kind of code patch to do a continuous lookup of the folder contents.
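A rough sketch of the kind of wait that could be added (for example as an init container command) before the engine starts; this only illustrates the idea, it is not the actual patch:
# Block startup until the mounted WebStudio folder is populated, so the ACL
# initialization does not run against an empty directory.
WS_DIR=/opt/tibco/be/6.1/examples/standard/WebStudio
until [ -d "$WS_DIR" ] && [ -n "$(ls -A "$WS_DIR" 2>/dev/null)" ]; do
  echo "waiting for $WS_DIR to be populated..."
  sleep 5
done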
@jgirase - Focus on single instance RMS, if anything is pending. We want to do a release EOW.
Kindly refer to the steps below for further debugging of the NPE:
1. Install a helm release for the RMS and CCA application with dynamic provisioning and the following configuration: Cluster: AS2, Cache: AS2, Persistence: SharedNothing, Provisioning: Dynamic, Replica Count: RMS->2, CCA Cache->1, CCA Inference->1, rmsWebstudio: true
2. Wait till both RMS pods are started successfully
3. Copy the WebStudio folder to any one RMS pod from the local machine (see the sketch after these steps)
4. Log in to WebStudio
5. Check out a project
6. Check the logs for both RMS pods; the NPE will be observed on one of the pods.
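For step 3, a minimal sketch of copying the folder into one of the pods (the pod name and local source path are placeholders; the destination path matches the one seen in the logs above):
# Copy the local WebStudio folder into one RMS pod so the managed projects exist on the mounted volume.
kubectl cp ./WebStudio <rms-pod-0>:/opt/tibco/be/6.1/examples/standard/WebStudio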
Data seems to be synchronised now between both pods, as a directory/file created inside one pod reflects in the other pod immediately with the latest changes, which was not the case earlier.
For a single RMS instance we have verified a few configurations and so far no issues have been observed. We will continue to validate the remaining RMS configurations.
@vilasshelar @jgirase @AdittyThakare @priyanka-nawal I have gone through all the attached logs and I can see that the RMS security folder is not mounted. Could you verify by running the same test for all scenarios with rmsSecurity: true and rmsWebstudio: true? By default the security folder should be mounted across multiple replicas.
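For reference, a hedged sketch of re-running with both folders mounted; the exact value paths in the chart may differ (e.g. they may live under persistence.), and only the two flag names come from the comment above:
# Assumed flag locations; enable mounting of both the webstudio and security folders.
helm upgrade <release> <rms-chart> --set rmsWebstudio=true --set rmsSecurity=true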
@vpatil-tibco @LakshmiMekala With the latest code, tried RMS with multiple replicas for the following 3 configurations: a) AS2+AS2+None, b) FTL+Ignite+Mysql, c) AS2+AS2+SN
Steps are as follows:
Observations:
From the above tests, it looks like the NPE issues occur when the UI is accessed without port forwarding.
Attaching logs for all 3 flows. AS2_+None.zip AS2+SN.zip FTL+Mysql.zip
@AdittyThakare To get rid of the NPE issue: once you deploy RMS with multiple replicas and the RMS pods get into a running state, copy the artifacts of the webstudio and security folders, then restart the RMS pods and try to log in and perform actions.
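A minimal sketch of that workaround (the pod/statefulset names, local source paths, and the security mount path are placeholders; the WebStudio destination path is the one used earlier in this thread):
# Pre-populate the mounted folders on one running replica...
kubectl cp ./WebStudio <rms-pod>:/opt/tibco/be/6.1/examples/standard/WebStudio
kubectl cp ./security <rms-pod>:<rms-security-mount-path>
# ...then restart so every replica starts with the project/ACL data already present.
kubectl rollout restart statefulset <rms-statefulset>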
Executed the following flows again with RMS replica=2, and the NPE was not encountered this time: a. AS2+AS2+SN -> Dynamic provisioning, b. FTL+Ignite+Mysql -> Static provisioning, c. Ignite+Ignite+None -> Dynamic provisioning
Steps followed for Static and Dynamic provisioning were as follows: Dynamic Provisioning
Static Provisioning
Attaching logs for reference: MultipleReplica_19thJuly.zip
@LakshmiMekala Kindly document the NPE issue and the steps required to get rid of it in the Wiki, for customers' reference.
@AdittyThakare thank you for validating and sharing the summary. So the fix - "pre-populating rms-webstudio and mounting the rms-security folder" - is working fine.
As per the summary, removed persistence.rmsSecurity from values.yaml via PR #226, since a single flag is enough to control whether to mount the rms-webstudio & rms-security folders or not. (Note: Re-validation is not required.)
@jgirase @AdittyThakare please close the issue if you are done with validating all scenarios.
Closing it as we are done with mentioned use cases.
Currently, with the RMS helm deployment, there is no configuration for replicas. We can add it for high availability of the RMS server, i.e. a cluster of RMS pods.