OxalisCommunity / Oxalis-AS4

PEPPOL AS4 pMode plugin for Oxalis
32 stars 25 forks source link

Exception while receiving response from sending message #185

Closed HegeErdal closed 2 years ago

HegeErdal commented 2 years ago

Using Oxalis 5.1.0, AS4 in our production enviroment.

We sometimes get an error in our production enviroment while receiving the response from PEPPOL after sending a message. We later get a confirmation that the message has in fact been delivered, so it's not an error connected to actually delivering the message but more an issue connected to receiving the response.

Caused by: javax.xml.ws.soap.SOAPFaultException: Trying to add an already initialized cache. If you are adding a decorated cache, use CacheManager.addDecoratedCache(Ehcache decoratedCache) instead. at org.apache.cxf.jaxws.DispatchImpl.mapException(DispatchImpl.java:285) at org.apache.cxf.jaxws.DispatchImpl.invoke(DispatchImpl.java:330) at org.apache.cxf.jaxws.DispatchImpl.invoke(DispatchImpl.java:241) at network.oxalis.as4.outbound.As4MessageSender.invoke(As4MessageSender.java:105) ... 115 common frames omitted Caused by: net.sf.ehcache.CacheException: Trying to add an already initialized cache. If you are adding a decorated cache, use CacheManager.addDecoratedCache(Ehcache decoratedCache) instead. at net.sf.ehcache.CacheManager.addCacheNoCheck(CacheManager.java:1407) at net.sf.ehcache.CacheManager.addCacheIfAbsent(CacheManager.java:1931) at org.apache.cxf.ws.security.tokenstore.EHCacheTokenStore.(EHCacheTokenStore.java:68) at org.apache.cxf.ws.security.tokenstore.EHCacheTokenStoreFactory.newTokenStore(EHCacheTokenStoreFactory.java:37) at org.apache.cxf.ws.security.tokenstore.TokenStoreUtils.getTokenStore(TokenStoreUtils.java:57) at org.apache.cxf.ws.security.wss4j.WSS4JInInterceptor.getCallback(WSS4JInInterceptor.java:680) at org.apache.cxf.ws.security.wss4j.WSS4JInInterceptor.getCallback(WSS4JInInterceptor.java:623) at org.apache.cxf.ws.security.wss4j.WSS4JInInterceptor.handleMessageInternal(WSS4JInInterceptor.java:264) at org.apache.cxf.ws.security.wss4j.WSS4JInInterceptor.handleMessage(WSS4JInInterceptor.java:211) at org.apache.cxf.ws.security.wss4j.PolicyBasedWSS4JInInterceptor.handleMessage(PolicyBasedWSS4JInInterceptor.java:123) at org.apache.cxf.ws.security.wss4j.PolicyBasedWSS4JInInterceptor.handleMessage(PolicyBasedWSS4JInInterceptor.java:76) at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308) at org.apache.cxf.endpoint.ClientImpl.onMessage(ClientImpl.java:829) at org.apache.cxf.transport.http.HTTPConduit$WrappedOutputStream.handleResponseInternal(HTTPConduit.java:1696) at org.apache.cxf.transport.http.HTTPConduit$WrappedOutputStream.handleResponse(HTTPConduit.java:1570) at org.apache.cxf.transport.http.HTTPConduit$WrappedOutputStream.close(HTTPConduit.java:1371) at org.apache.cxf.ext.logging.LoggingOutputStream.postClose(LoggingOutputStream.java:53) at org.apache.cxf.io.CachedOutputStream.close(CachedOutputStream.java:228) at org.apache.cxf.transport.AbstractConduit.close(AbstractConduit.java:56) at org.apache.cxf.transport.http.HTTPConduit.close(HTTPConduit.java:671) at org.apache.cxf.interceptor.MessageSenderInterceptor$MessageSenderEndingInterceptor.handleMessage(MessageSenderInterceptor.java:63) at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308) at org.apache.cxf.endpoint.ClientImpl.doInvoke(ClientImpl.java:530) at org.apache.cxf.endpoint.ClientImpl.invoke(ClientImpl.java:441) at org.apache.cxf.endpoint.ClientImpl.invoke(ClientImpl.java:356) at org.apache.cxf.endpoint.ClientImpl.invoke(ClientImpl.java:314) at org.apache.cxf.endpoint.ClientImpl.invokeWrapped(ClientImpl.java:349) at org.apache.cxf.jaxws.DispatchImpl.invoke(DispatchImpl.java:322)

aaron-kumar commented 2 years ago

@HegeErdal : You can safely ignore this unless this is affecting the transmission of the document. In case, it is affecting transmission then I need more details about criteria /condition when you face this issue?

aaron-kumar commented 2 years ago

@HegeErdal Any update on this issue. Stale issue will be automatically closed.

HegeErdal commented 2 years ago

I don't have any updates on this issue. We still just ignore this issue, though I'm not sure I like it as this basically happens during the validation of the respons from the PEPPOL network. Though, it has happened very rarely the last 2 months, so we won't put any more efforts into finding a solution.

technical-debt-collector commented 2 years ago

We too experience issues with this. Are these errors safe to ignore, i.e. is this guaranteed to occur only after an otherwise successful transmission or can it also occur after transmission failures? In case of the former then we'll put something in place to ignore this too but we'd prefer having this fixed if that's possible.

aaron-kumar commented 2 years ago

@HegeErdal , @technical-debt-collector : Since you mentioned that you occasionally face this issue. Can you let me know scenarios/conditions when it occur so that I can reproduce the issue and provide fix if possible. Though based on my initial analysis on stacktrace it is safe to ignore. But if you feel that it's frequency is high, affecting transmission or problem getting desired response then please share results/scenarios/conditions when it occur.

technical-debt-collector commented 2 years ago

I can't really see this occurring under specific conditions. But for the last 7 days we've experienced this a total of 27 times. 2022-09-01 - 47104 transmissions - error occurred 3 times 2022-08-31 - 21276 transmissions - error occurred 7 times 2022-08-30 - 24313 transmissions - error occurred 8 times 2022-08-29 - 21162 transmissions - error occurred 3 times 2022-08-28 - 4786 transmissions - error occurred 3 times 2022-08-27 - 7998 transmissions - error occurred 1 time 2022-08-26 - 22137 transmissions - error occurred 2 times They look pretty spaced out, and rarely occur more than once per 3-10 hours (although I suppose the time inbetween the occurrences depend entirely on when we do transmissions).

I'm guessing the number of transmissions don't matter either as the error makes it sound like some cache has expired and needs reinitialization, unfortunately claiming a victim or two in the process :)

dladlk commented 2 years ago

Can it be that you run Oxalis sending in parallel threads? Not one-by-one, but with potentially several documents at the moment on the same instance of Oxalis?

So it is some kind of concurrent caching of receiver certificate - and if 2 threads send some payload to the same AP certificate - one of them fails to properly process receiver response.

technical-debt-collector commented 2 years ago

Could definitely be something like that. Our application is essentially just a Quarkus API that handles transmissions for incoming requests, thus potentially running a number of concurrent requests to the same AP equal to the number of the underlying Quarkus worker thread pool.

dladlk commented 2 years ago

If possible, maybe you can confirm it somehow via log analysis? E.g. when you get such error, was there an attempt to send a document to the same AP or with the same certificate. If you can prove it, it should be easier for us to reproduce the issue to fix such case... As I understand, the whole document is considered undelivered - but actually it was successfully received, right? This issue can lead to duplication of documents at receiver side which is not good...

technical-debt-collector commented 2 years ago

This solution sends to 1-3 accesspoints in total. A few of the consumers we're handling transmissions for send significantly more transmissions than the rest, and these generally send to the same AP. It is also typical for the consumers to send batches of transmissions at a time. I'm not sure if I'll be able to confirm there being another document transmission taking place concurrently with another one that fails with this error, but I'd say it's very likely that this can happen.

All the documents I've checked seems to have been successfully transmitted after receiving this error, currently. From what I understand the transmission completes successfully, and if the error occurs then that's what we're left with. So essentially we're no longer able to prove that the transmission went successfully, but yet receive receipts from the receiver indicating otherwise.

technical-debt-collector commented 2 years ago

If @aaron-kumar thinks the error can be safely ignored then that's definitely reassuring. So far we haven't seen a case in which this error has occurred after a failed transmission (although I haven't checked all of our transmissions that are set to failed because of this error).

aaron-kumar commented 2 years ago

@technical-debt-collector I asked following question: "But if you feel that it's frequency is high, affecting transmission or problem getting desired response then please share results/scenarios/conditions when it occur." This question also includes whether there was successful transmission and whether receipt file persisted in your system correcty to prove successful transmission/acknowledgement.

I also want to know whether this error terminate transmission in between or it will continue? Are you re-sending in case your connection terminate during the reception of response?

If you have receipt file even for cases where you getting this stacktrace and transmission is Not terminated prematurely and you are Not re-transmitting after successful persistence of receipt file then it is safe to ignore.

technical-debt-collector commented 2 years ago

@aaron-kumar , I've looked through our logs more thoroughly, though only about 15 days back in time as it's quite cumbersome to check each case. From what I can see, all transmissions affected in this time range seem to have completed successfully. We know this because all these failures have gotten a positive receipt back from the receiver.

Based on observations mentioned above it sounds to me like the error only occurs after a successful transmission, although I personally don't think the number of these failures and their result is high enough to appear convincing. No, we're not attempting any resending in these cases.

aaron-kumar commented 2 years ago

Based on your response @technical-debt-collector , I am moving it into discussion so that we can continue keep on tracking the things. In future version, we would like to address this cache related stacktrace.