OxalisCommunity / oxalis

Oxalis - PEPPOL Access Point open source implementation - Core component
Other
121 stars 90 forks source link

SML lookup problems #494

Closed carher closed 7 months ago

carher commented 3 years ago

Hi! In production we are experiencing weird SML problems that come and go. This is something that never was an issue in our test environment, and so far it has affected only one recipient. Which could mean something, or could be totally up to chance, I don't know.

From time to time, when trying to send a file, Oxalis throws an exception:

n.d.v.p.l.a.LookupException: Identifier '0192:986252932' is not registered in SML.\n\tat n.d.v.p.l.l.BusdoxLocator.lookup(BusdoxLocator.java:60)\n\tat n.d.v.p.l.LookupClient.getServiceMetadata(LookupClient.java:80)\n\tat n.d.v.p.l.LookupClient.getEndpoint(LookupClient.java:114)\n\t

Most of the time, this lookup seems to work, and suddenly it fails. It may then even fail for quite a while before suddenly working again.

I have been trying to figure out which URL we actually access, in order to do some network supervision, but this doesn't seem very straight forward. Could anyone help us with that? Or even better, help us understand why this issue arises?

We first thought the issue was a centralized problem, but CEF-EDELIVERY-SUPPORT@ec.europa.eu suggests it is a localized problem on our end:

Let us inform you, that as your issue is related to the lookup phase, the SML itself was not involved into it, because the responsible component is the DNS service.

For that reason we have raised a ticket to our DataCenter which maintains the DNS servers (that is the reason of the slower response).

They confirmed, that no DNS issue were reported for other DNS entry in zone tech.ec.europa.eu like this one, at Reverse Proxy level service was stable and no anomalies either detected on Load balancers.

So this means we could not detect any issue on our side.

Probably it was a temporary network issue on your side.

We suggest in the case similar issue would happen next time, to do DNS queries without access point(using nslookup or dig).

aksamit commented 3 years ago

Hi @carher ,

maybe if you could provide the complete stack trace it would be possible to deduct what is the cause of this exception.

I see three possible causes here:

The SML lookup URL for the Service Group interface for this participant is http://b-af698985e2b17102e6a02222b52f9ba9.iso6523-actorid-upis.edelivery.tech.ec.europa.eu/iso6523-actorid-upis%3A%3A0192%3A986252932 From that interface you can see all available Service Metadata Reference URLs.

To obtain a specific Service Metadata Reference URL you can use this tool: https://my.galaxygw.com/participantlookup/#/0192/986252932 Then click on any of the red boxes for each document type and the link is available under "Document SMP URL".

To give some perspective, I checked some send statistics obtained from Tickstars Galaxy Gateway service, and we processed around 130 transactions to this receiver for the past 3 months, without a single SMP lookup problem. So I suspect either a problem in the SMP lookup implementation of Oxalis, not likely since it would have been a known problem, given that large user base, so I believe the problem might be in your local setup, possibly with DNS.

Hope this helps you look into the right direction.

carher commented 3 years ago

Thanks for the reply. We continue to look into DNS or other network issues. The lookup URL you provided should help us a lot.

This is the complete stacktrace from the log:

"stacktrace":"n.d.v.p.l.a.LookupException: Identifier '0192:986252932' is not registered in SML.\n\t at n.d.v.p.l.l.BusdoxLocator.lookup(BusdoxLocator.java:60)\n\t at n.d.v.p.l.LookupClient.getServiceMetadata(LookupClient.java:80)\n\t at n.d.v.p.l.LookupClient.getEndpoint(LookupClient.java:114)\n\t at n.d.o.o.l.CachedLookupService.load(CachedLookupService.java:79)\n\t at n.d.o.o.l.CachedLookupService.load(CachedLookupService.java:46)\n\t at c.g.c.c.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3528)\n\t at c.g.c.c.LocalCache$Segment.loadSync(LocalCache.java:2277)\n\t at c.g.c.c.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2154)\n\t at c.g.c.c.LocalCache$Segment.get(LocalCache.java:2044)\n\t at c.g.c.c.LocalCache.get(LocalCache.java:3952)\n\t at c.g.c.c.LocalCache.getOrLoad(LocalCache.java:3974)\n\t at c.g.c.c.LocalCache$LocalLoadingCache.get(LocalCache.java:4958)\n\t at n.d.o.o.l.CachedLookupService.lookup(CachedLookupService.java:71)\n\t ... 6 common frames omitted\n Wrapped by: n.d.o.a.l.OxalisTransmissionException: Identifier '0192:986252932' is not registered in SML.\n\t at n.d.o.o.l.CachedLookupService.lookup(CachedLookupService.java:73)\n\t at n.d.o.a.l.LookupService.lookup(LookupService.java:57)\n\t at n.d.o.o.t.TransmissionRequestBuilder.build(TransmissionRequestBuilder.java:208)\n\t at n.s.p.o.OutgoingHandler.transmitFile(OutgoingHandler.java:330)\n\t at n.s.p.o.OutgoingHandler.performPoll(OutgoingHandler.java:237)\n\t at n.s.i.c.c.PermanentLeaderElector.doIfLeader(PermanentLeaderElector.java:76)\n\t at n.s.p.o.OutgoingHandler.run(OutgoingHandler.java:186)\n"

What is curious, is how most of the time we successfully send messages to this recipient, and then suddenly it will fail for some time, and without any interaction from us it will start working again.

EDIT: Adding info that we are using oxalis version 4.0.4, and it seems that BusdoxLocator comes from peppol-lookup-1.1.2.jar.

carher commented 3 years ago

Update: After running tcpdump and experiencing both successful and unsuccessful attempts, it is still quite difficult to understand exactly what is happening. However, what we often see is a delayed FIN, ACK that doesn't seem successful, and that gets retransmitted a lot of times. The key word here could be delayed - it seems that Oxalis somehow doesn't want to close the TCP session even after HTTP 200 is received? Is this a known issue?

FrodeBjerkholt commented 3 years ago

It is worth mentioning that the BusdoxLocator implementation is a bit simple. It will respond with "Identifier '%s' is not registered in SML." even in situations where the SML responds with a HTTP status like 500 or when there is a Connection Timeout. I think that there should be some code that checks the reason why new Lookup(hostname).run() == null. I see for instance that the Lookup class has a getErrorString method.

    @Override
    public URI lookup(ParticipantIdentifier participantIdentifier) throws LookupException {
        // Create hostname for participant identifier.
        String hostname = hostnameGenerator.generate(participantIdentifier);

        try {
            if (new Lookup(hostname).run() == null)
                throw new LookupException(
                        String.format("Identifier '%s' is not registered in SML.", participantIdentifier.getIdentifier()));
        } catch (TextParseException e) {
            throw new LookupException(e.getMessage(), e);
        }

        return URI.create(String.format("http://%s", hostname));
    }
carher commented 3 years ago

Still trying to figure out this problem. I have downloaded the code and look around. Sometimes I find log statements encapsuled by code like this: if (Options.check("verbose"))

I haven't figured out how to turn on verbose, though. What is the trick?

tomasgr commented 3 years ago

We have had the samme issue for the last 4 weeks with sporadic timeouts on lookup. @carher, check out #493, we are currently trying it and no issues so far

aaron-kumar commented 3 years ago

PR #493 added via commit e6ae52f8c0f3c311f9c959e17ae07767b77f9917. Will be available in 5.0.0.RC2

aaron-kumar commented 1 year ago

@carher : Are you still facing this issue? I can't see your comment after HTTP client based fix, I assume that you are Not facing this issue anymore.

Note: This issue will automatically moved to discussion if No response received in 1 week time.

aaron-kumar commented 7 months ago

Closing issue , since no such issue was reported since more than 1 year.