Closed vkuznet closed 2 years ago
I assume this is intermittent? The first thing I'll check is that the DNS entries all point to nodes with an ingress on them.
Didn't your hosts have some DNS issue before? Because I confirm that even from my desktop at home, cms-rucio.cern.ch resolves correctly to two IPs. And both those nodes in k8s have role=ingress.
The error message sure looks like basic DNS lookup is failing, not that it's failing to connect or find a service on the IP address.
Yes, the issue is intermittent: sometimes everything is fine, while on another attempt it is not. I can see it both in production k8s and at home when running DAS from my local laptop. I suspect it is related to concurrent load on the DNS server(s).
I think this relates to an old discussion about race conditions in the Go network stack. I changed the production server to use a queue and constrained it to a maximum of 100 concurrent requests. After a few test iterations I no longer see the problem. I'll leave the ticket open and will check site queries to verify whether this fixes the problem.
I also found yet another discussion on this topic, see the following ticket
We no longer see this issue, closing.
I got a report from Felipe Gómez-Cortés, who said that the DAS web UI returns different results for the following query:
After a series of iterations I confirmed that this is the case using my dev environment. I identified that the problem is related to the inability of DAS to contact the cms-rucio service, which yields the following errors:
We need to identify the source of this issue. @ericvaandering any ideas?