Another data point: I'm unable to pull from docker hub on hal nodes from an active session:
[zamparol@gpu-2-13 ~]$ docker pull lzamparo/basset:latest
Pulling repository lzamparo/basset
Get https://index.docker.io/v1/repositories/lzamparo/basset/images: dial tcp: lookup index.docker.io: Temporary failure in name resolution
DNS gremlins?
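A minimal check to confirm whether name resolution itself is failing, assuming dig is available on the node (index.docker.io is just the name from the failing pull above):

# Ask the node's configured resolver; a timeout here matches the error above.
dig +short +time=2 +tries=1 index.docker.io

# Ask a public resolver directly; if this answers while the query above
# times out, the locally configured/upstream resolvers are the likely culprit.
dig +short +time=2 +tries=1 index.docker.io @8.8.8.8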
Looking.
We are seeing major delays to upstream DNS servers. And also heavy traffic in and out of the hal network. I am trying to determine what that traffic is.
Traffic involves a system in Switzerland. Looking further.
Not the root cause. We have confirmed with MSKCC IT that there are upstream connectivity issues. We've reduced some traffic flows to help in the short term. We will update as information becomes available.
Thanks for looking into this. Unfortunately, the connectivity problem still persists for me.
I've not said it's resolved in any way. The problem continues. We have no further information. We believe there to be upstream connectivity issues impacting many things.
Now my commands went through (after quite a wait), so the reduction in traffic might have helped after all. Thanks!
It's intermittent, and I believe it's due to packet loss a few hops away. I have no formal statement of that; it's based on my own tests. Packet loss causes a variety of odd problems, and in particular name resolution is affected because remote nameservers cannot be reached reliably.
As I get updates I will provide them.
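For anyone who wants to run a similar test themselves, a rough sketch (this assumes ping is permitted toward the resolvers listed in /etc/resolv.conf; ICMP may be filtered, in which case the loss figures are only a hint):

# Check for packet loss toward each configured nameserver (20 probes each).
for ns in $(awk '/^nameserver/ {print $2}' /etc/resolv.conf); do
    echo "== $ns =="
    ping -q -c 20 "$ns"
done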
MSKCC IT continues to investigate. No further status at this time.
My DNS lookup succeeded just now too, FWIW.
We are attempting to work around the problem in various ways but there remains an overall problem of not yet determined origin at the network level.
We've been told other servers at MSKCC are having the same issue. Waiting for a response on a ticket opened by CBIO staff.
The problem remains under investigation with MSKCC's ISP. Our workaround may mitigate some of the problems but not all. I will keep people informed, but the underlying problem is NOT resolved.
I have been informed that the ISP claims the observed issue is fixed on their end, and my dig-based tests appear to confirm that.
No further data at this time.
I am leaving in place the workaround that provided connectivity to the root DNS servers during the period when several of them appeared to be unreachable beyond our network attachment point.
I will revert to the standard caching nameserver configuration if I do not see this recur within 72 hours.
I am adding a set of resolver checks that act as "canaries" for similar problems (as in canaries in a coal mine). Thank you for being today's canaries.
I will leave this open to remind me of that effort.
Automated checks added. We'll see if this reappears.
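For reference, a resolver canary along these lines can be as simple as a cron-driven script. This is only a sketch; the test names and the alerting hook are placeholders, not the actual check that was deployed:

#!/bin/bash
# Resolver canary: resolve a few well-known names and exit non-zero
# (logging to stderr) if any lookup times out or returns nothing.
NAMES="index.docker.io github.com example.org"   # placeholder test names
status=0

for name in $NAMES; do
    # dig +short prints one address per line on success; on a timeout it
    # prints a ";; ..." diagnostic instead, and on NXDOMAIN nothing at all.
    if ! dig +short +time=3 +tries=1 "$name" | grep -qv '^;'; then
        echo "$(date) DNS canary: lookup failed for $name" >&2
        status=1
    fi
done

exit $status

Run from cron every few minutes, the non-zero exit can feed whatever alerting is already in place.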
We have a lab git server running to manage our code repo. So far we have not had any problems communicating with this server from hal. However, I now get an error message; contacting the server from other machines works flawlessly. I will send more detailed information in a separate e-mail.