cBio / cbio-cluster

MSKCC cBio cluster documentation

Problems to access outside DNS root name servers #391

Closed: akahles closed this issue 8 years ago

akahles commented 8 years ago

We run a lab git server to manage our code repository. So far we have not had any problems communicating with this server from hal, but now I get this error message:

$ git push origin master
ssh: Could not resolve hostname SERVER: Temporary failure in name resolution
fatal: The remote end hung up unexpectedly

Contacting the server from other machines works flawlessly. I will send more detailed information in a separate e-mail.
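
In case it helps isolate things, here is a quick check that the failure is in name resolution itself rather than in ssh or git (SERVER stands for our redacted hostname):

$ getent hosts SERVER      # same NSS lookup path that ssh uses
$ dig +short SERVER        # queries the configured nameserver directly

If both fail on hal but succeed on another machine, the problem is on hal's resolver path, not on the git server.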

lzamparo commented 8 years ago

Another data point: I'm unable to pull from Docker Hub on hal nodes from an active session:

[zamparol@gpu-2-13 ~]$ docker pull lzamparo/basset:latest
Pulling repository lzamparo/basset
Get https://index.docker.io/v1/repositories/lzamparo/basset/images: dial tcp: lookup index.docker.io: Temporary failure in name resolution

DNS gremlins?

tatarsky commented 8 years ago

Looking.

tatarsky commented 8 years ago

We are seeing major delays reaching the upstream DNS servers, as well as heavy traffic in and out of the hal network. I am trying to determine what that traffic is.
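
For the curious, the delay shows up directly in dig's reported query time; a healthy lookup comes back in a few milliseconds, and we are seeing multi-second times and outright timeouts (the resolver address below is a placeholder):

$ dig @192.0.2.53 example.com | grep "Query time"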

tatarsky commented 8 years ago

The traffic involves a system in Switzerland. Looking further.
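
For the record, this is roughly the kind of sampling that surfaces such a talker (the interface name and address are placeholders, not our exact commands):

$ tcpdump -ni eth0 -c 1000 2>/dev/null | awk '{print $3}' | sort | uniq -c | sort -rn | head
$ whois 192.0.2.1 | grep -iE 'country|netname'

The first command tallies the busiest source addresses on the uplink; the second attributes the top talker.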

tatarsky commented 8 years ago

That was not the root cause. We have confirmed with MSKCC IT that there are upstream connectivity issues. We've reduced some traffic flows to help in the short term, and we will update as information becomes available.

akahles commented 8 years ago

Thanks for looking into this. Unfortunately, the connectivity problem still persists for me.

tatarsky commented 8 years ago

I haven't said it's resolved in any way. The problem continues, and we have no further information. We believe there are upstream connectivity issues impacting many things.

akahles commented 8 years ago

Now my commands went through (after quite a wait), so the reduction in traffic might have helped after all. Thanks!

tatarsky commented 8 years ago

It's intermittent, and I believe it's due to packet loss a few hops away. I have no formal statement of that; it's based on my own tests. Packet loss causes a variety of odd problems, and name resolution in particular is being impacted because the remote nameservers cannot be reached reliably.

As I get updates I will provide them.
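
My tests are essentially along these lines; mtr sends a fixed number of probes and reports loss per hop, so a lossy hop a few networks out stands out (the target below is just an example):

$ mtr --report --report-cycles 100 8.8.8.8

Loss that begins at an intermediate hop and persists through to the final hop usually indicates real packet loss at that hop rather than ICMP rate limiting.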

tatarsky commented 8 years ago

MSKCC IT continues to investigate. No further status at this time.

lzamparo commented 8 years ago

My DNS lookup succeeded just now too, FWIW.

tatarsky commented 8 years ago

We are attempting to work around the problem in various ways, but an underlying network-level problem of as-yet-undetermined origin remains.

tatarsky commented 8 years ago

We've been told other servers at MSKCC are having the same issue. Waiting for a response on a ticket opened by CBIO staff.

tatarsky commented 8 years ago

The problem remains under investigation with MSKCC's ISP. Our workaround may abate some of the symptoms, but not all. I will keep people informed, but the underlying problem is NOT resolved.

tatarsky commented 8 years ago

I have been informed that the ISP claims the observed issue is fixed on their end, and my dig-based tests appear to confirm that.

No further data at this time.

I am leaving in place the workaround that provided root DNS server connectivity during the period when several of the root servers appeared to be unreachable beyond our network attachment point.

I will revert to the standard caching nameserver configuration if I do not see this recur within 72 hours.
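
For reference, the workaround amounts to forwarding queries to resolvers that remained reachable instead of iterating from the root servers ourselves. Assuming a BIND-style caching nameserver (the forwarder addresses below are placeholders), the relevant piece looks roughly like:

# /etc/named.conf (excerpt; placeholder addresses)
options {
    forwarders { 192.0.2.53; 198.51.100.53; };
    forward only;    # do not fall back to root-server iteration while it is impaired
};

Reverting to the standard caching configuration just means removing the forwarders so named iterates from the root hints again.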

I am adding a set of resolver checks that act as "canaries" for similar problems (as in canaries in a coal mine). Thank you for being today's canaries.

I will leave this open to remind me of that effort.

tatarsky commented 8 years ago

Automated checks added. We'll see if this reappears.
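
The checks are along these lines; the hostnames, timeout, and alert address below are placeholders rather than the exact production script:

#!/bin/bash
# dns-canary.sh: alert if external name resolution breaks again.
# Intended to run from cron; hostnames and alert address are placeholders.
for host in index.docker.io github.com; do
    # +time=5 +tries=1 makes dig fail fast instead of hanging on a sick resolver
    if ! dig +short +time=5 +tries=1 "$host" > /dev/null; then
        echo "DNS canary: lookup of $host failed" \
            | mail -s "DNS canary alert" admin@example.org
    fi
done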