cBio / cbio-cluster

MSKCC cBio cluster documentation

READ FIRST: Reporting Update Problems #385

Closed by tatarsky 8 years ago

tatarsky commented 8 years ago

I am going to ask that everyone check Git first before reporting any problems. All users of Hal should strongly consider following this Git group for a few days as we work out any kinks.

A CONSIDERABLE amount of physical change has occurred, along with a fair amount of software updates and a scheduler config migration to a new host.

Problems will be handled in the order received and also by impact level. No new requests for items will be processed for a bit as we go through any update tasks.

jchodera commented 8 years ago

Can you also link to the documentation here for all of the changes that have been made?

tatarsky commented 8 years ago

When said documentation is compiled, yes. I have a running set of notes and will place it on the wiki shortly.

jchodera commented 8 years ago

Suggestion:

tatarsky commented 8 years ago

I was starting items here. More to follow.

https://github.com/cBio/cbio-cluster/wiki/Upgrade-Effort-March-6th-to-March-9th-2016-Hal-Cluster

akahles commented 8 years ago

Thanks to everybody involved for successfully completing this enormous effort! I am not quite sure whether this is the correct place to report issues - please correct me if not.

Before the update it was possible to ssh into nodes to check the status of single processes. This seems no longer possible:

ssh cpu-6-2
Password: 
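For context, the kind of passwordless per-node check that used to work is along these lines (the node name and ps fields here are just an example):

ssh cpu-6-2 'ps -u $USER -o pid,pcpu,pmem,etime,args'   # example only: list my own processes on a node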
tatarsky commented 8 years ago

I would prefer a separate Git issue per item. I will look, however, as we've not changed node login abilities.

tatarsky commented 8 years ago

Newer SSH is slightly more picky about the reverse DNS match, and I'm fixing it. (This involves our allowing of the alias "hal" for mskcc-ln1.fast.)
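For anyone who wants to see the kind of mismatch involved, comparing the forward and reverse lookups for a node shows it (the node name is just an example):

ip=$(getent hosts cpu-6-2 | awk '{print $1}')   # forward lookup: name -> IP
getent hosts "$ip"                              # reverse lookup: IP -> the canonical name SSH now checks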

tatarsky commented 8 years ago

Please retry. I believe it was actually only on those cpu-* nodes, due to a config item I am adding to puppet. Please confirm, however, that it "works for you" now on both those and, say, a GPU node.

akahles commented 8 years ago

Thanks for looking into this. It works now on many nodes but not on gpu-1-4, gpu-2-4, gpu-2-5, gpu-2-6, and gpu-2-7 through gpu-2-17. It works on the cpu-* and cc* nodes.

akahles commented 8 years ago

gpu-2-8 seems to have another issue:

ssh gpu-2-8
ssh: connect to host gpu-2-8 port 22: No route to host
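A basic reachability check (illustrative only) separates a node that is down or unreachable from the DNS issue above:

ping -c 2 gpu-2-8   # "no route to host" here as well would point at the node or network, not the SSH config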

tatarsky commented 8 years ago

gpu-2-8 is broken. Exxact has been notified. Checking for same config item on the nodes you mention.

tatarsky commented 8 years ago

Please try just gpu-1-4 so I can make sure I'm fixing the right thing.

kuod commented 8 years ago

@akahles stepped away from his desk but I can confirm I can ssh into gpu-1-4

akahles commented 8 years ago

Yep, works now.

tatarsky commented 8 years ago

OK. I believe I see the matter. I will be fixing the /etc/hosts entry for hal that we put on nodes to allow the use of that name as an alias. That entry is conflicting with the now-default DNS validation of the system's reverse IP, which is not actually named hal. (SSH has become more strict in its hostbased auth checks.)

I will advise when done.
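For reference, the entry involved looks roughly like the hypothetical line below (the IP shown is made up), and comparing it against the reverse DNS lookup shows the conflict:

# the node-side /etc/hosts entry looks roughly like this hypothetical line (IP is made up):
#   192.0.2.10   mskcc-ln1.fast   hal
grep -w hal /etc/hosts                                     # the actual local alias entry being adjusted
host "$(getent hosts mskcc-ln1.fast | awk '{print $1}')"   # reverse DNS gives the canonical name, not hal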

tatarsky commented 8 years ago

Verify one more time for me, as I'm trying to puppetize the fix on gpu-1-4. There is a larger fix, which I will do later, involving hostbased authentication within the cluster, but I want to review the security items that were added.

akahles commented 8 years ago

Sorry for being absent, was in a meeting. Can confirm login to all nodes but gpu-1-8, which is offline for other reasons. Ran:

for node in $(pbsnodes -a | grep -e "^[^[:space:]]"); do echo $node; ssh $node sleep 1; done

tatarsky commented 8 years ago

Please confirm you really meant gpu-2-8. I show no issues on gpu-1-8.

akahles commented 8 years ago

Sorry - gpu-2-8 ... typo. Approaching the end of the week ...

tatarsky commented 8 years ago

I hear you man ;)

vkuryavyi commented 8 years ago

I submitted 40 jobs to the hal cluster this morning (03.14). Usually they all quickly pass the 'Q' status. This time I see only 8 of my jobs running at a time while all the others stay in 'Q'. I see a total of 173 jobs running on the cluster.
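For reference, a rough way to see these counts is along these lines (the awk column index is an assumption about the default qstat -u layout and may differ):

qstat -u $USER                                                   # list my jobs; the S column is the state (Q = queued, R = running)
qstat -u $USER | awk 'NR > 5 {print $(NF-1)}' | sort | uniq -c   # rough tally of my jobs by state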

tatarsky commented 8 years ago

@vkuryavyi please open a git issue directly. Do not add here.