Closed by tatarsky 8 years ago.
Can you also link to the documentation here for all of the changes that have been made?
When said documentation is compiled. I have a running set of notes and will place it as a wiki shortly.
Suggestion: add a "Major Changes" section to the wiki where you dump the change notes now.

I was starting items here. More to follow.
https://github.com/cBio/cbio-cluster/wiki/Upgrade-Effort-March-6th-to-March-9th-2016-Hal-Cluster
Thanks for successfully completing this enormous effort to everybody involved! I am not quite sure whether this is the correct place to report issues - please correct me if not.
Before the update it was possible to `ssh` into nodes to check the status of single processes. This seems no longer possible:

```
ssh cpu-6-2
Password:
```
I would prefer a separate Git issue per item. I will however look as we've not changed node login abilities.
Newer SSH is slightly pickier about reverse DNS matching, and I'm fixing it. (This involves our allowing of the alias "hal" for mskcc-ln1.fast.)
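As an aside, one way to sanity-check the forward and reverse mappings that a stricter SSH compares is with `getent` (a sketch; `localhost` below is a placeholder for a real node name such as cpu-6-2, which only resolves on the cluster):

```shell
# Sketch of a forward/reverse DNS consistency check.
# "localhost" is a placeholder; on the cluster you would use a node name.
host="localhost"
ip=$(getent hosts "$host" | awk '{print $1; exit}')
echo "forward: $host -> $ip"
# Reverse lookup of that IP; SSH expects this to match the name it connected to.
getent hosts "$ip" | awk -v ip="$ip" '{print "reverse: " ip " -> " $2; exit}'
```

If the reverse lookup returns a different name than the one you connected with (for example, the canonical hostname instead of an alias), strict host checks can fail.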
Please retry. I believe it was actually only on those cpu-* nodes, due to a config item I am adding to Puppet. Please confirm, however, that it "works for you" now on both those and, say, a GPU node.
Thanks for looking into this. It works now on many nodes, but not on gpu-1-4, gpu-2-4, gpu-2-5, gpu-2-6, and gpu-2-7 up to gpu-2-17. It works on the cpu-* and cc* nodes.
gpu-2-8 seems to have another issue:

```
ssh gpu-2-8
ssh: connect to host gpu-2-8 port 22: No route to host
```
gpu-2-8 is broken. Exxact has been notified. Checking for the same config item on the nodes you mention.
Please try just gpu-1-4 so I can make sure I'm fixing the right thing.
@akahles stepped away from his desk but I can confirm I can ssh into gpu-1-4
Yep, works now.
OK. I believe I see the matter. I will be fixing the /etc/hosts entry for hal on the nodes, which we put in to allow the use of that name as an alias. That entry is conflicting with the now-default DNS validation of the system's reverse IP, since the system is not actually named hal. (SSH has become stricter in its host-based auth checks.)
I will advise when done.
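For illustration, a minimal sketch of the kind of /etc/hosts entry being described — the IP address here is a placeholder, not the cluster's actual value:

```
# /etc/hosts on a node (placeholder IP).
# "hal" is carried as an alias of the login node's canonical name.
# Reverse DNS on the IP returns the canonical name, not "hal",
# so SSH's stricter host-based checks reject connections made via the alias.
10.0.0.1    mskcc-ln1.fast    hal
```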
Verify one more time for me, as I'm trying to puppetize the fix on gpu-1-4. There is a larger fix involving host-based authentication within the cluster, which I will do later, but I want to review the security items that were added.
Sorry for being absent, was in a meeting. Can confirm login to all nodes but gpu-1-8, which is offline for other reasons. Ran:

```shell
for node in $(pbsnodes -a | grep -e "^[^[:space:]]"); do echo $node; ssh $node sleep 1; done
```
Please confirm you really meant gpu-2-8. I show no issues on gpu-1-8.
Sorry - gpu-2-8 ... typo. Approaching the end of the week ...
I hear you man ;)
I submitted 40 jobs to the hal cluster this morning (03.14). Usually they all quickly pass the 'Q' status. This time I see only 8 of my jobs running at a time while all the others are in Q. I see a total of 173 jobs running on the cluster.
@vkuryavyi please open a git issue directly. Do not add here.
I am going to ask that everyone attempt to check Git first before reporting any problems. All users of Hal should strongly consider following this Git group for a few days as we work out any kinks.
A CONSIDERABLE level of physical change has occurred, along with a fair amount of software updating and a scheduler config migration to a new host.
Problems will be handled in the order received and also by impact level. No new requests for items will be processed for a bit as we go through any update tasks.