nariai closed this issue 8 years ago
Here are the steps:
If a system does not have a DVD drive, a USB drive should work, or we could set up network installs, which just take longer; for this few nodes I would encourage the DVD path.
Is this clear?
Thanks. Then I guess Hiroko will need to coordinate with SDSC to get cables and connections, and do the software configuration. Let's talk about this over Skype next week.
I have asked SDSC for an update on the 10G fiber they promised last week. That is NOT the same thing. If you need suggested part numbers for 1G SR SFP LC connectors, let me know.
SDSC claims they will have the parts next week. They forgot our order.
My suggestion is that once the 10G request is complete, we start a new ticket to see if we can get this 1G link. It might be confusing if done at the same time.
The 1G request is being tracked as its own GitHub issue (#51) due to the SDSC component.
Hi Paul, we are now ready to start migrating compute nodes from the old cluster to the new cluster. Can you start the process? If there is something that we need to do on-site, please let us know. If you think it's better to have a face-to-face meeting via Skype, please let me know.
The initial task is in your hands for this one.
Basically, take a gander at "step 2" a few comments up.
You will need to download a CentOS 6.6 DVD (I can provide a link) and then pick a node to work through the steps with me. Basically it will be a very basic CentOS install: assigning the same address the unit had before and doing what is called a "minimal" install.
It will ask you a few items, such as partition sizes; I can review those with you on the first one, and then you just repeat the process.
Once it's on the network, I take over and Puppet does the rest.
When can you start with the DVD on a starter node?
Grab the following DVD and let me know when you want to go through the steps. Unless by some chance the nodes have an IPMI KVM or something I will only be able to describe these steps to you.
ftp://ftp.uci.edu/mirrors/centos/6.6/isos/x86_64/CentOS-6.6-x86_64-bin-DVD1.iso
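Before burning, it may be worth verifying the download against the mirror's checksum file. A sketch, assuming the standard CentOS mirror layout (confirm the checksum filename on the mirror itself):

```shell
# Sketch: verify the ISO against the mirror's published checksums.
# The sha256sum.txt filename is assumed from the usual CentOS mirror layout.
wget ftp://ftp.uci.edu/mirrors/centos/6.6/isos/x86_64/sha256sum.txt
grep CentOS-6.6-x86_64-bin-DVD1.iso sha256sum.txt | sha256sum -c -
```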
Note that this is based on @hirokomatsui's confirmation that the nodes all have DVD drives. You will need to attach a keyboard and monitor to the machines, but SDSC has lots of those.
I see some IP addresses for IPMI, but I don't seem to be able to connect to them. It would probably make sense to have a Skype call to review the options and the quickest way forward.
I will try to work on it tomorrow, otherwise on Friday. I think I'm OK to install CentOS and get the network connection by myself. I'll let you know tomorrow once I've fixed my schedule.
Just to be very clear: you do not have to make any changes to the network cabling of the node. Just remember the IP address of the one you pick to re-install.
Be sure BTW that the nodes contain no data you wish to save. CentOS will erase it.
Happy to help, just let me know the best method to do so given the noise of the datacenter.
I'll start working at SDSC 10am tomorrow, installing CentOS on cn19.
Sounds good. I have an appointment at 3:00 PM (1:00 PM your time), but I will do my best to help. I am assuming cn19 will come back up on the current IP:
10.0.0.19/16
You will probably want to set default gateway via a new cluster head node. I would go with:
10.0.16.10 (that's hn1)
And if you wish to enter DNS data for starters:
Domain=local nameserver 10.0.16.10
But I can handle all that with the puppet run.
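For reference, those settings would land in the interface config file roughly like this on CentOS 6. A sketch; the device name (eth0) and exact file are assumptions and may differ on these nodes:

```
# Sketch: /etc/sysconfig/network-scripts/ifcfg-eth0 (device name assumed)
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=static
IPADDR=10.0.0.19
NETMASK=255.255.0.0
GATEWAY=10.0.16.10
DNS1=10.0.16.10
DOMAIN=local
```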
Thanks, Paul. BTW, what is your schedule for next week? We want to move all the old compute nodes to the new cluster before the holiday if possible. Please let us know. I'm working at the office until next Wednesday.
I am working the same and happy to help. I will likely take off Thursday and Friday. I will be working the week after.
I'm installing CentOS on cn1 instead of cn19. cn12-cn19 don't have DVD drives; we bought them later than the others.
OK. If you need my input on any of the install questions, my line is clear: (608)-271-6817
Do cn12-cn19 have USB ports?
If so, starting with 6.5 you can burn the ISO to a big enough USB drive. https://wiki.centos.org/HowTos/InstallFromUSBkey
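The HowTo linked above boils down to writing the hybrid ISO directly to the stick, along these lines. A sketch; /dev/sdX is a placeholder, so identify the correct device first:

```shell
# Sketch: write the hybrid ISO to a USB stick (CentOS 6.5+ ISOs boot this way).
# /dev/sdX is a placeholder -- check `dmesg` after inserting the stick.
# WARNING: this destroys all data on the target device.
dd if=CentOS-6.6-x86_64-bin-DVD1.iso of=/dev/sdX bs=4M
sync
```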
We can copy the whole disk image of cn1 to the others using Ganglia, which Microway modified slightly: https://flc.ucsd.edu/mcms/ I'll check it later.
cn1 is connected. I'll work on the other stuff there and start copying the disk.
I am on cn1 as root and checking it.
I'm not sure what disk copy method they are using in that link.
But if that works, it sounds reasonable, with the following warning:
Be very careful copying images that have configured network interfaces, or they will conflict. You will have to change the config of a copied image in a few places.
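The per-node places that typically need fixing on a cloned CentOS 6 image are roughly these. A sketch; the cn2 values are hypothetical examples, not the actual node settings:

```shell
# Sketch: post-clone cleanup on a copied CentOS 6 image
# (values for a hypothetical cn2 -- adjust per node)
vi /etc/sysconfig/network-scripts/ifcfg-eth0     # set the node's own IPADDR
vi /etc/sysconfig/network                        # set HOSTNAME=cn2
rm -f /etc/udev/rules.d/70-persistent-net.rules  # regenerated on boot with the new MAC
rm -f /etc/ssh/ssh_host_*                        # sshd regenerates host keys on start
```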
So far cn1 looks good however.
Please reboot cn1 when convenient so the NetworkManager component is removed. I made the config file change, but it requires a reboot, and the connection appears to have dropped. I assume something is being done on your end.
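For the record, removing NetworkManager from the picture on CentOS 6 amounts to something like this. A sketch of the standard commands; the actual change made here may differ:

```shell
# Sketch: hand the static config to the classic network service on CentOS 6.
chkconfig NetworkManager off   # don't start NetworkManager at boot
service NetworkManager stop
chkconfig network on           # classic network service manages ifcfg-* files
service network restart
```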
cn1 doesn't have a connection for now?
No. It appears to have dropped. I thought you were working on it. Can it be reset?
rebooted!
Thanks. I must have fat fingered something. My apologies. Back on the machine and working on Lustre.
Do you want me to hold off, though? I will be adding a bunch of stuff that is probably unique to the final system...
No, I will leave cn1 alone for now. The old cluster has a system to back up a node and copy it to the others. Can you set up cn1 to migrate to the new cluster (SGE etc. if needed)? Then I'll copy it.
That really isn't going to work. Unique keys and such are created when you deploy many of the items I add. Imaging isn't going to handle that process.
Basically, without knowing anything about this imaging process, I would recommend you do the basic installs and let Puppet do the heavy lifting.
It's a pain, I know...
Or if you do want to try one, do it before I add all the items that I feel would need reinstalling if deployed via clone. I.e., now.
OK, I'll start copying. We've used this system several times when updating the OS.
Give it a try, but I know Puppet, SGE, and Lustre create unique per-node items that, in a cloning environment, are best done after the clone. Otherwise I'll just be reinstalling each item.
If that makes sense.
OK. Shortly I will have to leave for an appointment. I am waiting on Advanced HPC to install the replacement drive. Do you need anything else from me until the nodes are all converted to CentOS?
After talking with the others, I'll install CentOS on the other nodes instead of copying cn1. So please go ahead and modify cn1.
I will research a bit whether I can install CentOS remotely from an NFS-mounted drive. If I can't, I will do it on-site next week.
The CentOS network install process is called Kickstart and I could set up a server for it on one of the head nodes. But I always do it from a CentOS machine.
But it's a bit complex and would require you to work with me from a system console to make sure the node boots via PXE (basically booting from the network card) and that it performs the proper steps after talking to the Kickstart server.
The time this would take is not overly long, but it may delay getting things done if we have issues, compared to the plain old DVD method.
But given there are 18 more systems, I'm willing to give it a go if you can contact me Monday from a node... up to you.
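To give a feel for the setup, the PXE side of a Kickstart server is mostly a small boot-menu config like this. A sketch; the paths, server IP, and ks.cfg URL are assumptions, not the actual configuration:

```
# Sketch: /var/lib/tftpboot/pxelinux.cfg/default on the kickstart server
# (vmlinuz/initrd.img come from the CentOS 6.6 ISO's images/pxeboot
# directory; the ks.cfg URL is an assumption)
default centos6
prompt 0
timeout 50
label centos6
  kernel vmlinuz
  append initrd=initrd.img ks=http://10.0.16.10/ks.cfg ksdevice=eth0
```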
Or I was thinking of modifying the GRUB boot loader. But installing on-site will probably be easier than doing all the research. I'm looking at installing CentOS from the hard drive with an NFS-mounted ISO image; that way, I don't have to insert the DVD or USB into each node one by one.
Basically, I'd like to chat when possible in the morning. I'm very familiar with the options, and I think we should at least try a network install tomorrow. I agree that for 20 systems it is probably faster.
But (and this is why I want to talk) the trick is to make sure we coordinate the process of a PXE boot on one unit first and make sure it is correct. (Before we do the rest).
That is a bit tricky remotely, as the datacenter is loud and you will need to trigger a network boot on a node, which is usually done during the BIOS POST. But if we can get one attempt working, the rest should go fairly fast.
I can start the process of setting up a kickstart server in the morning and when you get in perhaps we can do cn2 as a test. Sound reasonable?
One question for the morning:
There is a DHCP server on flc with some partial config for two hosts (cn1/cn2). I would like to shut that down (and use DHCP on the new kickstart server), but if you are aware of other systems that need DHCP on your backend network, we should be careful.
I'll have the rest of the config ready by morning.
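The DHCP side of a kickstart/PXE server is typically a dhcpd.conf fragment like this. A sketch reusing the 10.0.0.0/16 addresses above; the lease range is a made-up placeholder:

```
# Sketch: /etc/dhcp/dhcpd.conf on the kickstart server
# (server assumed at 10.0.16.10; the lease range is a placeholder)
subnet 10.0.0.0 netmask 255.255.0.0 {
  range 10.0.32.100 10.0.32.200;   # temporary install-time leases
  option routers 10.0.16.10;
  next-server 10.0.16.10;          # TFTP server holding the PXE boot files
  filename "pxelinux.0";
}
```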
I have incorporated your choices from the load on cn1 into the kickstart file. I've only changed the /home partition to a /scratch partition, since we'll Lustre-mount the home directories.
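In kickstart terms, the partition change described above would look something like this. A sketch; the sizes are placeholders, since the real values were taken from the cn1 load:

```
# Sketch: partitioning section of ks.cfg (sizes are placeholders)
clearpart --all --initlabel
part /boot --fstype=ext4 --size=500
part swap --size=16384
part / --fstype=ext4 --size=51200
part /scratch --fstype=ext4 --size=1 --grow  # was /home; homedirs will be Lustre-mounted
```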
Oh, and BTW, once the dust clears a bit I am more than happy to explain the entire method involved. I'm just trying to get these nodes up before the holidays.
I am working on the DHCP component this morning, and from my reading of the motherboard manual I believe I can walk you through a fairly simple PXE (network) boot. There may be a simple hot key for a one-time network boot, or we may have to enter the BIOS to set the network card as the first boot device.
Regardless I am available most of the day although I may have to take a small break to attend an office party.
But I'd really like to get cn2 attempted via this method so the others can be done based on its results.
It's OK to stop the DHCP server on flc. Please let me know whatever I need to do.
OK. My suggestion is we chat for a few minutes because the process will involve a trip to the datacenter and doing a PXE boot on a node. Do you have time for that now?
Quickest to just call my desk: (608)-271-6817 or I can call you...
DHCP is now running on fl-hn1, which is the kickstart server. When you're ready to attempt cn2 from the console, it should be close to correct...
running to SDSC...
Paul,
As you can see, our current bottleneck is the number of nodes in the new cluster, and hence we want to move some of the nodes from the old cluster to the new cluster as soon as possible (maybe eight of the 16 nodes, as a first step).
Can you tell us what needs to be done before migrating the nodes?
Naoki