frazer-lab / cluster

Repo for cluster issues.

Migrate nodes from the old cluster to the new cluster #33

Closed nariai closed 8 years ago

nariai commented 8 years ago

Paul,

As you can see, our current bottleneck is the number of nodes in the new cluster system, so we want to move some of the nodes from the old cluster to the new cluster as soon as possible (maybe eight of the 16 nodes, as a first step).

Can you tell us what needs to be done before migrating nodes?

Naoki

tatarsky commented 8 years ago

Here are the steps:

  1. We must get SDSC to bring a 1G fiber drop between your old cluster rack and the new rack CC36. This has these sub-components:
     - A fiber cable, which I believe SDSC will have to provide, as the length needed is unclear.
     - Two SR 1G SFP LC connector optics, which we will have to purchase.
     - We will connect these optics between (I believe) the Cisco Small Business switch in your old cluster rack and the new 1G switch in rack CC36.
     - Given the network ranges are THE SAME, we will want to closely check that my efforts to have Advanced HPC avoid conflicting IP addresses worked as planned.
  2. My suggestion is that, for any number of node systems you wish to migrate, the quickest path is to manually load what is called a "minimal" CentOS 6 install from a CentOS 6 DVD and assign each node the same IP address as before. Once the system can be reached as root, I will run the puppet steps (which I am working on) to convert it from that minimal state to a Lustre client and SGE node.

If the systems do not have a DVD drive, a USB drive should be usable, or we could set up network installs, which just take longer; so I would encourage the DVD path for that few nodes.

Is this clear?

nariai commented 8 years ago

Thanks. Then I guess Hiroko will need to coordinate with SDSC to get the cables and connections, and do the software configuration. Let's talk about this over Skype next week.

tatarsky commented 8 years ago

I have asked SDSC for an update on the 10G fiber they promised last week. That is NOT the same thing. If you need suggested part numbers for 1G SR SFP LC connector optics, let me know.

tatarsky commented 8 years ago

SDSC claims they will have parts next week. They forgot our order.

tatarsky commented 8 years ago

My suggestion is once the 10G request is complete that we start a new ticket entry to see if we can get this 1G link. It might be confusing if done at the same time.

tatarsky commented 8 years ago

The 1G request is being tracked as its own GitHub issue (#51) due to the SDSC component.

nariai commented 8 years ago

Hi Paul, we are now ready to start migrating compute nodes from the old cluster to the new cluster. Can you start the process? If there is something that we need to do on-site, please let us know. If you think it's better to have a face-to-face meeting via Skype, please let me know.

tatarsky commented 8 years ago

The initial task is in your hands for this one.

Basically take a gander at "step 2" up a few.

You will need to download a CentOS 6.6 DVD, which I can provide a link to, and then pick a node to work through the steps with me. Basically it will be performing a very basic CentOS install: assigning the same address the unit had before and doing what is called a "minimal" install.

The installer will ask you a few things, such as partition sizes, which I can review with you on the first node; after that, just repeat the process.

Once it's on the network, I take over and puppet does the rest.

When can you start with the DVD on a starter node?

tatarsky commented 8 years ago

Grab the following DVD and let me know when you want to go through the steps. Unless by some chance the nodes have an IPMI KVM or something I will only be able to describe these steps to you.

ftp://ftp.uci.edu/mirrors/centos/6.6/isos/x86_64/CentOS-6.6-x86_64-bin-DVD1.iso

tatarsky commented 8 years ago

Note clearly this is based on @hirokomatsui having confirmed that the nodes all have DVD drives. You will need to attach a keyboard and monitor to the machines, but SDSC has lots of those.

tatarsky commented 8 years ago

I see some IP addresses for IPMI, but I don't seem to be able to connect to them. It would probably make sense to do a Skype call to review the options and the quickest way forward.

hirokomatsui commented 8 years ago

I will try to work on it tomorrow, otherwise on Friday. I think I'm OK to install CentOS and get the network connection up by myself. I'll let you know tomorrow, after fixing my schedule.

tatarsky commented 8 years ago

Just to be very clear: you do not have to do any changes to the cabling of the network connection to the node. Just remember the IP address of the one you pick to re-install.

Be sure BTW that the nodes contain no data you wish to save. CentOS will erase it.

Happy to help, just let me know the best method to do so given the noise of the datacenter.

hirokomatsui commented 8 years ago

I'll start working at SDSC 10am tomorrow, installing CentOS on cn19.

tatarsky commented 8 years ago

Sounds good. I have an appointment at 3:00PM (1:00PM your time) but I will do my best to help. I am assuming cn19 will come back up on its current IP:

10.0.0.19/16

You will probably want to set default gateway via a new cluster head node. I would go with:

10.0.16.10 (thats hn1)

And if you wish to enter DNS data for starters:

Domain=local nameserver 10.0.16.10

But I can handle all that with the puppet run.
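For reference, on a minimal CentOS 6 install those settings would land in files roughly like the following (a sketch; the `eth0` device name is an assumption):

```
# /etc/sysconfig/network-scripts/ifcfg-eth0 (sketch)
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=static
IPADDR=10.0.0.19
NETMASK=255.255.0.0
GATEWAY=10.0.16.10

# /etc/resolv.conf (sketch)
domain local
nameserver 10.0.16.10
```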

hirokomatsui commented 8 years ago

Thanks Paul. BTW, what is your schedule for next week? We want to move all the old compute nodes to the new cluster before the holidays if possible. Please let us know. I'm working at the office until next Wednesday.

tatarsky commented 8 years ago

I am working the same and happy to help. I will likely take off Thursday and Friday. I will be working the week after.

hirokomatsui commented 8 years ago

I'm installing CentOS on cn1 instead of cn19. cn12-cn19, which we bought later than the others, don't have DVD drives.

tatarsky commented 8 years ago

OK. If you need my input on any of the install questions, my line is clear: (608)-271-6817

tatarsky commented 8 years ago

Do cn12-cn19 have USB ports?

tatarsky commented 8 years ago

If so, starting with 6.5 you can burn the ISO to a big enough USB drive. https://wiki.centos.org/HowTos/InstallFromUSBkey
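Per that wiki page, since 6.5 the ISO is a hybrid image that can be written directly to a stick. A sketch of the procedure, assuming the USB drive shows up as `/dev/sdb` (check `dmesg` first; this erases the stick):

```shell
# Find the USB device node after plugging it in (assumed /dev/sdb here)
dmesg | tail
# Write the ISO directly to the device -- NOT to a partition like /dev/sdb1
dd if=CentOS-6.6-x86_64-bin-DVD1.iso of=/dev/sdb bs=4M
sync
```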

hirokomatsui commented 8 years ago

We can copy the whole disk image of cn1 to the others using ganglia, which was modified a little by Microway: https://flc.ucsd.edu/mcms/ — I'll check it later.

hirokomatsui commented 8 years ago

cn1 is connected. I'll work on the other stuff there and start copying the disk.

tatarsky commented 8 years ago

I am on cn1 as root and checking it.

I'm not sure what disk copy method they are using in that link.

But if that works, sounds reasonable with the following warning:

Be very careful copying images with configured network interfaces, or they will conflict. You will have to change the config of a copied image in a few places.

So far cn1 looks good however.
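The per-node changes alluded to above would, on a stock CentOS 6 clone, be roughly these (a sketch with illustrative values; cn2 and 10.0.0.2 are hypothetical examples):

```
# On a freshly cloned CentOS 6 node (example values; adjust per node):
sed -i 's/^IPADDR=.*/IPADDR=10.0.0.2/' /etc/sysconfig/network-scripts/ifcfg-eth0
sed -i '/^HWADDR=/d' /etc/sysconfig/network-scripts/ifcfg-eth0   # the clone's MAC differs
rm -f /etc/udev/rules.d/70-persistent-net.rules                  # regenerated on boot for the new MAC
sed -i 's/^HOSTNAME=.*/HOSTNAME=cn2.local/' /etc/sysconfig/network
rm -f /etc/ssh/ssh_host_*                                        # host keys regenerate when sshd restarts
```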

tatarsky commented 8 years ago

Please reboot cn1 when convenient to remove the NetworkManager component. I made the config file change, but it requires a reboot, and the machine appears to have dropped off the network. I assume something is being done.

hirokomatsui commented 8 years ago

cn1 doesn't have a connection right now?

tatarsky commented 8 years ago

No. It appears to have dropped. I thought you were working on it. Can it be reset?

hirokomatsui commented 8 years ago

rebooted!

tatarsky commented 8 years ago

Thanks. I must have fat fingered something. My apologies. Back on the machine and working on Lustre.

Do you want me to hold, however? I will be adding a bunch of stuff that is probably unique to the final system....

hirokomatsui commented 8 years ago

No, I will stay off cn1 for now. The old cluster has a system to back up a node and copy it to the others. Can you set up cn1 for migration to the new cluster (SGE etc.) if needed? Then I'll copy it.

tatarsky commented 8 years ago

That really isn't going to work. Unique keys and such are created when you deploy many of the items I add. Imaging isn't going to handle that process.

tatarsky commented 8 years ago

Basically without knowing anything about this imaging process I would recommend you do the basic installs and let puppet do the heavy lifting.

tatarsky commented 8 years ago

It's a pain, I know....

tatarsky commented 8 years ago

Or if you do want to try one, do it before I add all the items that I expect would need reinstalling if done via clone. I.e., now.

hirokomatsui commented 8 years ago

OK, I'll start copying. We've used the system several times before when updating the OS.

tatarsky commented 8 years ago

Give it a try, but I know Puppet, SGE, and Lustre create unique items per node that, in a cloning environment, are best done "after clone." Otherwise I'll just be reinstalling each item.

If that makes sense.

tatarsky commented 8 years ago

OK. Shortly I will have to leave for an appointment. I am waiting on Advanced HPC to install the replacement drive. Do you need anything else from me until the nodes are all converted to CentOS?

hirokomatsui commented 8 years ago

After talking with the others, I'll install CentOS on the other nodes instead of copying cn1. So please go ahead and modify cn1.

hirokomatsui commented 8 years ago

I will research a bit whether I can install CentOS remotely from an NFS-mounted drive. If I can't, I will do it on-site next week.

tatarsky commented 8 years ago

The CentOS network install process is called Kickstart and I could set up a server for it on one of the head nodes. But I always do it from a CentOS machine.

But it's a bit complex and would require you to work with me from a system console to make sure the node boots via what is called PXE (basically you boot from a network card) and that it performs the proper steps after talking to the Kickstart server.

The time this would take is not overly long, but if we hit issues it may delay getting things done compared to the plain old DVD method.

But given there are 18 more systems I'm willing to give it a go if you can contact me Monday from a node...up to you.
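For the curious: a Kickstart file is just a text answer file for the installer. A minimal sketch of what such a server might hand out (all values here are illustrative, not the actual file):

```
# ks.cfg sketch for a minimal CentOS 6 compute node (illustrative values)
install
url --url=http://10.0.16.10/centos/6.6/os/x86_64
lang en_US.UTF-8
keyboard us
network --device eth0 --bootproto static --ip 10.0.0.2 --netmask 255.255.0.0 --gateway 10.0.16.10 --nameserver 10.0.16.10
rootpw --iscrypted <hash>
clearpart --all --initlabel
part / --size 51200
part swap --size 8192
part /scratch --size 1 --grow
bootloader --location=mbr
reboot
%packages --nobase
@core
%end
```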

hirokomatsui commented 8 years ago

Or I was thinking of modifying the GRUB boot loader. But installing on-site will probably be easier than doing all that research. I'm looking at a way to install CentOS from the hard drive, using an NFS-mounted ISO image. That way I don't have to push the DVD or USB into each node one by one.

tatarsky commented 8 years ago

Basically I'd like to chat when possible in the morning. I'm very familiar with the options and I think we should at least try a network install process tomorrow. I agree for 20 systems it is probably faster.

But (and this is why I want to talk) the trick is to make sure we coordinate the process of a PXE boot on one unit first and make sure it is correct. (Before we do the rest).

That is a bit tricky remotely as that datacenter is loud and you will need to do a network boot of a node which is usually in the BIOS POST process. But if we can attempt one the rest should go fairly fast.

I can start the process of setting up a kickstart server in the morning and when you get in perhaps we can do cn2 as a test. Sound reasonable?

tatarsky commented 8 years ago

One question for the morning:

There is a DHCP server on flc with some partial config for two hosts (cn1/cn2). I would like to shut that down (and use DHCP on the new kickstart server) but if you are aware of other systems that need DHCP on your backend network we should be careful.

I'll have the rest of the config ready by morning.
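A PXE-capable ISC dhcpd config looks roughly like this (a sketch; the MAC address and host entry are illustrative, not the actual config):

```
# dhcpd.conf sketch for PXE-booting one node into kickstart
subnet 10.0.0.0 netmask 255.255.0.0 {
  next-server 10.0.16.10;      # TFTP server handing out the PXE bootloader
  filename "pxelinux.0";
  host cn2 {
    hardware ethernet 00:11:22:33:44:55;   # illustrative MAC
    fixed-address 10.0.0.2;
  }
}
```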

tatarsky commented 8 years ago

I have incorporated your choices from the load on cn1 into the kickstart file. I've only changed the /home partition to be a /scratch partition, since we'll Lustre-mount the home directories.

tatarsky commented 8 years ago

Oh, and BTW once the dust clears a bit I am more than happy to explain the entire method involved. Just trying to get these nodes up before the holidays.

tatarsky commented 8 years ago

I am working on the DHCP component this morning, and from my reading of the nodes' motherboard manual, I believe I can walk you through a fairly simple PXE (network) boot. There may be a simple hot key to do a one-time boot from the network, or we may have to enter the BIOS to set the network card as the first boot device.

Regardless I am available most of the day although I may have to take a small break to attend an office party.

But I'd really like to try to get cn2 attempted via the method so the others could be done based on its results.

hirokomatsui commented 8 years ago

It's OK to stop the DHCP server on flc. Please let me know the directions for whatever I need to do.

tatarsky commented 8 years ago

OK. My suggestion is we chat for a few minutes because the process will involve a trip to the datacenter and doing a PXE boot on a node. Do you have time for that now?

Quickest to just call my desk: (608)-271-6817 or I can call you...

tatarsky commented 8 years ago

DHCP is now running on fl-hn1, which is the kickstart server. When you're ready to attempt cn2 from the console, it should be close to correct...

hirokomatsui commented 8 years ago

running to SDSC...