frazer-lab / cluster

Repo for cluster issues.

CentOS 7 efforts tracking #199

Closed tatarsky closed 7 years ago

tatarsky commented 7 years ago

This issue will track the plans for a phased move to CentOS 7 on the cluster.

Currently it will consist mostly of me checking on Lustre update options.

tatarsky commented 7 years ago

AHPC ticket opened to get Intel patches for either the latest IEEL 2.3 or to consider IEEL 3.X. Investigating the impact of moving to 3.X so it's one less thing to update later.

I believe we could deploy a CentOS 7.0 client anytime with the current distribution. (Note carefully that 7.0 is the first release of CentOS 7.)

tatarsky commented 7 years ago

Still working on details. The update may require a filesystem stop, but I'm still reviewing the release notes.

tatarsky commented 7 years ago

Moving to IEEL 3.X is a large project and probably more than we need at first. It involves a complete reinstall ("burn") of the Lustre servers to CentOS 7, and as such there is also risk. So I would view this step as "further" than we seek at this phase, and certainly not likely before we better implement some form of backup (see #200 and perhaps other discussions).

I have asked Intel what's the most recent CentOS 7 client I can run using our existing IEEL 2.3 branch of the code if fully updated, as I feel that makes more sense at the moment in terms of risk/reward.

hirokomatsui commented 7 years ago

Thanks. We'll go for CentOS 7 with IEEL 2.3.

tatarsky commented 7 years ago

Yep. 100% on the same page there. I'm just waiting for a response as to what "revision" of CentOS 7 we can do with patches. I believe we could deploy a CentOS 7.0 (initial release) now but want to see if I can get 7.2, as it's more refined than 7.0 was.

Note clearly that the Lustre servers remain on 6.X with IEEL 2.3, which I think is fine; there is no user access to those systems.

I'm likely going to propose that, as a first test, one of the old nodes gets kickstarted. But let's wait to see what Intel says....

tatarsky commented 7 years ago

Answering part of my own question by digging around in the CentOS archives.

The IEEL version we have now has client kernels that match CentOS 7.1.1503, which isn't too bad. So I will likely grab that ISO and prep a Kickstart area while I wait for the Intel response.
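For reference, a quick way to double-check which kernel the client packages expect (the RPM names below are illustrative, not the exact IEEL filenames; the kernel build is typically visible in the release string):

# Sketch: print name-version-release of the client module RPMs on hand
rpm -qp --queryformat '%{NAME}-%{VERSION}-%{RELEASE}\n' lustre-client-modules-*.rpm
# A 3.10.0-229.* kernel string corresponds to the CentOS 7.1.1503 series.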

tatarsky commented 7 years ago

We are discussing the details of IEEL release 2.4.X.X, which we believe we can hop to safely (more soon after I digest the release notes) and which provides CentOS 7.2 client-side support, the latest CentOS release that IEEL supports in any version. More details and a potential impact statement as I read.

Still working in parallel on 7.1 test system deployment. Probably not the fl-hn2 headnode yet though.

tatarsky commented 7 years ago

I am prototyping the CentOS 7 kickstart on old node cn12. I have removed it from the queues. Once I confirm a few more update-path items there I will start talking schedules.
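For the record, pulling a node out of scheduling is roughly this (SGE syntax; adjust queue/host names as needed):

# Disable every queue instance on cn12 so nothing new lands there
qmod -d '*@cn12'
# Confirm nothing is still running on it before the wipe
qstat -f | grep cn12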

tatarsky commented 7 years ago

Actually, a quick check with @hirokomatsui. I am always of the belief that nobody is leaving anything important on the nodes' /scratch areas. I show that to be the case on cn12 at the moment, but if anyone feels otherwise please inform me. cn12 will be completely erased after I eat lunch.

hirokomatsui commented 7 years ago

Yes, all the drives on the compute nodes can be cleared up.

tatarsky commented 7 years ago

OK. This is just to start some tests of "applications", as the old nodes don't really mount Lustre directly but can serve as places to see whether there will be considerable impact on needed software.

I am doing a few items in parallel.

tatarsky commented 7 years ago

Status only: working on some Puppet basics to configure the system.

tatarsky commented 7 years ago

Forking software validation conversation into #201

tatarsky commented 7 years ago

I will likely perform the first step of the backend Lustre server update today: the update of the management server and its IEEL software.

I believe this can be done without filesystem interruption but I do need to reboot the system and it currently also serves the SGE job spool. I believe the outage will be brief. And I don't show much activity in the jobs.

If you have something vital going on this morning where the above sounds too risky, please let me know.
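For the curious, my "activity" check is roughly this sketch (it just counts currently running jobs across all users):

# Count running jobs; the first two lines of qstat output are headers
qstat -u '*' -s r | tail -n +3 | wc -l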

tatarsky commented 7 years ago

I am about to reboot that system.

tatarsky commented 7 years ago

This is now done. The next steps involve one clarification question to Intel on whether the filesystem MUST be stopped and unmounted when redundant servers are involved. The release notes are not clear whether it's a hard requirement in that case, so I have asked for elaboration.

I have confirmed that CentOS 7.2 clients are fully supported in version 2.4.2.7 and am now grabbing that distro ISO to prep for additional tests. I may ask for one of the new nodes to be used for that test, so please be sure to check on #201 first in case we run into severe migration needs at the userspace level.
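(For my notes, grabbing and verifying the 7.2 media is roughly the following; the vault.centos.org paths and filenames are from memory, so treat them as illustrative and double-check before use.)

# Fetch the CentOS 7.2.1511 minimal ISO and its checksum list
wget http://vault.centos.org/7.2.1511/isos/x86_64/CentOS-7-x86_64-Minimal-1511.iso
wget http://vault.centos.org/7.2.1511/isos/x86_64/sha256sum.txt
# Verify just the ISO we downloaded
grep Minimal-1511 sha256sum.txt | sha256sum -c -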

hirokomatsui commented 7 years ago

That sounds good. I see cn12 is running fine with CentOS 7.1.

tatarsky commented 7 years ago

Yes, although that's not really a Lustre test. The old nodes are not on the 10G. As soon as I get one more Intel answer I will probably offline Node 14 or something (new nodes) for a CentOS 7.2 test.

cn12 I will likely move to CentOS 7.2 shortly just to be consistent.

Its main purpose is to give people a place to ssh to, per #201, and see what level of impact the upgrade would create for their modules and code.

tatarsky commented 7 years ago

My estimate for downtime, IF Intel confirms the filesystem must be shut down for upgrades regardless of redundancy, is a morning (4 hours). I can schedule it early my time if that becomes the case, but all users, jobs, and notebooks would have to exit. Again, this is an estimate not yet confirmed to be needed; noting it for possible discussion.

hirokomatsui commented 7 years ago

Your morning will probably be better for the shutdown. Please let me know when you schedule it, and I'll make sure to let people know.

tatarsky commented 7 years ago

Hoping Intel says "not really needed for redundant configs" like we have.

tatarsky commented 7 years ago

Intel has provided me a procedure, but with some caveats: they consider "shut down the filesystem" a more supported and safer course of action than rolling through the redundant pairs live.

I will read it again in the morning to see more fully whether I agree, but the way I'm reading it tonight, given that we don't really have backups, we should probably stick with the safer course of action for this step.

So folks might start mulling over a good morning for downtime in the next few days.

tatarsky commented 7 years ago

My "coffee" mulling is it would be wise given support and lack of backups to do this backend upgrade the officially supported way. That requires filesystem un-mount.

I can do this anytime in the next eight days, I will say. After that I am going on vacation (July).

I estimate a morning of complete downtime (everyone off the systems, all jobs exited, all notebooks closed).

tatarsky commented 7 years ago

Also, during my IEEL upgrade R&D I noted this will probably be our last "release" from Intel itself. The article mentions a release in May 2017 but I've not seen signs of it.

https://www.theregister.co.uk/2017/04/18/intel_loses_its_lustre_bins_ownbrand_hpc_filesystem/

So we won't be paying support on Lustre either next year.

So I'd like to get this nice and stable and in the future it will be the community path we take.

tatarsky commented 7 years ago

For my reference: I am monitoring this GitHub organization going forward, as they claim said release will end up there. https://github.com/intel-hpdd

hirokomatsui commented 7 years ago

What time in the morning are you available to deal with this? Do you have an estimate of how long the file system will be unmounted?

tatarsky commented 7 years ago

I can get up nice and early here at 6:00AM or something which is 4:00AM your time.

My estimate remains I would like 4 hours of downtime to complete the tasks I see outlined. (Basically four CentOS 6.6 -> 6.8 updates and the Lustre bump)

Assuming all goes well. I do not guarantee such a downtime, but from the steps I see that seems a reasonable time frame.

So it would return in theory around 10:00AM my time which is 8:00AM your time.

Make sense? I figure the morning is more desirable for folks due to the time zone differences.
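For reference, the OS portion on each Lustre server is roughly this sketch, with the filesystem already stopped and unmounted per the Intel procedure (the IEEL/Lustre bump itself follows the vendor docs and is not shown):

# Bring the server from CentOS 6.6 up to 6.8, then reboot into the new kernel
yum clean all
yum update -y
reboot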

tatarsky commented 7 years ago

Note clearly: none of this involves NODE-level changes. Aka, this is just the backend Lustre, folks. The CentOS 7 work requires a gradual process of node and head node changes. This is to give us a good solid Lustre foundation and the ability to run CentOS 7.2 clients.

tatarsky commented 7 years ago

I am willing to risk the "live" version, BTW, if this can't be done. But I would encourage the safer path due to the lack of actual backups. It's not worth risking that much data to avoid a four-hour window, in my opinion. Given the long uptimes, we really need a maintenance window more frequently in the future.

hirokomatsui commented 7 years ago

Do you mean it's still worth taking four hours specifically for this update?

tatarsky commented 7 years ago

I feel four hours is a very short downtime, yes. Well worth the lower risk and the official Intel "blessing" that it was done the supported way if there is trouble.

tatarsky commented 7 years ago

We BTW crossed 600 days of uptime on many systems today. That's a long time ;)

hirokomatsui commented 7 years ago

Could you or Intel recommend any safer way, given our situation of no backups, if we had a longer time frame?

tatarsky commented 7 years ago

Nope.

tatarsky commented 7 years ago

The main thing to remember is that this is a patch upgrade (we are remaining on 2.X Lustre), not a major version jump. So I am reasonably confident it should go well. But with ALL filesystems of any complexity there is always some chance of unexpected problems, and for that reason backups are always recommended, from basic disks up to the most complicated configs.
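(A simple sanity check before and after, which should report a 2.x Lustre version both times; run it on a client or server with the Lustre modules loaded.)

lctl get_param version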

If you would like to pause this process until we have better backups, that is also fine with me. See #200

tatarsky commented 7 years ago

One item of note: the secondary MDS server burped last night and was offline. (It's done this, I believe, twice in 600+ days, and it is one of the reasons I seek to update.)

So I did the OS portion of the process on it and timed it: it takes 37 minutes. Given there are now three servers left (roughly another two hours of OS updates), I feel my 4-hour estimate is sound and contains room for "issues".

Note that major issues would require me to contact Intel and that could easily extend out of the window.

But overall I remain pretty confident in the details of this from the prep work.

tatarsky commented 7 years ago

The backend is now updated to CentOS 6.8 with the latest 2.X patch release deployed. I feel this makes things more stable for proceeding with node CentOS 7 conversions. The process was a bit exciting, jumping multiple versions. Going forward, I think we should review the Lustre patch level perhaps every six months rather than after 600+ days.

tatarsky commented 7 years ago

OK. I think I have done this right, but please test first; then I will make it available from the JupyterHub config for testing, per an email request from @billgreenwald.

There is a c7.q queue with currently only one non-high-speed old node in it: cn12.

This is to test scheduler based pipeline items on CentOS 7 in at least some way.

qsub -l c7 somescript.sh

-- or --

qlogin -l c7

Remember not much I/O to that node. This list of nodes may expand as we move forward.
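A quick way to confirm a job actually lands on a CentOS 7 host (a sketch; the job name is arbitrary):

echo 'cat /etc/centos-release; uname -r' | qsub -l c7 -cwd -N c7check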

billgreenwald commented 7 years ago

If you are simply asking for a test that we can qsub to the c7 queue, I just tested a quick echo to a file and it worked, i.e. I did

echo " echo 'test' > test.test " | qsub -l c7 -cwd

and it made the file as expected and went to the correct queue.

If you need a more in depth test, could you clarify what you are looking for?

tatarsky commented 7 years ago

Nope, that's good enough (I did some others). I also wanted to "@" you to coordinate the Hub addition. What works for you on that addition we discussed? I show several hubs running and it will require a hub restart....

billgreenwald commented 7 years ago

I will check with people; potentially bringing it down tomorrow morning before people get in should work, but I will post again and confirm.

tatarsky commented 7 years ago

Sure. I have the config line added so basically next convenient restart....

tatarsky commented 7 years ago

(and thanks for coordinating the effort)

billgreenwald commented 7 years ago

Should be good to do in the morning

tatarsky commented 7 years ago

I will do so and do some tests myself! Thank you!

tatarsky commented 7 years ago

Hub restarted. I tested only the Python 3 notebook in my environment, which uses the source-built Python. Others may require testing and additions to the system Python on that OS (the system Python is version 2.7 in CentOS 7). But I believe most folks are using their own Python (Anaconda) and thus their own IPython engines. So just advise if you think system changes are needed or other debugging is required!

We can add some more C7 nodes if it helps.
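If you want to see what your own notebook environment will pick up on a C7 host, something like this (illustrative) from a shell on cn12 works:

# List the notebook kernels registered in your environment
jupyter kernelspec list
# Compare your PATH python against the C7 system python
python --version
/usr/bin/python --version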

tatarsky commented 7 years ago

Based on conversations in #201, I propose converting the fl-n-1-1 node to CentOS 7. It would be removed from the current "regular" queues and added to the test "c7" queue.

The purpose is to speed pipeline tests on a 10G-connected system. After those continued valuable tests, we can discuss the rate of final cutover and the choice between a single SGE instance and a split one off a converted head node (both have some merits... best discussed after pipeline unit tests).

tatarsky commented 7 years ago

Emailed for an update on the desire to move a fast node to C7. Noting for status only.

hirokomatsui commented 7 years ago

As many people are too busy to test C7, we are going to take another month, until the end of August, for the test. Could you re-schedule the update plan for the beginning of September? I'm expecting to remake some modules, and will ask you when I need help.

tatarsky commented 7 years ago

Sounds fine to me. I will mark September as the more likely start date (still fine if people are still busy then).

When you start thinking about the modules, I have some tricks in that arena to "overlay" additional search paths on C7 hosts (so you don't break the C6 usage during that process).
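A rough sketch of the overlay idea (the module tree path here is purely hypothetical, just to show the shape of it):

# Only prepend a C7-specific module tree when running on a CentOS 7 host,
# leaving the existing C6 MODULEPATH untouched everywhere else.
if grep -q 'release 7' /etc/centos-release 2>/dev/null; then
    export MODULEPATH=/path/to/modules-c7:$MODULEPATH
fi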

We can work on that anytime with the cn12 node. Just let me know and have a great Friday!