frazer-lab / cluster

Repo for cluster issues.

CentOS 7 userspace software checks #201

Closed tatarsky closed 7 years ago

tatarsky commented 7 years ago

This Git issue will track the needs and changes to the environment from a CentOS 7 system. Several "system" versions of tools are updated in CentOS 7 and may obsolete some module needs.

As I prepare a system cn12 for some initial tests people can shortly ssh to the system (no SGE queues or similar at the moment) to run basic tests. I will update this then with any needed rebuilds or module work.

The /frazer01 Lustre is NFS mounted as it is on all old nodes. Do not expect high performance.

tatarsky commented 7 years ago

If people want to ssh from a head node to cn12 and report any missing or non-functional items feel free. I'll be dealing with many small items at the RPM perl level where many items changed name, but I'm plugging along.

Of greater interest would be software in modules that no longer execute and would require CentOS 7 native builds to proceed.

tatarsky commented 7 years ago

Is there any feedback on this system/CentOS 7 or do we need a larger node converted?

joreynajr commented 7 years ago

Hi Paul,

I'll start testing some of our pipelines and get back to you about the results.

joreynajr commented 7 years ago

On that note I think it would be a good idea for everyone who is maintaining frequently used pipelines to try them out. @frazer-lab/members

tatarsky commented 7 years ago

Cool. Just remember CN12 is an "old node" and as such really only has 1Gbit of bandwidth to the filesystem. So don't compare speeds of I/O ;)

If we need a Lustre 10Gbit connected node converted we can pursue that but wanted to get basic checks first for any "show stoppers"

joreynajr commented 7 years ago

Good reminder. Thanks Paul!

joreynajr commented 7 years ago

Actually I do have a question. Is there a way of specifically choosing cn12 as the host of our job?

joreynajr commented 7 years ago

Nevermind. I just re-read the previous messages. I'll ssh and run the jobs manually. Should be easier to track this way as well.

tatarsky commented 7 years ago

Yep. Once I get the Lustre backend updated I will start on SGE for CentOS 7. Which has some "details" we'll want to discuss. Basically if you find things missing just add them to this Git and we'll go from there. I did a basic "Puppet" rig for the C7 effort and can extend it, saving us time for the additional nodes.

tatarsky commented 7 years ago

Thanks BTW for testing!

tatarsky commented 7 years ago

I am working on the details for SGE and C7.

tatarsky commented 7 years ago

CN12 will be moved to CentOS 7.2 shortly, to provide a more exact reference system for the update path now that the backend is done with its upgrades, which include kernels for that version.

tatarsky commented 7 years ago

This update is now in progress. Avoid cn12 ;)

tatarsky commented 7 years ago

Cn12 is now the same C7 version as the updated backend supports. Should be no huge changes but one less "difference" as we proceed. I will see if I can get a queue for it to test some SGE.

tatarsky commented 7 years ago

Probably should have placed the update I just did in #199 here. But basically there is a c7 queue now.

qsub -l c7 foo.sh

It's only cn12 for now. But may help debug scheduled items on c7. Advise of problems.
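
For anyone driving pipeline submissions from a script, here is a minimal sketch of wrapping the `qsub -l c7` request shown above. The helper names are hypothetical, and the job-id parsing assumes the usual SGE confirmation line of the form `Your job 12345 ("foo.sh") has been submitted`; check that against what this cluster actually prints.

```python
import re
import subprocess

# Assumed SGE confirmation format: 'Your job 12345 ("foo.sh") has been submitted'
_JOB_ID = re.compile(r"Your job(?:-array)? (\d+)")


def c7_qsub_command(script, resource="c7"):
    """Build the argv for submitting `script` with the c7 resource request."""
    return ["qsub", "-l", resource, script]


def parse_job_id(qsub_output):
    """Return the job id from qsub's confirmation text, or None if absent."""
    match = _JOB_ID.search(qsub_output)
    return int(match.group(1)) if match else None


def submit(script):
    """Submit `script` to the c7 queue and return its job id (needs qsub on PATH)."""
    result = subprocess.run(
        c7_qsub_command(script), capture_output=True, text=True, check=True
    )
    return parse_job_id(result.stdout)
```

Calling `submit("foo.sh")` from a head node would then return the job id to track with `qstat`.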

joreynajr commented 7 years ago

Hi Paul,

I ran two of my pipelines to check for any broken code. Below I'm going to list them and what happened during execution (Note: many programs ran without errors but I figured it would be good to list everything):

bedGraphToBigWig: Errors. There is a library named libpng12.so.0 that is having an issue. The full error message is:

bedGraphToBigWig: error while loading shared libraries: libpng12.so.0: cannot open shared object file: No such file or directory

rsem (rsem-calculate-expression): Errors.

Can't locate Env.pm in @INC (@INC contains: /frazer01/software/rsem-1.2.20 /frazer01/home/joreyna/software/vcftools_0.1.13/perl/ /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at /frazer01/software/rsem-1.2.20/rsem-calculate-expression line 10.
BEGIN failed--compilation aborted at /frazer01/software/rsem-1.2.20/rsem-calculate-expression line 10.

tatarsky commented 7 years ago

This is good news! I'm looking at "compat" versions of libpng on the system first so you don't have to recompile (which would then break the binary for C6)

Do you have the full paths to your bedGraphToBigWig binary? I can search for it but if you have it handy an "ldd" on it will confirm/deny if I've found the right compat lib.

The perl one I will see what the options are for that. I assume the system perl from that @INC.
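
A small sketch for triaging this kind of failure across pipeline logs: it pulls the missing module name out of a "Can't locate ... in @INC" message and builds a one-liner that checks whether the system perl can load it. The function names are illustrative and not part of any existing tooling here.

```python
import re

# Matches perl's standard "Can't locate Foo/Bar.pm in @INC" diagnostic.
_MISSING = re.compile(r"Can't locate (\S+)\.pm in @INC")


def missing_perl_module(stderr_text):
    """Return the module name (e.g. 'File::Slurp') from a perl load error, or None."""
    match = _MISSING.search(stderr_text)
    if match is None:
        return None
    # perl reports the file path form; convert it to the package form.
    return match.group(1).replace("/", "::")


def perl_check_command(module):
    """Build an argv that exits non-zero if `module` cannot be loaded."""
    return ["perl", "-M" + module, "-e", "1"]
```

Running the returned command on cn12 and on a C6 node shows quickly whether the module is missing only on the C7 side.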

tatarsky commented 7 years ago

Loaded libpng12 (on cn12/C7 box only) which should provide that older version of the library for backwards compat. Confirm it does or full path to binary and I'll look more closely. (Will add to puppet after you confirm)

If you recompile be aware that the -devel kit on C7 is libPNG 1.5.13.

Looking for Env.pm...

joreynajr commented 7 years ago

The full path to bedGraphToBigWig is: /software/ucsc.linux.x86_64.20151103/bedGraphToBigWig

and the ldd gave me:

        linux-vdso.so.1 =>  (0x00007ffee2f3e000)
        libkrb5.so.3 => /lib64/libkrb5.so.3 (0x0000003f1d600000)
        libm.so.6 => /lib64/libm.so.6 (0x0000003f18600000)
        libpng12.so.0 => /usr/lib64/libpng12.so.0 (0x0000003f22e00000)
        libz.so.1 => /lib64/libz.so.1 (0x0000003f19200000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003f18a00000)
        libc.so.6 => /lib64/libc.so.6 (0x0000003f18200000)
        libk5crypto.so.3 => /lib64/libk5crypto.so.3 (0x0000003f1ee00000)
        libcom_err.so.2 => /lib64/libcom_err.so.2 (0x0000003f1c200000)
        libkrb5support.so.0 => /lib64/libkrb5support.so.0 (0x0000003f1ea00000)
        libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x0000003f1ba00000)
        libresolv.so.2 => /lib64/libresolv.so.2 (0x0000003f1a200000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003f17e00000)
        libselinux.so.1 => /lib64/libselinux.so.1 (0x0000003f19a00000)
        libdl.so.2 => /lib64/libdl.so.2 (0x0000003f18e00000)

tatarsky commented 7 years ago

Yep, I think you are good to go now that I've filled in that "libpng12.so.0 => /usr/lib64/libpng12.so.0 (0x0000003f22e00000)" part ;)

Tell me if the binary actually works.
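
For others checking their own binaries, a rough sketch of automating the ldd check: it scans ldd output for entries marked "not found", which is how ldd reports a shared library the loader cannot resolve. The function names are made up for illustration.

```python
import subprocess


def missing_libraries(ldd_output):
    """Return the sonames that ldd reported as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            # Lines look like: "\tlibpng12.so.0 => not found"
            missing.append(line.split("=>")[0].strip())
    return missing


def ldd(binary_path):
    """Run ldd on a binary and return its text output (needs ldd on PATH)."""
    result = subprocess.run(
        ["ldd", binary_path], capture_output=True, text=True, check=False
    )
    return result.stdout
```

For example, `missing_libraries(ldd("/software/ucsc.linux.x86_64.20151103/bedGraphToBigWig"))` should come back empty on a node where every dependency resolves.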

joreynajr commented 7 years ago

I'm going to give it a try. Be right back :).

tatarsky commented 7 years ago

I believe I also added the perl-Env package. No rush. I'm catching this during a brief break before the next leg of my roadtrip. Just wanted to keep the ball rolling and note my appreciation of the pipeline tests!

joreynajr commented 7 years ago

Actually I will have to get back to you tomorrow. Unfortunately the script that uses bedGraphToBigWig also deletes its input which takes some time to generate :/. On the bright side I will shoot off the entire pipeline using the new queue which removes the hassle of checking each manually.

tatarsky commented 7 years ago

All good. I will be back online most likely tomorrow night and I'll check on things! I appreciate the testing. The more of this we do the less chaos the C6->C7 efforts will be when I return. The goal is NOT to force everything into "recompile" mode. I've had pretty good luck with that in a few other migrations particularly where C5 (CentOS 5) was never in the mix.

joreynajr commented 7 years ago

The lab is producing a lot of data that requires these pipelines so I really need to keep them going. Thanks for your help. I definitely would have been stumped by libpng12.so.0 => /usr/lib64/libpng12.so.0 (0x0000003f22e00000) :P. I'll get back to you tomorrow and keep enjoying the road trip!

joreynajr commented 7 years ago

Hi Paul,

I found a mistake with how I was running the pipeline. Sorry but I won't be able to get back to you until tomorrow.

tatarsky commented 7 years ago

Absolutely no rush from me! Greets from Montreal!

joreynajr commented 7 years ago

Hi Paul,

Seems like RSEM is now working. Not sure if this is just an anomaly or you got a chance to fix the Perl issue. There is one more bug I've found in my pipeline dealing with an R error.

ERROR 2017-07-06 17:21:37 ProcessExecutor /software/R-3.2.2-cardips/lib64/R/bin/exec/R: error while loading shared libraries: libicuuc.so.42: cannot open shared object file: No such file or directory.

Thanks for the help!

tatarsky commented 7 years ago

In transit home. Will look shortly.

tatarsky commented 7 years ago

This library version (.42) is no longer available in CentOS 7 as a "compat" library. C7 supports a .50 version only.

So I will begin something I do to handle such things in a transition without a re-compile (a special "old library" ldconfig area) today. Wait for the all clear on that before re-test. Give me a moment to catch up from vacation and I should have that for you this morning.

In the end we will want to re-compile but that would break it for C6. I will document items I know are using the older libraries for when we are "C6 free"

tatarsky commented 7 years ago

OK. This workaround I describe is in place for the .42 libicu libraries on cn12 and will be in puppet shortly for all migrated systems.

The above binary ldd is now clean. Can you confirm it works properly? (basic test of running that R worked...)

joreynajr commented 7 years ago

Hi Paul,

I'm trying it out. I'll let you know the results soon.

tatarsky commented 7 years ago

Sounds good. We should probably consider a "fast node" conversion to C7 for you shortly. Once we iron out a few of these shared lib issues, that would make it easier for others to test with real data.

joreynajr commented 7 years ago

I think that would be a good idea. There are other people with pipelines that might break during the conversion. @hurleyLi are you going to test your WGS, ChIP-seq and Hi-C pipelines?

tatarsky commented 7 years ago

Yep. Basically the way I usually do this is with a small handful of C7 systems so those that want to test can get faster results. Then we convert a head node. Probably fl-hn2.

Then we have to decide the rate of cutover of the "main queues" compared to trying to make "C7" versions of them. Sometimes I make a separate SGE "cluster" for this process. But given the relatively small number of nodes that may not make sense. We'll probably schedule a chat for this decision.

But you are doing the frankly critical first step of seeing if anything is going to really badly break in C7. I'm going to be surprised if that is the case but well worth small system count tests in my experience.

If it works (even slowly) on one system it's going to work on more systems.

joreynajr commented 7 years ago

That did the job! Thanks Paul!

tatarsky commented 7 years ago

Cool. So basically the same methods I used last C6->C7 seem to be OK so far. The main obstacle in clusters is that binaries from really old distros like C5 often require libc changes to function. I don't think we have any of that, or binaries compiled for 32-bit architectures.

I'll proceed on more C7 nodes for test per #199. The more validations we do, however, the less painful the transition will be.

joreynajr commented 7 years ago

I think you're right about migrating more nodes. It can take a while to test on a single node and people in the lab prioritize their projects over testing code (which is understandable but makes the C6 -> C7 migration more difficult). Overall, migrating more nodes makes testing more accessible and easier to monitor.

joreynajr commented 7 years ago

We still need to check up with people but I don't think anyone is running any large pipelines as of right now.

tatarsky commented 7 years ago

Sounds good. I'll assume my proposal to convert fl-n1-1 to C7 in #199 is OK if I don't hear objections soon. Won't get to it today though...fighting some issues elsewhere.

hirokomatsui commented 7 years ago

Hi Paul, I'm wondering how we update the directories: /software and /repos

I think ideally we would move them and create completely new ones for CentOS 7. Maybe I can ask people which software and module files they will be using, so people can use C7 more smoothly.

tatarsky commented 7 years ago

So basically I encourage the same basic sort of tests already being done. In most of the C6->C7 migrations I've done, not too many C6 items broke completely on C7. And we were able to migrate only the items that broke and continue using the existing binaries for the ones that didn't.

This greatly eases the process. But requires testing similar to the efforts above.

Not sure why you would alter repos....do you have a reason for that?

I can also however configure the C7 modules package to look in a different location for modules if we find several that do not work.
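
To illustrate the idea, a sketch of picking a module location by OS release. It reads the major version out of a release string such as the contents of /etc/redhat-release and maps it to a directory that could be placed on MODULEPATH, the search path Environment Modules uses. Both directory names below are hypothetical placeholders, not the real layout.

```python
import re

# Hypothetical module roots; the actual locations would be decided later.
MODULE_ROOTS = {6: "/software/modules", 7: "/software/modules-c7"}


def major_release(release_text):
    """Extract the major version from text like 'CentOS Linux release 7.2.1511 (Core)'."""
    match = re.search(r"release (\d+)", release_text)
    return int(match.group(1)) if match else None


def module_root(release_text, roots=MODULE_ROOTS):
    """Return the module directory for this release, or None if unmapped."""
    return roots.get(major_release(release_text))
```

A login script could read /etc/redhat-release, call `module_root`, and prepend the result to MODULEPATH only on C7 nodes.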

hirokomatsui commented 7 years ago

OK that sounds good. I thought repos have to be updated if we move /software, but it doesn't matter if we don't have to move.

Could you remind me of your schedule for the update? I think many people are not responding much on this topic since they are busy with their papers. I will let them know again.

tatarsky commented 7 years ago

My schedule I view as controlled by your group's schedule. Aka if you are busy, I don't wish to make it more busy. I am happy to go as fast as desired or migrate carefully when time permits. I am around now solidly until the end of the year. The backend WAS upgraded which was a major effort.

But I will encourage folks to not "linger" with a mixed environment for too long once we convert more than 1-2 nodes. Once we start moving fast nodes to C7 I would encourage folks to rapidly test and then we convert more.

I am happy to have another call on the next steps. But the major win comes from testing on a handful of nodes FIRST so the level of surprise is smaller....

hirokomatsui commented 7 years ago

Currently the heaviest user of the cluster is me, wanting to finish the analysis in a week or so. I'll remind people again, then we can start updating other nodes next week?

tatarsky commented 7 years ago

Sounds perfect! My goal is low impact migration like the last few I've done and I'm just repeating the playbook from those.

tatarsky commented 7 years ago

@hirokomatsui per email just drop me a note when you are done with the analysis and want to consider update of some fast nodes for further testing. I'm noting this here just for status.

hirokomatsui commented 7 years ago

Just in case you care, I've installed the following 12 packages (all related to X) on cn12 to build SpeedSeq on it:

expat-devel-2.1.0-8.el7.x86_64.rpm
fontconfig-2.10.95-10.el7.x86_64.rpm
fontconfig-devel-2.10.95-10.el7.x86_64.rpm
libX11-1.6.3-3.el7.x86_64.rpm
libX11-devel-1.6.3-3.el7.x86_64.rpm
libXau-devel-1.0.8-2.1.el7.x86_64.rpm
libxcb-devel-1.11-4.el7.x86_64.rpm
libXext-devel-1.3.3-3.el7.x86_64.rpm
libXft-devel-2.3.2-2.el7.x86_64.rpm
libXpm-devel-3.5.11-3.el7.x86_64.rpm
libXrender-devel-0.9.8-2.1.el7.x86_64.rpm
xorg-x11-proto-devel-7.7-13.el7.noarch.rpm

I'll re-make it on the head node once it's updated. I expect those packages will be installed on the head nodes but not on the compute nodes.

tatarsky commented 7 years ago

Fully understood and will save that list for the puppet manifest.
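
As a sketch of how a list like that can be turned into bare package names for a manifest, the snippet below splits RPM file names of the usual name-version-release.arch.rpm form. The function names are illustrative, and how the names would be fed into puppet is left open.

```python
def parse_rpm_filename(filename):
    """Split 'name-version-release.arch.rpm' into (name, version, release, arch)."""
    stem = filename[: -len(".rpm")] if filename.endswith(".rpm") else filename
    # The architecture is the last dot-separated field.
    nvr, arch = stem.rsplit(".", 1)
    # Version and release are the last two dash-separated fields;
    # the package name itself may contain dashes.
    name, version, release = nvr.rsplit("-", 2)
    return name, version, release, arch


def package_names(filenames):
    """Return the sorted, de-duplicated package names from RPM file names."""
    return sorted({parse_rpm_filename(f)[0] for f in filenames})
```

For instance, `parse_rpm_filename("libXrender-devel-0.9.8-2.1.el7.x86_64.rpm")` gives `("libXrender-devel", "0.9.8", "2.1.el7", "x86_64")`.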

hirokomatsui commented 7 years ago

The packages don't have to be installed on the compute nodes, since they are needed only to make SpeedSeq which will be done on the head node.