cBio / cbio-cluster

MSKCC cBio cluster documentation

Fuchs-Lab-wide group directory #368

Closed. aday00 closed this issue 8 years ago.

aday00 commented 8 years ago

Hi HPC Request,

Juan tells me to ask here for a Fuchs-Lab-wide group directory for software installs on the HAL cluster.

We have several libraries to install, including ATLAS 3.10.2, Boost 1.60.0, Boost.Python, glib, glog, gflags, leveldb, and others.

It would be really nice to install as many libraries as possible from RPMs (with modules to load/unload accordingly, which may be of HAL-cluster-wide interest), but if none of that's supported, we'll roll our own installs from source.

Thanks in advance! schaumba@mskcc

tatarsky commented 8 years ago

So per the above, I will compare your list against the installed RPMs. We do not make modules for RPM-installed items, since those are already in the default path. I will then compare against the list of existing modules.

Historically on this cluster most Python needs are handled with personal or group Anaconda trees.
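For reference, a personal or group tree is typically just the Anaconda installer run into a directory you own, something like this (installer filename and install location illustrative):

$ bash Anaconda2-x.y.z-Linux-x86_64.sh -b -p $HOME/anaconda    # -b batch mode, -p install prefix
$ export PATH=$HOME/anaconda/bin:$PATH
$ python -c "import numpy; print(numpy.__version__)"           # sanity check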

tatarsky commented 8 years ago

So from your request to what is currently available:

ATLAS 3.10.2 -> Atlas RPM 3.8.4 is loaded (-devel on the head node). Advise if you wish a module or will maintain it yourself. (See below for a comment on the dirs in /cbio/ski/fuchs.)
Boost 1.60.0 -> Boost 1.41 RPMs are loaded.
glib -> RPM glib2 2.22 is loaded.
glog -> Please confirm you mean https://code.google.com/archive/p/google-glog/. No RPM loaded; I can look.
gflags -> Please confirm http://gflags.github.io/gflags/. No RPM loaded; I can look.

Also, checking /cbio/ski/fuchs, I do not see that the usual dirs were made with your group ownership; I can correct that.

The convention is to make share, projects, and nobackup directories, but those were not made. I can do so if you wish to follow that convention, or just provide the names of the dirs you would like.
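For reference, following the convention amounts to roughly the following (group name illustrative):

$ mkdir /cbio/ski/fuchs/{share,projects,nobackup}
$ chgrp fuchslab /cbio/ski/fuchs/{share,projects,nobackup}    # group name assumed
$ chmod 2775 /cbio/ski/fuchs/{share,projects,nobackup}        # setgid so new files stay group-owned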

tatarsky commented 8 years ago

Oh, and for leveldb, same comment: no RPM. I assume you mean https://github.com/google/leveldb

tatarsky commented 8 years ago

I made the usual dirs for now. They are group owned and you can do whatever you desire there. If you wish me to build any of the above as modules, just advise.

tatarsky commented 8 years ago

Oh, and if you build things and find a need that an RPM would make easier, I can simply add it to puppet if it's in our repo list.

tatarsky commented 8 years ago

Just checking here. Are you good with your directories or do you wish items built?

aday00 commented 8 years ago

Thanks very much! The goal is to have a shared install of Caffe, which depends on Boost > 1.55 etc http://caffe.berkeleyvision.org/installation.html

Will follow up wrt dirs.


jchodera commented 8 years ago

This would be generally useful if it's OK to make your installation publicly available!

tatarsky commented 8 years ago

Happy to make a module pointing to a group maintained item if it helps.

I believe from the above statement from @aday00 that they do intend to maintain this software themselves, however, which I believe ends the need for my time to build items. So I can leave this open, or close it until you need a system module entry for your tree.

aday00 commented 8 years ago

Sounds like a Caffe install would help multiple groups, great! I'm happy to not maintain the software myself/ourselves, and believe modules can disentangle the various Boost versions etc so everyone can use this without any pipelines breaking.

On SuSE systems I used to manage, all this was installed to /opt/modules//... and switched via modules. I'm not familiar with cBio conventions, but I trust you all will do a fantastic job! :)


tatarsky commented 8 years ago

So to be clear, are you asking me to maintain this software?

Because then I just need an ack from @juanperin that he is OK with my time being spent on that, or one of his people's. I am an hourly contractor.

aday00 commented 8 years ago

This is my first time emailing this list. I don't know how billable hours work, and I'm not in a position to ask for work that costs money; in the end, if we just had a shared directory, that's probably fine, but I will follow up soon wrt dirs. I thought asking for software installs would not cost money. If you tell me what it costs, etc., I can forward that to my PI. Thanks! -A


tatarsky commented 8 years ago

Well, I speak only for my status as a contractor for MSKCC. Whether you are recharged or not is something I would ask @juanperin. I just note that if I'm doing the work there is indeed a bill, so I want to make sure before I start such work that he is aware of it and can choose other resources if desired. (There are other sysadmins on his team; I am a legacy concept from the SDSC days.)

If it were a simple build it's usually fine, but this looks a bit elaborate. I'll just wait to hear!

juanperin commented 8 years ago

We are adopting the same charging principles applied to all of our HPC resources, which charge monthly maintenance fees and storage use fees. Maintenance fees will include some basic level of support in getting help with permissions, small applications, dependencies, etc. But when installs or dependency issues become more time consuming and unique, we will provide an estimate of the time (when greater than 2 hours) in the form of a quote for larger software installations, etc. When applications do in fact affect more than half of the users on the cluster, these costs will be waived in the interest of the community. When the tools really satisfy only one lab or group, we ask that they try their best to do this in a shared directory.

If you find yourself having much trouble with Caffe, or other applications and want the HPC group to handle the maintenance, just let us know and we’ll give you guys a proper estimate. Otherwise for now it sounds like you have most of what you need to progress, so we’ll wait to hear back.

Thanks! -Juan


aday00 commented 8 years ago

Normal /cbio/ski/fuchs/{nobackup,projects,share} dirs work well, thanks! I think Thomas will meet Christina Leslie and John Chodera on Feb 2 or 3, to discuss the general utility of a shared Caffe install.


tatarsky commented 8 years ago

Sounds like a plan! Just let me know. Leaving open!

aday00 commented 8 years ago

I heard Christina, John, and Thomas support a shared Caffe install. Is this correct? Does it justify installation without charge? Thanks again! -Andrew


tatarsky commented 8 years ago

OK. I'm getting the nod to build this. However, I'd like to use a new set of software areas based on a GPFS area instead of the legacy "/opt" replications on the local drives. I will still, of course, make modules to handle the details and sub-builds.

Do you have any objection to my just choosing a location for the items to live in that situation?

tatarsky commented 8 years ago

Also, Caffe has a fair chunk of Python and they recommend Anaconda.

The convention on this cluster has been that users maintain their own personal Python and do various includes or builds themselves to add dependencies to it.

Are you wanting a tree-specific Python build for this software, or do users intend to merge their own trees as needed? If you don't know specifically what I refer to, perhaps a call at some point with the people who intend to use this.

tatarsky commented 8 years ago

To be very clear I intend to follow this suggestion for Python and assume the user has a reasonable personal version unless told otherwise:

To import the caffe Python module after completing the installation, add the module directory to your $PYTHONPATH by export PYTHONPATH=/path/to/caffe/python:$PYTHONPATH or the like. You should not import the module in the caffe/python/caffe directory!
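In other words, once it's built, a user with their own Python would do something like this (caffe install path illustrative):

$ export PYTHONPATH=/path/to/caffe/python:$PYTHONPATH
$ cd $HOME && python -c "import caffe"    # should succeed from anywhere except caffe/python/caffe
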
aday00 commented 8 years ago

No objections. An occasional status update here would be nice if it's not a hassle. Thanks a lot! It's tricky to get this install correct. For instance, as you know, ATLAS tunes itself during installation and has some instructions for disabling CPU throttling. There's also cuDNN to pull from NVIDIA via free membership, or if you're root you could grab it from (do not post paths). My bin dir has all the dependencies for Caffe, plus other stuff.

I can't make policy changes, and I'm not a Python expert, but Boost.Python is required for Caffe, I believe. I'm not sure how personal pythons interface with Boost.Python, but presumably one could pull in a module so a personal python could use the Boost.Python bindings. So personal pythons as usual I imagine would be fine.

There's also a Matlab interface to Caffe; it should just be a configuration option to enable Matlab support, and I know for sure there's a Matlab person in the lab here. Not sure about other labs. Thanks again very much!


tatarsky commented 8 years ago

I am extremely busy preparing for node updates. I will work on this but it will be awhile. Just setting expectations.

tatarsky commented 8 years ago

What branch of the Git source do you want? I show you built rc2.

tatarsky commented 8 years ago

I'm going to drop some notes as I chew through the prereqs while I do other things.

Boost: Matching your chosen version 1.60.0. Building.

ATLAS: We don't throttle on the HAL cluster. It's disabled, and the governor method isn't even in the kernel. Building to a prefix area.
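For the record, the prefix builds amount to roughly the following (prefix paths illustrative; exact flags may differ):

$ cd boost_1_60_0
$ ./bootstrap.sh --prefix=/cbio/shared/software/boost/1.60.0
$ ./b2 install

$ mkdir ATLAS-build && cd ATLAS-build                               # ATLAS must configure out of tree
$ ../ATLAS/configure --shared --prefix=/cbio/shared/software/atlas/3.10.2
$ make build && make install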

tatarsky commented 8 years ago

Protobuf: attempting from the EPEL RPM (2.3.0). May still use source, as I note you did.

tatarsky commented 8 years ago

Bleah. Giving up on the RPM. Making a protobuf module matching your version.

tatarsky commented 8 years ago

2.6.1 for now.

module add protobuf
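Quick sanity check once it's loaded (assuming the module puts protoc on the PATH):

$ module add protobuf
$ protoc --version    # expect it to report 2.6.1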

tatarsky commented 8 years ago

I'm going to take you up on that cuDNN offer. Can that be shared as a module, @aday00? I'm going to assume it can, but if that's somehow tied to a distribution agreement or something, can you let me know? Chunking along in general.

tatarsky commented 8 years ago

I guess another question, as cuDNN looks like they intend for it to layer into /usr/local/cuda7, would be: is that desired? For now I will keep it separate. Just trying to reduce the total LD_LIBRARY_PATH needs.
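Keeping it separate just means a module (or the user) points the compiler and loader at it, roughly like this (install location illustrative):

$ export CPATH=/cbio/shared/software/cudnn/include:$CPATH
$ export LIBRARY_PATH=/cbio/shared/software/cudnn/lib64:$LIBRARY_PATH
$ export LD_LIBRARY_PATH=/cbio/shared/software/cudnn/lib64:$LD_LIBRARY_PATH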

tatarsky commented 8 years ago

going with caffe-rc3

tatarsky commented 8 years ago

Calling it a day, but the main summary of items I could use an answer to:

  1. Is rc3 OK? I show you did rc2 (caffe itself).
  2. Do you need opencv, and if so, do you need a particular version? (You've got a few, but the CentOS RPM version is 2.X.) Disabled for now until I confirm, as it's additional effort.
  3. Misc other stuff above if I forgot ;)

While not all modules and items are done, you should be able to see some of the items building up in /cbio/shared/software/.

Modules are NOT done for some items. I've manually added items to the Caffe Makefile.

At the moment I'm hitting an NVCC float128 undefined error, which I know I've seen before, but it's been a long day so I will try some more tomorrow.

tatarsky commented 8 years ago

note to self: https://svn.boost.org/trac/boost/attachment/ticket/11852/float128GccNvcc.patch

tatarsky commented 8 years ago

Note to self as I've hit this. It appears the default gcc version causes this issue. https://github.com/BVLC/caffe/issues/1398

tatarsky commented 8 years ago

Used older gcc for now; will decide if I want to build a newer gcc module after some tests.

Also using the rather old Anaconda module as it has the needed python and numpy in a mode this Makefile prefers.

I have a binary that runs, but the questions I asked above need answers to decide if further items are needed in the build. I'll just cut/paste them here from above; I don't know if you use the Git web interface or are reading this via email:

  1. Is rc3 OK? I show you did rc2 (caffe itself).
  2. Do you need opencv, and if so, do you need a particular version? (You've got a few, but the CentOS RPM version is 2.X.) Disabled for now until I confirm, as it's additional effort.
  3. Misc other stuff above if I forgot ;)

tatarsky commented 8 years ago

I have made a first draft module (module load caffe) and I am running the test suite as a qlogin job. I am getting many "OK" outputs, but I have no experience whatsoever with this code beyond that.

I can see the answer to one of my questions, in that many of the example binaries seem to be opencv-based, so I am working on a rebuild with that integrated. But I could use a statement on the desired opencv version, as it will be a module as well.

Some items that were easier via RPM and version-consistent are being added to the nodes at runtime. I will verify shortly that this draft caffe binary executes minimally before I rebuild with opencv support.

tatarsky commented 8 years ago

From my runtest output on a node:

[----------] Global test environment tear-down
[==========] 1792 tests from 252 test cases ran. (482249 ms total)
[  PASSED  ] 1792 tests.

So I'm now waiting on my questions before I proceed. Note clearly the current caffe binary does not have opencv support.

tatarsky commented 8 years ago

I have rebuilt with an OpenCV 3.0.0 dependency. Please test. Remember you will likely wish to test via an interactive qsub to the gpu queue. Be sure to read and follow the handbook (link on the front of this Git) on proper GPU scheduling etiquette.

$ module load caffe
$ caffe
caffe: command line brew
usage: caffe <command> <args>

commands:
  train           train or finetune a model
  test            score a model
  device_query    show GPU diagnostic information
  time            benchmark model execution time

There are other binaries. I have no idea how to use the tool ;)
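For a quick smoke test on a GPU node, something like this should work (queue name and resource syntax assumed; see the handbook):

$ qsub -I -q gpu -l nodes=1:ppn=1:gpus=1
$ module load caffe
$ caffe device_query -gpu 0    # should print diagnostics for the allocated GPU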

tatarsky commented 8 years ago

And an item I'm going to state for my reference should we have issues:

The Anaconda module was used for this build. I have a feeling, from ldd, that it has a few older versions of items in its trees, including opencv 2.X. While I used opencv 3.0.0 for the build, I still see that in the ldd output.

So if there are issues found in that region, I have a feeling a newer Anaconda module is needed. As I have no idea who uses the Anaconda module I would do a fresh one and then rebuild this.

So advise when possible and I'll look at whether that is still needed.
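To see what I mean, the check is along these lines (assuming the module puts the caffe binary on the PATH):

$ module load caffe
$ ldd $(which caffe) | grep -i -E 'opencv|hdf5'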

tatarsky commented 8 years ago

Hold on a bit on this. I have discovered that the Anaconda libraries also drag in TCL which then breaks the module command ;) So I'm trying another way to build.

tatarsky commented 8 years ago

OK. I believe it's back to usable without breaking module itself, and it's also using opencv 3.0.0 and the HDF5 module directly, so I'm more confident it's closer to the latest and greatest.

Re-running the test suite to confirm that at least it feels it's healthy ;)

aday00 commented 8 years ago

Thanks very much! The Fuchs lab is now testing this Caffe; will follow up.

tatarsky commented 8 years ago

Cool. I stress I've made a best effort to use the proper settings and rebuilt items where I felt it wanted something newer than what was already out there. I can review any of those decisions with you, but I suspect that would make more sense verbally.

Let me know!

aday00 commented 8 years ago

My pipelines prefer opencv 2.4.9, but opencv 3.0 is the way forward. Hopefully my 2.4.9 stuff doesn't break, haven't tested.

aday00 commented 8 years ago

In the past, whenever there's an opencv version conflict, I put the 2.4.9 binaries in the current working directory and it's fine.

aday00 commented 8 years ago

Some people prefer opencv 3.0, so I'd say stick with that until it's ever an issue. Thanks again! Super efficient.

tatarsky commented 8 years ago

If it does break, I'm happy to make an opencv 2.4.9 module and recompile. I just noted that the system RPM version was 2.0, which was not supported. I will be making a formal module for the opencv 3.0.0 that I did. It's currently located at /cbio/shared/software/opencv/3.0.0, but I'll add a module for it for separate use.
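Once the module exists, separate use would look roughly like this (module name assumed to mirror the install path):

$ module load opencv/3.0.0
$ echo $LD_LIBRARY_PATH | tr ':' '\n' | grep opencv    # confirm 3.0.0 is what gets picked up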

You are welcome. Was an interesting build.

aday00 commented 8 years ago

Really, you're tremendous. And opencv 3.0, even 3.1.0, is the way forward. If your interest persists, opencv isn't actually complete: there's a contrib component too, and the contrib build requires a rebuild of opencv; it's pretty tight integration: https://github.com/itseez/opencv_contrib

tatarsky commented 8 years ago

Yeah, I actually gleaned large parts of my module layout from that doc. The main issue is the SDSC built modules for some items are getting pretty old. And the Anaconda one was linked to older OpenCV and some older HDF5. So I'm probably going to rev a few more items in the end.

I'm also fixing the default gcc we use, as I did note that a bug in the compile was fixed with 4.8.5. More on this will appear as open Git issues here in this repo, so turn on that watch flag.

Appreciate the thanks.

aday00 commented 8 years ago

If you choose to install opencv_contrib also, which would help me (and perhaps others), I'll note the opencv and opencv_contrib release versions should match. So if opencv is 3.0.0, opencv_contrib should also be 3.0.0; opencv_contrib 3.1.0 expects some headers etc. not present in opencv 3.0.0, for example.
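For reference, the contrib modules get folded in by rebuilding opencv with the extra-modules path pointed at a matching checkout, roughly like this (paths and prefix illustrative):

$ cd opencv-3.0.0 && mkdir -p build && cd build
$ cmake -D CMAKE_INSTALL_PREFIX=/cbio/shared/software/opencv/3.0.0 \
        -D OPENCV_EXTRA_MODULES_PATH=../../opencv_contrib-3.0.0/modules ..
$ make -j8 && make install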

tatarsky commented 8 years ago

OK. I'll take a look at that. Do your tests, and when it's looking at least mostly on the right track I'll rev the opencv as stated above and rebuild caffe with it.