cossatot / ssrd_pecube

Pecube model of the South Snake Range Detachment

Super Computer? #7

Closed eqsarah closed 9 years ago

eqsarah commented 10 years ago

So my advisor was just telling me that we have a supercomputer (I know very few details) here on campus that is severely underutilized. If we could get remote access, do you think it would be possible to use it for the modeling (for free?) rather than renting time on Amazon servers? If you think this might be a possibility, I'll contact the computer science department to look into it more.

cossatot commented 10 years ago

That would be awesome. It looks like the project is impressive but languishing a bit... I can't find much recent news about it.

From what I can see, the guy to talk to is: Ron Young (ron.young@nscee.edu), website here.

You can tell him that you have a relatively lightweight FEM code that you want to run for tens to hundreds of thousands of embarrassingly parallel simulations. Each one takes 10 minutes to an hour depending on parameters. The code is in Fortran.

eqsarah commented 10 years ago

Good news! I emailed Ron Young, and got this response:

The limiting factors that control us are:

- how much disk space you will need
- how many cores you will need to run concurrently

I am not sure of the answers to his questions, but I bet you know them. If the numbers are right, he'll meet with me very soon to discuss how I'll use the computer to run the simulations.

cossatot commented 10 years ago

Sarah,

This is great news.

We don't need a lot of disk space; I would say 200 MB max. We won't store any intermediate files, just the output ages (which are just little text files).

As for concurrent cores: well, as many as we can get away with. The minimum number of cores we would need at once is 1, since each simulation runs independently. That being said, if we want to do this right, there might be something like 1,000-10,000 hours of processing (especially if we want to try different thermal parameters, etc.). Now, I can try to decrease the resolution of the model further (I already did a significant amount) and maybe cut part of it off to make runs quicker, but it would be better if we didn't do that. There are also some other games we can play with sampling the model space (the values for the variables) to cut this down if we need to, but we have a lot of variables, and robustly testing things really just means a lot of computation.

The more cores we have, the sooner things get done. But the model isn't directly a multiprocessing kind of thing, so we don't need something like 100 cores all the time. But if we could have that, it would be fabulous...
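
For a rough sense of scale, here's a back-of-envelope calculation (the 100-core allocation is an assumption on my part, not anything the cluster has offered):

```python
# Back-of-envelope: wall-clock time for the pessimistic end of the
# 1,000-10,000 core-hour estimate, assuming ~100 concurrent cores.
total_core_hours = 10000
concurrent_cores = 100
wall_clock_days = total_core_hours / concurrent_cores / 24.0
print("~{:.1f} days wall clock".format(wall_clock_days))  # ~4.2 days
```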

eqsarah commented 10 years ago

Great! If we are going to spend all this time working out the model, then I think we should do it right and not skimp (which I can tell is also your mindset).

I emailed the guy again to see how many cores we could have at a time, and hopefully there will be multiple available. He responded rapidly last time; hopefully it will be the same this time.

eqsarah commented 10 years ago

I heard back from Ron. He sent this link: http://wiki.nscee.edu/index.php/Main_Page and said this about running on multiple cores:

I would think you would probably be using the compute-medium queue. Note: the 256-core maximum is per process/job. Since you are only using 1 core per simulation, you can load several hundred jobs into the queue and let the system schedule them.

Sounds like we get hundreds of cores if we want them? I haven't had a chance to peruse the wiki page yet.

cossatot commented 10 years ago

Oh hell yeah!

Should we model NSRD too? Jeff Lee's got some data there, right? I'm only half joking.

eqsarah commented 10 years ago

I know you're half joking, but how much longer do you think it would take? It's worth a thought... it would make a good addition to the paper to be able to compare the two, right?

Maybe it's too much. I have a hard time reining myself in sometimes.

cossatot commented 10 years ago

No shit... that's probably why I woke up at 4:30 this morning to start working. Not by choice but by stress.

It wouldn't take that much longer. Like another week of work, maybe, for the modeling. Especially if we keep the same thermal parameters, etc. This is why computers are cool; they automate things. Once the code is written, all the work is in getting the data and writing the paper. If it would really increase the impact of the paper (NSRD is the famous one, right?) then we should do it.

Also, what's your time frame here? I mean ASAP as always, but are you under pain of death to get this out by April/May or anything?

eqsarah commented 10 years ago

ASAP, but no pain of death. The sooner the better; it's such low-hanging fruit that I don't want to get scooped. It'd be great to be done before the start of the field season (late May to mid June) and have it out to my advisor for revisions, but that is not an absolute. Right now it's being limited by me not getting out the thermal parameters (still looking for a reasonable source for Moho temperature; about to say screw it and go with 1200-1300 °C, fairly typical values) and dealing with the synextensional magmatism, so it's my fault that it's dragging right now.

And yes, the NSRD is the famous one that has been a source of contention since the eighties, with more papers dealing with the contention published within the last several years (since 2010).

eqsarah commented 10 years ago

I have a short meeting with the cluster guy on Monday to talk about how to run the simulations. He works fast, for sure!

Is there anything that you would like me to specifically mention or ask questions about? Also, I think it might be good if I have some example files I want to run to take with me.

Let me know what you think. And if you think there is enough information for us to have an "in-person" conversation about this via telephone or Skype, let me know; it might be easier than writing text back and forth, so that I can really understand where I have questions and what exactly I need to convey.

cossatot commented 10 years ago

Sarah,

I'm not sure if example files will really be of much use; they're very tailored to the system they run on.

The basic run-down of the operation, as it currently stands (using Amazon's AWS servers), is this (a rough sketch in code follows the list):

  1. A Python script (run on a local machine) goes through the lists of variables/parameters and comes up with a list of all unique combinations of variables that satisfy our constraints.
  2. Then, for each combination, the script:
     - starts an Ubuntu Linux instance on the AWS cloud, with Pecube and all the data files pre-installed;
     - modifies the Pecube input files for that combination's variables on the instance;
     - runs Pecube;
     - reads the text output files from Pecube and saves them as Python objects on the local machine (these can also be saved as CSV files or whatever).
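
In code, the gist is something like the following minimal sketch (not the actual script: on AWS the per-combo steps happen on a remote instance, and the parameter names, input-file format, and output file name below are all illustrative assumptions):

```python
# Minimal local sketch of steps 1 and 2. Parameter names/values, the
# input-file format, the "pecube" command, and "ages.txt" are illustrative
# assumptions; in the real setup the runs happen on remote AWS instances.
import itertools
import os
import subprocess

fault_dips = [30.0, 40.0, 50.0]   # degrees (illustrative)
slip_rates = [1.0, 2.0, 4.0]      # mm/yr (illustrative)
moho_temps = [1200.0, 1300.0]     # deg C (illustrative)

def satisfies_constraints(dip, rate, temp):
    """Hypothetical stand-in for our geologic constraints on combinations."""
    return True

# Step 1: all unique parameter combinations that pass the constraints.
combos = [c for c in itertools.product(fault_dips, slip_rates, moho_temps)
          if satisfies_constraints(*c)]

# Step 2: for each combination, write inputs, run Pecube, keep the ages.
results = {}
for i, (dip, rate, temp) in enumerate(combos):
    run_dir = "runs/run_{:05d}".format(i)
    os.makedirs(run_dir, exist_ok=True)
    with open(os.path.join(run_dir, "Pecube.in"), "w") as f:
        f.write("{} {} {}\n".format(dip, rate, temp))  # not the real format
    subprocess.run(["pecube"], cwd=run_dir, check=True)
    with open(os.path.join(run_dir, "ages.txt")) as f:  # assumed output name
        results[(dip, rate, temp)] = f.read()
```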

I have never worked with a cluster, so I don't know exactly how they work. Here are the questions that I have:

How can we compile the Pecube code (which is in Fortran) on their cluster? (You may have to work directly with him on this. In general, Pecube builds with a makefile that we modify to point at the locations of the gfortran and gcc compilers.)

How can we install the necessary Python packages?
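
My guess for this one, assuming pip is available on their system and that we won't have root (both assumptions), is a per-user install:

```python
# Guess at package setup: install into the per-user site directory since
# we presumably lack root on the cluster. pip availability and the package
# list are assumptions; NSCEE may have its own module system for this.
import importlib
import subprocess
import sys

for pkg in ["numpy", "matplotlib"]:   # stand-ins for whatever we need
    try:
        importlib.import_module(pkg)
    except ImportError:
        subprocess.run(
            [sys.executable, "-m", "pip", "install", "--user", pkg],
            check=True,
        )
```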

Then, basically, how would we do something similar to the above workflow, or at least accomplish the same goals?
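
And here's my naive guess at what the cluster version might look like, based on Ron's compute-medium comment, so you have something concrete to show him (the `qsub` command and the PBS-style directives are assumptions on my part; their scheduler may work differently):

```python
# Naive sketch of the cluster workflow: one single-core job per parameter
# combination, submitted to the queue Ron mentioned. The qsub command and
# PBS-style directives are assumptions; NSCEE's scheduler may differ.
import subprocess

JOB_TEMPLATE = """#!/bin/bash
#PBS -q compute-medium
#PBS -l nodes=1:ppn=1
cd $PBS_O_WORKDIR/runs/run_{run_id:05d}
./pecube > pecube.log
"""

def submit(run_id):
    script = JOB_TEMPLATE.format(run_id=run_id)
    # PBS-style qsub accepts a job script on stdin
    subprocess.run(["qsub"], input=script.encode(), check=True)

# load several hundred jobs and let the system schedule them
for run_id in range(300):
    submit(run_id)
```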

eqsarah commented 10 years ago

Alright, so I had the meeting with Ron and he said that this doesn't sound like a hard thing to do. In fact, the Python code can be run on the cluster rather than the local machine if we like. The way he described it to me, the cluster has a frontend Linux machine with three different parts of the cluster attached to it. We would use the portion of the cluster with 72 compute nodes at 12 CPUs per node (864 cores total). The Python script could be run on the frontend and then farm jobs out to the individual nodes.

Unfortunately, I think I wasn't very clear about the makefile/compiling for the Fortran code, because I obviously don't really understand this much either right now. So what I'm going to do is spend some time reading their wiki page and then try to set up my account on the cluster. Then you and I can go from there.

He seems very responsive and helpful, and it's free since I'm a student at UNLV. However, I hope we don't need too much of their help in the end, because he said that he would then request to be second author on the paper. That seems a little bizarre if you ask me: although actually running the code is very important to the paper, it isn't really part of the scientific interpretation behind it. I guess we'll just have to see how it goes, and if we seem to be leaning too heavily on him we can always go back to the Amazon server method.

eqsarah commented 10 years ago

Ok, um... I don't understand anything about this right now. I'm trying not to hyperventilate; this is all Greek to me. To stop the panic attack, let's start with some easy questions:

(1) According to their wiki page, CentOS or Ubuntu should be installed to interact with the cluster. Do you know anything about either of these operating systems?
--I have never used either one. I looked at Ubuntu, and I guess I'm kind of scared to partition the hard drive on my Mac for something that I won't be doing for very long... and also this computer was expensive, and I am concerned that I'll really mess something up. Do you have any suggestions or comments about this particular operating system?

(2) Do you know which version of Fortran Pecube uses? I think these are the available compilers:

- GCC 4.4.6 (`gnu/gcc-4.4.6`): vendor (CentOS) supplied version of the GNU Compiler Collection
- GCC 4.5.3 (`gnu/gcc-4.5.3`): NSCEE-installed version 4.5.3 of the GNU Compiler Collection
- GCC 4.6.3 (`gnu/gcc-4.6.3`): NSCEE-installed version 4.6.3 of the GNU Compiler Collection
- GCC 4.7.1 (`gnu/gcc-4.7.1`): NSCEE-installed version 4.7.1 of the GNU Compiler Collection

Ugh. I'm sorry I am the rather ignorant intermediary between you and the cluster. I'm trying; I swear.