brooksambrose opened this issue 9 years ago
Got a little offline help from Aaron, who pointed me to the Amazon EC2 Instances info page and told me to focus on the M class of instances. That page describes the different CPU and memory configurations; for pricing you need the separate Amazon EC2 Pricing page. On the pricing page, make sure you're looking at the US West (Oregon) region, since we were advised to use that one for better rates. Rather than trying to predict the requirements of this particular job accurately (which I still want to learn to do), Aaron advised setting a budget for the job and starting up an instance that looks like it fits the bill. I'm deciding between these two:
| Type | vCPU | ECU | Memory (GiB) | Instance Storage (GB) | Linux/UNIX Usage |
|---|---|---|---|---|---|
| m4.large | 2 | 6.5 | 8 | EBS only | $0.126 per hour |
| m4.xlarge | 4 | 13 | 16 | EBS only | $0.252 per hour |
Plenty of folks have personal workstations bigger than these (not me), but they might work and they're cheap. To avoid being penny wise and pound foolish, I'll go with the $0.25/hour option (m4.xlarge) and see if it gets the job done in 5 hours. I think I can afford that. If the smaller instance weren't enough, I'd be out the dough and would have to deal with starting up another one.
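For what it's worth, here is the quick arithmetic behind that choice. The prices come from the table above; the 5-hour runtime is only my guess, not a measurement:

```sh
# Estimated cost of a 5-hour run at the on-demand rates listed above
awk 'BEGIN {
  hours = 5
  printf "m4.large  ($%.3f/hr): $%.2f\n", 0.126, 0.126 * hours
  printf "m4.xlarge ($%.3f/hr): $%.2f\n", 0.252, 0.252 * hours
}'
```

So the worst case for trying the bigger instance first is roughly a dollar and change.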
I think there are a couple of pieces missing in this short workflow that are worth putting in other issues.
You might be able to predict memory requirements for a big graph by running the program on smaller graphs (with similar edge statistics) and seeing if there is a power-law relationship. My guess is that the memory requirement is proportional to the size of the graph. In any case, `/bin/time` can tell you how much memory was used after the process completes:

`$ /bin/time -f %M commandname arg1 arg2 ...`

will run `commandname arg1 arg2 ...` and report the maximum resident set size in kilobytes. You can replace `%M` with something fancier to get other kinds of information; see `man time` for details.
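Building on that, here's a rough sketch of the measurement loop I have in mind. The program name (`./maximal_cliques`), the input file (`edges.txt`), and the assumption that the binary takes an edge-list file as its only argument are all placeholders for however the job script actually invokes the C code. Also, `head` is a crude way to subsample; a random sample of edges would probably preserve the edge statistics better.

```sh
#!/bin/sh
# Measure peak memory (max RSS, in KB) on progressively larger slices of the edge list.
# GNU time's -f/-o/-a options are used, so call /bin/time explicitly (not the shell builtin).
EDGES=edges.txt            # full two-column edge list (placeholder name)
PROG=./maximal_cliques     # the clique-finding binary (placeholder name)

: > memory_scaling.dat     # start the log fresh

for n in 100000 500000 1000000 5000000; do
    head -n "$n" "$EDGES" > "sample_${n}.txt"
    # Append "<rows> <max RSS in KB>" to the log; add %e to the format for elapsed seconds.
    /bin/time -f "$n %M" -a -o memory_scaling.dat \
        "$PROG" "sample_${n}.txt" > /dev/null
done

cat memory_scaling.dat     # columns: rows, max RSS in KB
```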
So the strategy, if I understand it, is to run some tests and try to figure out the functional form of the scaling? I'm a little dubious about my own ability to make a big extrapolation from the tests; maybe try-and-see is the better option for now.
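If it ever comes to fitting the scaling, the extrapolation itself doesn't have to be fancy: a log-log linear fit over the (rows, max RSS) pairs gives the exponent, and a slope near 1 would support the "proportional to graph size" guess. This is only a sketch and assumes a file like the `memory_scaling.dat` from the loop above, with row count and KB on each line:

```sh
# Fit log(KB) = a + b*log(rows) by least squares; b is the scaling exponent.
# Input: memory_scaling.dat with two columns per line: <rows> <max_RSS_KB>
awk '{
    x = log($1); y = log($2)
    n++; sx += x; sy += y; sxx += x*x; sxy += x*y
} END {
    b = (n*sxy - sx*sy) / (n*sxx - sx*sx)
    a = (sy - b*sx) / n
    printf "exponent b = %.3f (b near 1 means roughly linear in graph size)\n", b
    # Extrapolate to the full 27,797,685-row edge list; KB -> GiB
    printf "predicted max RSS at 27,797,685 rows: %.2f GiB\n", exp(a + b*log(27797685)) / 1048576
}' memory_scaling.dat
```

With only a handful of sample sizes the fit will be rough, which is consistent with the skepticism above, but it would at least put a number on the guess.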
I'd like to know how much memory an instance needs to complete this job, but I'm not sure how best to predict that. The job script is in this repo and involves running a C program to find maximal cliques in a graph. Is there a best practice for predicting memory usage? Is there a shell script I could run that would gather memory usage, maybe on smaller test batches?
The input data is at worst a two-column matrix with 27,797,685 rows containing 352,151 unique integers. I'm not sure what the largest working-memory allocation in the maximal-cliques routine actually is, but I'm hoping there's a practical test that would save me from having to study the code (I'm not a C programmer). No surprise, I get a segfault trying to run this on the free-tier BCE instance on EC2, which has 1 GB of memory.
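For a floor on the requirement, the raw edge list alone is easy to size up. This ignores adjacency structures and whatever the clique routine allocates internally (which I don't know), and it assumes the integers are stored as 4- or 8-byte values, so the real peak will be higher:

```sh
# Raw storage for the edge list alone: rows x columns x bytes per integer.
# Only a floor; the algorithm's own data structures come on top of this.
awk 'BEGIN {
  rows = 27797685; cols = 2
  printf "4-byte ints: %.2f GiB\n", rows * cols * 4 / (1024^3)
  printf "8-byte ints: %.2f GiB\n", rows * cols * 8 / (1024^3)
}'
```

That works out to roughly 0.2 to 0.4 GiB just to hold the edges, which at least makes the segfault on a 1 GB instance less surprising once the OS and the algorithm's working memory are added on top.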
I'm grateful for any advice about the cheapest instance I can run that would finish this job!