JaneliaSciComp / msg

Multiplexed Shotgun Genotyping
http://genomics.princeton.edu/AndolfattoLab/MSG.html

extract ref alleles - dynamically set low memory mode based on input file size and avail. memory #19

Closed gregpinero closed 12 years ago

gregpinero commented 12 years ago

Relevant email thread:

On Nov 9, 2011, at 1:46 PM, Gregory wrote:

That's an interesting idea.

Here are two things to consider:

1. I might be mistaken, but the overall program waits for all of the jobs in msgRun2 to finish before continuing, so we're really limited by the speed of the slowest node/individual.

So even if 399/400 jobs ran with limited-memory mode = False, we'd still be waiting for the one that didn't, in which case we wouldn't benefit.

Yes, but two things: (1) users on the cloud will be paying for compute time per node, I assume; and (2) on any machine with a limited number of nodes, we spend some time waiting for nodes to become open. As Peter and I push these libraries to >384 per run (into the thousands), this will become a more serious concern. Clearing some jobs off the nodes sooner should have some impact on overall performance.

2. I am under the impression that extract-ref-alleles was a bottleneck only because it was running out of memory and using swap*. So as long as that's no longer happening, it's not a huge win if we can make it somewhat faster, e.g., 20 minutes vs. 30 minutes of run time while other parts of MSG might still be taking hours.

But maybe it's still worth adding, since it would mean users wouldn't have to think about that setting beforehand. And if it does determine that 400/400 jobs can run in the faster mode, then it would be faster.

What do you guys think?

I am thinking that most of the time, all jobs will be small enough to not eat all the memory, so we should allow this faster mode. We will see the few nodes with excess data only rarely (especially as we improve our techniques to reduce barcode bias in the chemistry). But, those rare events slow the code to molasses.

-Greg

  • I'm assuming the run time of extract-ref-alleles grows linearly with the input file size. But I could very well be mistaken on that. I'd like to confirm this by looking at the output files from your run on the big data.

That is my assumption as well.
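One quick way to sanity-check that linear-scaling assumption against past runs (a sketch only; the numbers below are made up, and real (input size, runtime) pairs would come from the logs of the big-data run):

    # Hypothetical (input size in MB, extract-ref-alleles runtime in minutes)
    # pairs; real values would be pulled from past run logs.
    runs = [(120, 6.2), (480, 24.0), (960, 50.1)]
    for size_mb, minutes in runs:
        print("%6d MB -> %5.1f min (%.3f min/MB)" % (size_mb, minutes, minutes / size_mb))
    # A roughly constant min/MB ratio across runs would support linear scaling.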

From: David
Sent: Wednesday, November 09, 2011 12:10 PM

I have been thinking more about this. Wouldn't it be better to let the code limit memory on each node only when required? We should be able to estimate how much memory is required (at least from experience) for a starting file of a particular size. It should also be possible to query the nodes for available memory. Then, if a particular file is likely to exceed the available memory, implement limit_memory for that node. Can we do that?
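A minimal sketch of what that per-node check might look like (the scaling constant, safety margin, and function names are illustrative assumptions, not from the MSG codebase; it assumes peak memory grows roughly linearly with input size):

    import os

    EST_MB_PER_INPUT_MB = 5.0   # hypothetical: peak MB of RAM per MB of input, from past runs
    SAFETY_MARGIN = 0.8         # only plan on using 80% of what's free

    def free_memory_mb():
        """Read MemFree (reported in kB) from /proc/meminfo and return MB."""
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemFree:"):
                    return int(line.split()[1]) / 1024.0
        raise RuntimeError("MemFree not found in /proc/meminfo")

    def should_limit_memory(input_file):
        """Decide whether this node should run extract-ref-alleles in
        low-memory mode for the given input file."""
        input_mb = os.path.getsize(input_file) / (1024.0 * 1024.0)
        estimated_need_mb = input_mb * EST_MB_PER_INPUT_MB
        return estimated_need_mb > free_memory_mb() * SAFETY_MARGIN

A check like this could run at the top of each msgRun2 job, so only the rare oversized inputs would pay the low-memory penalty.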

gregpinero commented 12 years ago

I'm thinking I could use

cat /proc/meminfo

or

totalMemory = os.popen("free -m").readlines()[1].split()[1]  # 'Mem:' row, 'total' column; free is column 3
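A slightly more defensive version of that query (a sketch; it asserts that the second line of free -m output really is the Mem: row, whose columns are total, used, free, ..., all already in MB):

    import os

    def query_memory_mb():
        """Parse free -m: line 1 is the 'Mem:' row; columns are
        total, used, free, ... (values in MB)."""
        fields = os.popen("free -m").readlines()[1].split()
        assert fields[0].startswith("Mem"), "unexpected free output"
        total_mb, free_mb = int(fields[1]), int(fields[3])
        return total_mb, free_mb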

Test on cluster. Test using up memory on my workstation and verify that it reports 0 free ...