charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.
Apache License 2.0

Record the size of each object's migration data in the LB database #263

Closed. PhilMiller closed this issue 11 years ago.

PhilMiller commented 11 years ago

Original issue: https://charm.cs.illinois.edu/redmine/issues/263


When load balancing, the balancer may want to know how much migrating each object will cost. One major factor in this is how much data transfer a migration represents. Since the runtime can easily tell, via an object's PUP routines, how much data it would pack if the object actually did migrate, it can measure that size earlier too and note the result.
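As a rough illustration (not code from this issue), a sizing PUP::er can report how many bytes an object would pack without actually serializing anything; the object type and its members below are hypothetical:

```cpp
#include <cstddef>
#include "pup.h"   // PUP::er, PUP::sizer

// Hypothetical object; the type and its members are illustrative only.
struct MyObj {
  double values[1024];
  int count;
  void pup(PUP::er &p) {
    PUParray(p, values, 1024);
    p | count;
  }
};

// Measure how many bytes migrating this object would pack,
// without serializing or moving anything.
size_t migrationSize(MyObj &obj) {
  PUP::sizer s;      // counts bytes instead of copying them
  obj.pup(s);
  return s.size();   // the size the runtime would transfer on migration
}
```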

Thus, the runtime should check the PUP size of each object when gathering the LB database statistics and record it in an additional field. The field and its accessors should be named to clearly reflect that this is a migration size, not necessarily a working-set size, which may be larger.

harshithamenon commented 5 years ago

Original date: 2013-08-09 06:46:56


For some reason I thought this was assigned to me, so I went ahead and implemented it. Only later did I realize that it was assigned to Phil, so I apologize for not consulting before implementing it.

This feature is checked in on a branch (objsize_lbdatabase):

- Added a field for the pup size in LDObjData.
- Set the size during the AtSync call.

All the LB strategies can use this field to obtain the size of the data.
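As a hedged sketch of what that change might look like (the actual branch may differ, and the field and helper names below are assumptions, not the real definitions):

```cpp
#include <cstddef>
#include "pup.h"

// Hypothetical sketch only; the real LDObjData lives in the LB database
// headers and has many more fields. "pupSize" and "recordPupSize" are assumed names.
struct LDObjData {
  // ... existing load fields (wall time, CPU time, migratability, ...) ...
  unsigned int pupSize;   // bytes this object would pack if it migrated
};

// Fill the field while gathering LB statistics at the AtSync point.
template <typename Obj>
void recordPupSize(Obj &obj, LDObjData &entry) {
  PUP::sizer s;
  obj.pup(s);
  entry.pupSize = static_cast<unsigned int>(s.size());  // 0 can serve as "not collected"
}
```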

PhilMiller commented 5 years ago

Original date: 2013-08-09 12:15:12


Nope, I just entered it, so that the request wouldn't get lost, nor the new feature forgotten. It was unassigned until just now. All yours - enjoy!

PhilMiller commented 5 years ago

Original date: 2013-08-09 12:30:05


Just a quick review of the base implementation: the 2 GiB limit of a 32-bit int might actually be a problem here (for example, consider a weakly-virtualized BRAMS). Sizes are non-negative, so that can at least become an unsigned int with no trouble. If for some reason the data isn't collected or is compiled out, we can still return 0 as a sentinel value.

4 GiB can still be overrun in practice, even if it will be rarer. That suggests we might want a 64-bit field. Going a bit deeper, however, a few years back there was a major push to limit the memory usage of the LB database itself. Sanjay's suggestion of a coarser resolution (e.g. units of 4 KiB) comes in here, but I don't think it would be good to implement that wholesale, because the varying migration costs of small objects can matter.

I've got an idea for a variable-resolution scheme that always exposes size_t to callers but may round the stored value to a varying extent. I'll implement a simple one-step version of that, feeding an unsigned int, after I've had breakfast. With a little more thought on where to draw the cutoffs, we can probably reduce that to a short.

PhilMiller commented 5 years ago

Original date: 2013-08-09 19:50:39


OK, I've implemented the approximate encoding scheme and pushed it on a rebased version of Harshitha's branch. It packs the size into 16 bits and is accurate within 0.5% up to well beyond the physical memory capacity of any single node we're ever likely to encounter (the limit beyond which the error increases is roughly the entire memory capacity of Blue Waters). I expect lots of other code to break before this does.
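For concreteness, here is a hypothetical sketch of one way a 16-bit exponent/mantissa encoding with sub-0.5% relative error could work; it is not the actual check_size_values code from the branch, just an illustration of the trade-off described above:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sketch; not the encoding actually committed on the branch.
// High bits hold a binary exponent, low bits a truncated mantissa, so the
// relative error stays below roughly 1/2^MANTISSA_BITS (about 0.4% here).
constexpr unsigned MANTISSA_BITS = 8;

// Pack a byte count into 16 bits, rounding to a nearby representable value.
inline uint16_t encodeSize(size_t bytes) {
  if (bytes < (size_t(1) << MANTISSA_BITS))
    return static_cast<uint16_t>(bytes);             // small sizes stored exactly
  unsigned exp = 0;
  while (bytes >= (size_t(1) << (MANTISSA_BITS + 1))) {
    bytes = (bytes + 1) >> 1;                        // shift out a low bit, rounding
    ++exp;
  }
  // bytes now has MANTISSA_BITS+1 significant bits; drop the implicit leading 1
  return static_cast<uint16_t>(((exp + 1) << MANTISSA_BITS) |
                               (bytes & ((size_t(1) << MANTISSA_BITS) - 1)));
}

// Recover an approximate byte count from the 16-bit code.
inline size_t decodeSize(uint16_t code) {
  unsigned exp = code >> MANTISSA_BITS;
  size_t mant = code & ((1u << MANTISSA_BITS) - 1);
  if (exp == 0) return mant;                         // exact small value
  return (mant | (size_t(1) << MANTISSA_BITS)) << (exp - 1);
}
```

With an 8-bit exponent field this range extends far past the memory of any single node, which is the accuracy/range trade-off being claimed above.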

Someone else should review what I've done, and then we can merge it to mainline.

PhilMiller commented 5 years ago

Original date: 2013-11-09 01:03:40


The code currently merges cleanly with mainline and builds net-darwin-x86_64 without error. The tests run without error. Jonathan, please review and merge to mainline, and close this.

lifflander commented 5 years ago

Original date: 2013-11-09 01:57:00


The following commits for data compression seem a little obscure:

58651c9d33ac4badaa24b5bd8ea7434a1dcba486 48843a36d2c8724cd2266dad1b7d48808441ef48

Specifically, "check_size_values" seems quite magical; I can infer what it does, but a little more documentation would be helpful.

PhilMiller commented 5 years ago

Original date: 2013-11-09 02:08:58


Comment added.

nikhil-jain commented 5 years ago

Original date: 2013-11-12 03:51:29


Jonathan - Did you have anything more on data compression, or can we ignore this issue for the release?

lifflander commented 5 years ago

Original date: 2013-11-12 22:12:49


Sorry, my mistake. I thought I had merged this, but it wasn't on Gerrit... Merged now.