DARMA-tasking / vt

DARMA/vt => Virtual Transport

NodeStats will be missing past phase data for immigrated objects #1127

Open PhilMiller opened 4 years ago

PhilMiller commented 4 years ago

Describe the bug

A load model that looks N > 1 phases into the past may query missing data for an immigrated object if that object migrated fewer than N phases ago. On arrival, objects don't call addNodeStats for the 'back catalog' phases of their ElementStats.
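The failure mode can be sketched with a toy per-node store. This is a minimal illustration, not vt's code: `NodeStore`, `record`, `has`, and `missingPhases` are hypothetical names, and the real NodeStats and ElementStats classes are not modeled here.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins; none of these names are taken from vt.
using ObjectId = int;
using Phase = std::size_t;

// Toy per-node store: one map of object loads per phase.
struct NodeStore {
  std::unordered_map<Phase, std::unordered_map<ObjectId, double>> by_phase;

  // Record the load an object incurred on this node during phase p.
  void record(Phase p, ObjectId obj, double load) { by_phase[p][obj] = load; }

  // True if this node holds an entry for obj in phase p.
  bool has(Phase p, ObjectId obj) const {
    auto pit = by_phase.find(p);
    return pit != by_phase.end() && pit->second.count(obj) > 0;
  }
};

// Phases among the last n (ending at `current`, newest first) for which
// the store has no entry for obj. A model looking back n phases would
// hit exactly these gaps.
std::vector<Phase> missingPhases(
  NodeStore const& store, ObjectId obj, Phase current, std::size_t n
) {
  std::vector<Phase> missing;
  for (std::size_t back = 0; back < n && back <= current; ++back) {
    Phase p = current - back;
    if (!store.has(p, obj)) {
      missing.push_back(p);
    }
  }
  return missing;
}
```

If object 7 ran phases 0 and 1 on another node and arrived before phase 2, the destination store only has phase 2, so a 3-phase look-back finds phases 1 and 0 missing. Recording the earlier phases on arrival (the 'back catalog') would close the gap in this toy model.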

nlslatt commented 3 years ago

@PhilMiller Is there any danger in having the NodeStats object continue to retain stats data for objects that have since been migrated away?

PhilMiller commented 3 years ago

In terms of memory, I think it would be fine, since there's a limited look-back, and old phases will cycle out. In the worst case of rotation, it multiplies the peak footprint, but no reasonable application and balancer will behave anything like that.
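A rough sketch of why retained data stays bounded, assuming a fixed look-back window with eviction of older phases. `BoundedHistory` and its members are hypothetical names, not vt's API.

```cpp
#include <cstddef>
#include <map>
#include <unordered_map>

// Hypothetical per-node history keeping only the most recent
// `lookback` phases; older phases cycle out as new ones are recorded.
class BoundedHistory {
public:
  explicit BoundedHistory(std::size_t lookback) : lookback_(lookback) {}

  void record(std::size_t phase, int obj, double load) {
    phases_[phase][obj] = load;
    // Evict every phase q with q + lookback <= phase, i.e. anything
    // that has fallen out of the window ending at `phase`.
    while (!phases_.empty() && phases_.begin()->first + lookback_ <= phase) {
      phases_.erase(phases_.begin());
    }
  }

  // Number of phases currently retained.
  std::size_t phaseCount() const { return phases_.size(); }

  // Total (phase, object) entries retained, including entries for
  // objects that have since migrated away.
  std::size_t entryCount() const {
    std::size_t n = 0;
    for (auto const& kv : phases_) {
      n += kv.second.size();
    }
    return n;
  }

private:
  std::size_t lookback_;
  std::map<std::size_t, std::unordered_map<int, double>> phases_;
};
```

In the rotation worst case, where a different set of objects occupies the node every phase, the retained entry count is at most the window length times the per-phase object count, which is the "multiplies the peak footprint" effect described above.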

For the load analysis that balancers do, there would have to be some exclusion of departed objects if aggregate load from past phases is ever recomputed rather than just stored. We're not doing anything of the sort right now, nor is it on the short-term roadmap.

If the stats file output happens per-phase, then that should be fine. If multiple phases get written at once, then we need to be cautious about making an object appear to exist in multiple places, or simply having duplicate entries.
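One way to guard a multi-phase write against the duplicate-entry concern is to check for repeated (phase, object) keys before output. The sketch below assumes a flat record layout; `StatsRecord` and `findDuplicates` are hypothetical and do not reflect vt's stats file format.

```cpp
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Hypothetical flat record: one object's load on one node in one phase.
struct StatsRecord {
  std::size_t phase;
  int object;
  int node;
  double load;
};

// Returns each (phase, object) key that occurs more than once,
// reported once in order of first repetition. A repeated key means the
// object would appear to exist in multiple places during that phase.
std::vector<std::pair<std::size_t, int>> findDuplicates(
  std::vector<StatsRecord> const& records
) {
  std::set<std::pair<std::size_t, int>> seen;
  std::set<std::pair<std::size_t, int>> reported;
  std::vector<std::pair<std::size_t, int>> dups;
  for (auto const& r : records) {
    auto key = std::make_pair(r.phase, r.object);
    bool first_time = seen.insert(key).second;
    if (!first_time && reported.insert(key).second) {
      dups.push_back(key);
    }
  }
  return dups;
}
```

For example, if node 0 retains object 7's phase-1 entry after it leaves and node 1 also holds a phase-1 entry for object 7, the key (1, 7) is flagged.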

On the whole, I think it's fine to retain it for now. The tighter invariant of "NodeStats represents exactly the currently-present objects" might be nice, if it's easy to implement. If it's meaningful work to implement, then we can defer it to when we see a need.

This raises interesting questions about load analysis for dynamic collections, whose elements might be created or destroyed from one phase to the next.

lifflander commented 3 years ago

@nlslatt Is this a problem we want to solve or just document as a restriction to the current interface?

nlslatt commented 3 years ago

@lifflander Load models that require data from more than one phase will fail if load balancing happens too frequently. I personally have no plans to run that way. We should definitely document this until resolved. We should also think hard about whether there's a way to produce a more meaningful error message for a user who doesn't know about this limitation.
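As a sketch of what a more meaningful error could look like, a checked lookup can name the object and phase and point at the likely cause. Everything here is an assumption for illustration: `History` and `checkedPastLoad` are hypothetical, and a real implementation would presumably use the codebase's own abort or assertion mechanism rather than a C++ exception.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical history layout: phase -> (object id -> load).
using PhaseLoads = std::unordered_map<int, double>;
using History = std::unordered_map<std::size_t, PhaseLoads>;

// Returns the stored load, or throws with a message that explains the
// probable cause instead of failing on a bare missing-key lookup.
double checkedPastLoad(History const& history, int obj, std::size_t phase) {
  auto pit = history.find(phase);
  if (pit != history.end()) {
    auto oit = pit->second.find(obj);
    if (oit != pit->second.end()) {
      return oit->second;
    }
  }
  std::ostringstream msg;
  msg << "No load data on this node for object " << obj
      << " in phase " << phase << ". The object may have migrated here "
      << "after that phase. Load models that look back multiple phases "
      << "can fail when load balancing runs too frequently.";
  throw std::runtime_error(msg.str());
}
```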