ComputationalRadiationPhysics / picongpu

Performance-Portable Particle-in-Cell Simulations for the Exascale Era :sparkles:
https://picongpu.readthedocs.io

Resource Status (incl. Memory) Singleton #850

Open ax3l opened 9 years ago

ax3l commented 9 years ago

Some applications, such as ParaView, have a very nice overview of the memory consumption of connected clients (see Memory Inspector).

We should add a new singleton class that contains simple key-value pairs (a name (string) and a value, e.g. bytes (uint64_t)) for all relevant allocations we do on the device. This can be evaluated during init, e.g., when the fields are declared and set to zero (getMemory before/after) and when mallocMC is initialized to allocate its heap (getMemory before/after).

-> 10.4 GB of 11.25 GB used on a K80 (with ECC enabled, 6.25% of the 12 GB of memory is reserved for ECC bits, which leaves 11.25 GB).

mallocMC again: used/free MBytes. Bytes per species (if possible/feasible): not feasible.

With that, a full overview of the total GPU memory can be provided. Furthermore, mallocMC's own "getFreeMemory" calls should be incorporated.
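The before/after bookkeeping described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `ResourceStatus`, `track`, `bytes` and `totalBytes` are hypothetical names that do not exist in PIConGPU or PMacc, and the free-memory query is passed in as a callable so the sketch runs without a GPU. On a real device that callable might wrap the CUDA runtime call `cudaMemGetInfo`.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <string>

// Hypothetical sketch: none of these names exist in PIConGPU or PMacc.
class ResourceStatus
{
public:
    // Meyers singleton: one registry per process (i.e. per rank/device).
    static ResourceStatus& getInstance()
    {
        static ResourceStatus instance;
        return instance;
    }

    ResourceStatus(const ResourceStatus&) = delete;
    ResourceStatus& operator=(const ResourceStatus&) = delete;

    // Runs `allocate` and books the drop in free device memory under `name`.
    // `queryFreeBytes` is an assumed stand-in for a real free-memory query.
    void track(
        const std::string& name,
        const std::function<std::uint64_t()>& queryFreeBytes,
        const std::function<void()>& allocate)
    {
        const std::uint64_t before = queryFreeBytes();
        allocate();
        const std::uint64_t after = queryFreeBytes();
        // Avoid unsigned underflow if free memory grew in the meantime.
        entries_[name] += (before > after) ? (before - after) : 0u;
    }

    // Bytes booked under `name`, or 0 if nothing was tracked under it.
    std::uint64_t bytes(const std::string& name) const
    {
        const auto it = entries_.find(name);
        return it == entries_.end() ? 0u : it->second;
    }

    // Sum over all tracked allocations.
    std::uint64_t totalBytes() const
    {
        std::uint64_t sum = 0u;
        for(const auto& entry : entries_)
            sum += entry.second;
        return sum;
    }

    // Drops all entries, e.g. before a re-initialization.
    void clear()
    {
        entries_.clear();
    }

private:
    ResourceStatus() = default;
    std::map<std::string, std::uint64_t> entries_;
};
```

During init one would call something like `ResourceStatus::getInstance().track("fields", query, allocateFields)` and the same for the mallocMC heap, where `query` and `allocateFields` are placeholders. A plain `std::map` keeps the entries sorted by name, which makes a printed overview stable between runs.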

The background is that it is extremely hard to quantify and predict when a GPU might run out of memory for a simulation. We must allow users to query the memory consumption as transparently as possible, since they fine-tune their particles-per-cell and cells-per-GPU inputs to the resources they have available. The feedback from the simulation should be more than a vague "crashes"/"does not crash immediately"/"crashes after N steps", as it is now :)

@slizzered @psychocoderHPC that might be something for you.

erikzenker commented 8 years ago

I want to hook into this discussion because I think we could use this singleton in a more general manner. This singleton should be something like a resource status monitor which provides us with per-node information such as:

I would like to use this information for a load-monitoring plugin that can dump it to disk or transfer it to a live-monitoring web application.

The evaluation of this data will show us load imbalance in our code (which could be solved manually) and will give us a good motivation to work on an automatic mechanism for better load balance.

erikzenker commented 8 years ago

I already talked to @ax3l, and he told me that some information already exists, such as the number of particles on a node/rank. Has any other work already been done with respect to my previous post? @psychocoderHPC

ax3l commented 8 years ago

@erikzenker just for the first question, this is the particle counting per device (available PMacc one-liner)

ax3l commented 8 years ago

erikzenker commented 8 years ago

Thx @ax3l for providing this information