Test restart from adios checkpoint at scale

abbotts / psc

The PSC particle-in-cell code

Test restart from adios checkpoint at scale #3

Open abbotts opened 7 years ago

abbotts commented 7 years ago

Restarting from checkpoints has been tested and validated at small scale, but not at the large simulation sizes where this method will be most useful (say, 2000+ nodes).

abbotts commented 7 years ago

A small-scale test (8 MPI ranks on my laptop) passes here. I still need to validate at scale, since that's where I started seeing issues. I'll need to think of a way to diff output files efficiently at 1000-node scale.
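One possible way to diff large output sets is to compare per-file checksums instead of file contents. The sketch below is only an illustration, not part of PSC: the function names and directory layout are hypothetical, and it assumes a restarted run should reproduce the reference output byte for byte, which may not hold if rebalancing changes how data is laid out or the order of floating-point operations.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root):
    """Map each file's path (relative to root) to its digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_trees(reference_dir, restart_dir):
    """Compare two output directories (hypothetical layout: same file names).

    Returns three sorted lists: files missing from the restarted run,
    files that only exist in the restarted run, and files whose
    contents differ.
    """
    ref = tree_digests(reference_dir)
    new = tree_digests(restart_dir)
    missing = sorted(set(ref) - set(new))
    extra = sorted(set(new) - set(ref))
    changed = sorted(k for k in set(ref) & set(new) if ref[k] != new[k])
    return missing, extra, changed
```

Since each file is hashed independently, the hashing step could be spread over several processes or nodes, and only the small digest tables would need to be brought together for comparison.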

abbotts commented 7 years ago

A 1024-node restart test (with rebalancing) is queued. Who knows how long it's going to take to run, but I should be able to close this soon.

abbotts commented 7 years ago

Restarting from the checkpoint is proving to be a problem at scale. To speed up the actual write, I disable the metadata file during output. This file describes the data contained within the many subfiles, and is necessary to read the checkpoint back in.

It has to be constructed after the fact using "bpmeta". This process seems to take a very long time, more than 30 minutes on Titan. It can be threaded but not parallelized, so I can't speed it up.

I'll get in touch with the ADIOS devs this week and see if there's a way for me to speed it up.

abbotts commented 7 years ago

It should be noted that this process would probably be faster if I had fewer subfiles to gather the metadata from. My test case has 1024 aggregators, which is more than the number of OSTs on Lustre. There's a lot of testing that could be done to find the right number of aggregators and then to speed up metadata construction. Still, burning a couple of hours on one node to construct the metadata is better than burning even 10 minutes on thousands of nodes.
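To make that trade-off concrete, here is a rough node-hour comparison. The numbers are assumptions chosen for illustration only: "a couple of hours" is taken as 120 minutes on a single node, and "thousands of nodes" is replaced by the 1024 nodes of the test case above, which if anything understates the cost of the second option.

```python
def node_hours(nodes, minutes):
    """Allocation cost in node-hours for a job of the given size and duration."""
    return nodes * minutes / 60.0


# Assumed: metadata built after the fact on one node in about 2 hours.
offline_cost = node_hours(nodes=1, minutes=120)

# Assumed: the full 1024-node job spends 10 extra minutes writing metadata.
inline_cost = node_hours(nodes=1024, minutes=10)

print(f"offline metadata build: {offline_cost:.1f} node-hours")
print(f"inline metadata write:  {inline_cost:.1f} node-hours")
print(f"ratio: {inline_cost / offline_cost:.0f}x")
```

Under these assumptions the after-the-fact build costs 2 node-hours against roughly 171 for the inline write, so the single-node approach stays cheaper in allocation terms even though it takes longer in wall-clock time.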