LIGHT requires a single particle in each processor domain

MPAS-Dev / MPAS-Model

Repository for MPAS models and shared framework releases.


LIGHT requires a single particle in each processor domain #224

Open pwolfram opened 5 years ago

pwolfram commented 5 years ago

At present, LIGHT fails if there are no particles on a computational domain.

cc @bradyrx

pwolfram commented 5 years ago

An easy solution is to build off #56 once merged and make sure we have restarts for some small number of particles in each computational domain and have those particles be reset at something like some small number of timesteps. As for load balancing, given the current spatially distributed architecture there is no way around this issue, although I'm exploring some alternative strategies with ANL.

I can also go back and look for simpler fixes within LIGHT for this issue.

bradyrx commented 5 years ago

Note that it seems this is only the case upon restart. I.e., while a simulation is running with LIGHT, it seems that it can continue without fail even if a processor loses all its particles. I noticed this in our 30to10-BGC runs. We have some restart files for run cycles prior to crashing that have empty processors -- it only crashes when trying to start back up from them.

> make sure we have restarts for some small number of particles in each computational domain and have those particles be reset at something like some small number of timesteps.

This seems like a simple solution. So you would have a very small number of particles (as few as one) that are constantly being reset back into the computational domain, and then somehow flag these so they are not analyzed by the user? Or not saved out, if possible.
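The seeding idea discussed here could be sketched roughly as follows. This is only an illustration, not LIGHT code: it assumes a per-cell list of partition indices (one entry per cell) and a list of the cells that currently hold particles, and the function names `empty_partitions` and `seed_cells` are hypothetical.

```python
def empty_partitions(cell_partition, particle_cells, n_partitions):
    """Return the partitions that currently hold no particles."""
    occupied = {cell_partition[c] for c in particle_cells}
    return [p for p in range(n_partitions) if p not in occupied]


def seed_cells(cell_partition, partitions):
    """Pick the first cell found in each requested partition."""
    wanted = set(partitions)
    seeds = {}
    for cell, part in enumerate(cell_partition):
        if part in wanted and part not in seeds:
            seeds[part] = cell
    return seeds


# Toy mesh (assumed data): 8 cells over 4 partitions,
# with particles only in cells 0 and 5.
cell_partition = [0, 0, 1, 1, 2, 2, 3, 3]
particle_cells = [0, 5]

missing = empty_partitions(cell_partition, particle_cells, 4)
placeholders = seed_cells(cell_partition, missing)
```

Each entry in `placeholders` maps an otherwise empty partition to a cell where one placeholder particle could be seeded and periodically reset.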

pwolfram commented 5 years ago

@bradyrx, since MPAS-O is a state machine, input and restart files are the same.

Your observation above is consistent with Southern Ocean-only particle runs crashing, correct? To be clear, both restarts and initial runs without particles on a processor fail; this is consistent with your experience, right?

So, presumably, if this issue is fixed then you could just do Southern Ocean-only runs.

The easy Python-based solution would be to reset a particle to its initial cell at high frequency so that it can effectively only live on one computational domain. It would be easy to filter these particles out because they would have a small reset time scale. A cleaner approach is harder because we don't have an easy way to exclude certain particles from I/O, which is lock-step limited because it leverages the existing Eulerian parallel I/O routines in MPAS.
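Filtering by reset time scale at analysis time might look something like the sketch below. The record layout, the `reset_time` field (in seconds, `None` meaning never reset), and the threshold value are all assumptions made for illustration; they are not taken from LIGHT's actual output format.

```python
def split_placeholders(particles, reset_threshold):
    """Separate placeholder particles from real ones by reset time scale."""
    real, placeholders = [], []
    for p in particles:
        reset = p["reset_time"]
        # A very short reset interval marks a placeholder particle.
        if reset is not None and reset <= reset_threshold:
            placeholders.append(p)
        else:
            real.append(p)
    return real, placeholders


# Assumed example data; field names are hypothetical.
particles = [
    {"id": 0, "reset_time": None},          # never reset
    {"id": 1, "reset_time": 86400.0 * 30},  # reset every 30 days
    {"id": 2, "reset_time": 600.0},         # placeholder, reset every 10 minutes
]

real, placeholders = split_placeholders(particles, reset_threshold=3600.0)
```

Only the particles in `real` would then be passed on to the analysis.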

xylar commented 5 years ago

To me, this fix sounds more complicated than tracking down and fixing the restart bug...

pwolfram commented 5 years ago

It is a soft bug: LIGHT was never designed for anything but global applications of particles. This "bug" is essentially a use case that was not explicitly tested or used when LIGHT was developed. It boils down to a violation of the design input spec, and the "bug fix" is basically a request to relax the requirements on input/restart files.

Does that make more sense @xylar?

xylar commented 5 years ago

Hmm, it seems a bit more serious than that to me. At higher res with not too many particles and a lot of processors, even the original application doesn't work as expected because clustering of particles can leave a processor without any particles. This only causes trouble when a user tries to restart.

I continue to think seeding a single unneeded particle to each processor and then having to filter it out at analysis is more complicated than seeing where in the restart code there is the assumption of >0 particles and just fixing it. We know the code runs fine in these conditions because @maltrud and @bradyrx have been doing so. Maybe I'm misunderstanding how complicated that would be.
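As a rough illustration of what removing a ">0 particles" assumption means in general, the sketch below treats an empty local particle list as a valid state. It does not reproduce the actual LIGHT restart code; the record layout and the function names `read_local_particles` and `advance` are hypothetical.

```python
def read_local_particles(restart_records, local_cells):
    """Keep only the restart records whose cell belongs to this domain."""
    owned = set(local_cells)
    # An empty result is a valid state, not an error.
    return [r for r in restart_records if r["cell"] in owned]


def advance(local_particles, dt):
    """Move each particle; the loop simply runs zero times when empty."""
    for p in local_particles:
        p["x"] += p["u"] * dt
    return len(local_particles)


# Assumed example: this domain owns cells 10 and 11, but no particle is there.
records = [{"cell": 3, "x": 0.0, "u": 1.0}, {"cell": 7, "x": 5.0, "u": -1.0}]
local = read_local_particles(records, local_cells=[10, 11])
n_moved = advance(local, dt=60.0)
```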