Exawind / nalu-wind

Solver for wind farm simulations targeting exascale computational platforms
https://nalu-wind.readthedocs.io
Other
122 stars 83 forks source link

Clear sync state of fields for restart #1241

Closed psakievich closed 5 months ago

psakievich commented 6 months ago

Sync issues are arising for fields that have been messed with prior to the populate_restart call. For restart, we are just reading from file so sync state should be clear when reading.

Closes #1191

lawrenceccheung commented 6 months ago

Thanks @psakievich. Note that the current fix generates this error message:

  >> 593    /projects/wind_uq/lcheung/NaluBuilds/exawind-manager.attaway.20240220/exawind-manager/environments/naluwindNateVerOF34patch/nalu-wind/src/Realm.C:3416:9: error: 'fld' was not declared in this scope

so I put it inside the loop:

        for (unsigned i = 0; i < numStates; ++i) {
          auto* fld = field->field_state(static_cast<stk::mesh::FieldState>(i));
          fld->clear_sync_state();
          fld->modify_on_host();
          ngp_field_manager().get_field<double>(fld->mesh_meta_data_ordinal());
          fld->sync_to_device();
        }

We're testing this out now, let's see if fixes the problems.

Lawrence

alanw0 commented 6 months ago

It doesn't make sense to do both clear_sync_state and also modify/sync. Where is the field being modified? You should only do modify if you are modifying the field data. Alternatively you do clear_sync_state if you intend to overwrite the field data and you don't care which memory space has the most recent previous modification.

psakievich commented 5 months ago

I'm second guessing this now too @alanw0. The modification is on L3419, but the states are all screwed up.

alanw0 commented 5 months ago

I'm second guessing this now too @alanw0. The modification is on L3419, but the states are all screwed up.

What do you mean by 'screwed up'? Do you think that the I/O read is failing to mark some fields or field-states as modify-on-host? We have modify_on_host calls in the I/O, but let us know if you think something is missing. Perhaps the I/O is only reading state new, and marking that field as modified-on-host, but then state old is left un-marked? Or vice-versa? That should be ok, since it would get fixed the next time you do a state rotation, or the next time you need the field (of either state) on device, which is where sync_to_device would get called.

Keep us posted if you think stk is missing some calls or has things wrong.

psakievich commented 5 months ago

@alanw0 #1190 shouldn't be happening is what I mean. I'm also not really sure why we need this per component code here either though. Seems like it is likely redundant. I also don't think we need a sync to device here either. That should happen at the calling points for the code using the fields on device. Seems like this could be excessive host-device communication.

Seems like we should be able to just delete all of this: https://github.com/Exawind/nalu-wind/blob/56c0e6dbca1255eba0e2f077bc59a1b4e286669c/src/Realm.C#L3421-L3435

Provided the ioBroker marks fields as modified on host when it reads them in.

lawrenceccheung commented 5 months ago

My apologies, I should have reported this earlier -- adding the fld->clear_sync_state(); inside the loop did fix the modify on host/device problems that Ken and Myra were seeing in their runs.

Lawrence