Error in the second epoch in ML

RY4GIT commented 1 year ago

This is essentially the same issue as https://github.com/NWC-CUAHSI-Summer-Institute/LGAR-py/pull/11

Check this to debug https://github.com/NWC-CUAHSI-Summer-Institute/LGAR-py/blob/eec7bb4fb455b33bce7d19ee0e1ebfd588ddae37/dpLGAR/models/dpLGAR.py#L106

RY4GIT commented 1 year ago

Debug session with @taddyb

Issues found:

mlp_forward() was called twice in one epoch.
Attributes stored on self are not reset by zero_grad(), which only clears the .grad of the parameters it manages.
- The tuned parameters, e.g., self.cfe_instance.refkdt and self.cfe_instance.satdk, were not reset, so they are now reset with torch.zeros_like(self.cfe_instance.refkdt).
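
The second point can be reproduced outside the model. Below is a minimal sketch, assuming a hypothetical TinyModel that caches a tensor on self between epochs (the class, its state attribute, and the toy loss are illustrative only and are not taken from dCFE). It shows that zero_grad() leaves the cached tensor attached to the previous epoch's graph, and that replacing it with torch.zeros_like() lets the next epoch run.

```python
import torch


class TinyModel(torch.nn.Module):
    """Hypothetical stand-in: keeps a running tensor on self across epochs."""

    def __init__(self):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.tensor(2.0))
        self.state = torch.zeros(())  # plain attribute, not a Parameter

    def forward(self, x):
        # The new state depends on the old state, so it inherits its graph.
        self.state = self.state + self.weight * x
        return self.state


model = TinyModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.tensor(1.0)

# Epoch 1 runs normally.
loss = model(x) ** 2
loss.backward()
optimizer.step()
optimizer.zero_grad()

# zero_grad() cleared weight.grad, but state still carries graph history.
state_keeps_graph = model.state.grad_fn is not None

# Epoch 2 without a reset: backward reaches the freed graph from epoch 1.
try:
    loss = model(x) ** 2
    loss.backward()
    second_epoch_failed = False
except RuntimeError:
    second_epoch_failed = True

# Reset the cached attribute with a fresh, graph-free tensor, then retry.
model.state = torch.zeros_like(model.state)
optimizer.zero_grad()
loss = model(x) ** 2
loss.backward()
reset_epoch_ok = model.weight.grad is not None
```

The same pattern presumably applies to refkdt and satdk on self.cfe_instance: whatever was written to them during the previous epoch has to be replaced before the next forward pass.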

Tips to debug:

Print the suspicious variables (refkdt, loss, satdk).
Display and browse all the attributes of the instances (especially self.cfe_instance) to check whether any of them still carry gradients after resetting.
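
The attribute check can be automated. Below is a minimal sketch of a helper, here called find_graph_attached (a hypothetical name, not part of dCFE or PyTorch), that scans an object's direct attributes and reports every one whose grad_fn is not None. It relies only on the grad_fn attribute that PyTorch tensors expose, so it needs no imports. It looks only at vars(obj), so nested objects and parameters registered inside an nn.Module would need a separate pass.

```python
def find_graph_attached(obj):
    """Map attribute name -> grad_fn class name for attributes still tied to a graph."""
    attached = {}
    for name, value in vars(obj).items():
        # Tensors produced by an autograd operation have a non-None grad_fn;
        # leaf tensors and non-tensor attributes do not.
        grad_fn = getattr(value, "grad_fn", None)
        if grad_fn is not None:
            attached[name] = type(grad_fn).__name__
    return attached


# Intended use inside the training loop, right after the reset:
#     print(find_graph_attached(self.cfe_instance))
```

An empty dictionary after the reset suggests nothing is left over from the previous epoch. The reported class names also make it easier to tell expected entries, such as slices of the MLP output, from leftovers.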

To make things easier, reset the parameters, instance attributes, fluxes and states, and volume tracking, and then update the model with the newly predicted refkdt and satdk parameters.

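def initialize(self):
    # Re-initialize the CFE model with the dynamic parameters.
    # Replace the tuned parameters with fresh tensors that carry no graph history.
    self.cfe_instance.refkdt = torch.zeros_like(self.cfe_instance.refkdt)
    self.cfe_instance.satdk = torch.zeros_like(self.cfe_instance.satdk)
    # Clear fluxes, states, and volume tracking left over from the previous epoch.
    self.cfe_instance.reset_flux_and_states()
    self.cfe_instance.reset_volume_tracking()
    # Load the newly predicted refkdt and satdk into the model.
    self.cfe_instance.update_params(self.refkdt[:, 0], self.satdk[:, 0])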

A gradient with SelectBackward most likely comes from the slices of the MLP output (so it is okay); everything else is probably a remnant of model operations from the previous epoch.

RY4GIT commented 1 year ago

Addressed in eb03b44481cba870a4601665fedc9cf84daa5bfc

NWC-CUAHSI-Summer-Institute / dCFE

Error in the second epoch in ML #23