amoretti86 / PSVO

Implementation of Particle Smoothing Variational Objectives

PSVO math.isfinite(log_ZSMC_Train) raises Stop Training since October Fix #1

Open mobias17 opened 5 years ago

mobias17 commented 5 years ago

Hello,

Some issues have been encountered since the October fix on PSVO and the introduction of PSVOwR.

When running training, log_ZSMC is extremely small at initialization (e.g. around -7e+30) and runs into NaN during training.

Steps to reproduce:

System in use: TF 1.13.1, TFP 0.5.0

With my own dataset, log_ZSMC improves but the Valid k-Step Rsq does not.

Any clues as to what configuration changes are required so that the fix runs?
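
For context, the stop condition in the issue title corresponds to a finiteness guard on the evaluated objective, roughly along the lines of the sketch below (placeholder names such as `log_ZSMC_val` are assumptions, not the repo's exact code):

```python
import math

# Minimal sketch of the failure mode: the evaluated objective comes back
# NaN/inf and a finiteness guard halts training. `log_ZSMC_val` and the
# error message are placeholders, not the repo's exact code.
log_ZSMC_val = float("nan")  # e.g. the value returned by the session run this step

if not math.isfinite(log_ZSMC_val):
    raise RuntimeError("Stop Training: log_ZSMC became non-finite (NaN or inf).")
```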

wangzizhao commented 5 years ago

Hi @mobias17 !

Sorry for the late reply as for some reason I didn't receive any notification from GitHub. I just tried the notebook with the FHN dataset provided in the repo. It runs well and I just pushed the results for your reference. Besides, sorry for not updating the notebook with the PSVOwR flag change.

May I know the following details, so I can better diagnose the issue?

Also, the algorithm occasionally has numerical issues for some random seeds, so you can try a few different ones. But if that happens for all seeds, it is likely a tuning problem.
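
If it helps, one way to try a different seed in a TF 1.x setup is roughly the following (a sketch only; the repo may instead expose a seed through its config, and the names here are assumptions):

```python
import numpy as np
import tensorflow as tf

# Sketch of switching the random seed between runs; try several values if
# one seed runs into NaN. The repo may instead take a seed via its config.
SEED = 1234

np.random.seed(SEED)       # seeds numpy-based sampling/initialization
tf.set_random_seed(SEED)   # seeds TF 1.x graph-level randomness
```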

Thanks!

mobias17 commented 5 years ago

Hey @wangzizhao

Thank you for your reply. Just wanted to let you know that I am doing a few more runs on my side and will come back to you soon.

mobias17 commented 5 years ago

Hi @wangzizhao,

Thank you for publishing your latest results. I used them, together with the results in the previous commits, as a benchmark.

First, I downgraded TF on my system to 1.12.0, matching your results, and ran the old September commit vs. the October commit on the FHN dataset provided in the repo. As a side note, both were run on Ubuntu 18.04 with 8 CPUs, 30 GB RAM, and an 8 GB GPU.

Given the results you uploaded, the commit looks fine in general. Apparently, this defect applies only to my environment so far…

I will try to trace the issue, and I think I have identified a good starting point. If you have any other thoughts on what might cause it, I'd be happy to hear them.

One last question: is there any major commit in progress or planned for the near future?

Thanks.

mobias17 commented 5 years ago

Hi @wangzizhao,

To give an update on the tracing: I did manage to complete full runs without an error, and in the runs I made the NaN error did not occur often enough for me to find the source yet. I adjusted the code in a few spots so that, when I encounter it next time, I hope to obtain more information.
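
(For anyone following along, the kind of instrumentation meant here can be sketched in TF 1.x by wrapping suspect tensors with tf.check_numerics, so a NaN/inf is reported at the op where it first appears rather than only at the final objective; the tensor below is a placeholder, not the actual change made.)

```python
import tensorflow as tf

# Sketch: wrap an intermediate tensor (e.g. SMC log-weights) so the session
# raises with a descriptive message as soon as it turns NaN/inf, instead of
# only failing at the final log_ZSMC check. Placeholder values for illustration.
log_weights = tf.log(tf.constant([0.2, 0.3, 0.5]))
log_weights = tf.check_numerics(log_weights, "NaN/inf detected in SMC log-weights")

with tf.Session() as sess:
    print(sess.run(log_weights))
```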

wangzizhao commented 5 years ago

Hi @mobias17!

Sorry for the late reply; these past weeks I have had a lot of presentations to prepare. I am glad to hear that you finished some runs without an error.

In the October commit, we fixed a bug in the forward filtering: the particles were being resampled uniformly rather than proportionally to their weights. This fix may increase the algorithm's dependence on a good initialization and make it run into NaN more frequently.
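
To make that concrete, resampling proportional to the weights means drawing ancestor indices from the categorical distribution defined by the particle log-weights, roughly as in this TF 1.x sketch (shapes, names, and values are illustrative assumptions, not the repo's code):

```python
import tensorflow as tf

# Minimal sketch (not the repo's implementation) contrasting resampling that
# is proportional to the particle weights with the pre-fix uniform behaviour.
# Shapes, names, and values are assumptions for illustration only.
batch_size, n_particles = 1, 4
log_weights = tf.log(tf.constant([[0.1, 0.2, 0.3, 0.4]]))         # [batch, n_particles]
particles = tf.reshape(tf.range(4, dtype=tf.float32), [1, 4, 1])  # [batch, n_particles, dim]

# After the fix: ancestor indices drawn from the categorical distribution
# defined by the log-weights, so heavier particles survive more often.
ancestors = tf.multinomial(log_weights, n_particles)               # [batch, n_particles], int64
resampled = tf.batch_gather(particles, ancestors)

# Before the fix: ancestors effectively drawn uniformly, ignoring the weights.
uniform_ancestors = tf.random_uniform([batch_size, n_particles],
                                      maxval=n_particles, dtype=tf.int64)
uniform_resampled = tf.batch_gather(particles, uniform_ancestors)

with tf.Session() as sess:
    print(sess.run([resampled, uniform_resampled]))
```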

Empirically, I would also suggest reducing the learning rate for more complicated datasets or datasets with more time steps.
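
For example, with a standard TF 1.x optimizer that just means constructing it with a smaller step size (the flag or config key the repo actually uses is not shown here; this is only an illustration):

```python
import tensorflow as tf

# Sketch of the learning-rate suggestion: e.g. drop from 1e-3 to 1e-4.
# In practice this would be set via the training script's config/flags.
theta = tf.Variable(1.0)
loss = tf.square(theta)  # stand-in for the negative log_ZSMC objective
train_op = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(loss)
```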

Sorry for the trouble of having to try lots of runs.