amoretti86 / PSVO

Implementation of Particle Smoothing Variational Objectives

PSVO math.isfinite(log_ZSMC_Train) raises Stop Training since October Fix #1

Open mobias17 opened 5 years ago

mobias17 commented 5 years ago

Hello,

Some issues have been encountered since the October fix on PSVO and the introduction of PSVOwR.

When running training, log_ZSMC is extremely small at initialization (e.g. around -7e+30) and runs into NaN during training.

Steps to reproduce:

System in use: TF 1.13.1, TFP 0.5.0

With my own dataset, log_ZSMC improves but the Valid k-Step Rsq does not.

Any clues as to what configuration changes are required so that the fix runs?
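
For context, the stop condition in the issue title corresponds to a finiteness guard on the evaluated objective, roughly along the lines of the sketch below (placeholder names such as `log_ZSMC_val` are assumptions, not the repo's exact code):

```python
import math

# Minimal sketch of the failure mode: the evaluated objective comes back
# NaN/inf and a finiteness guard halts training. `log_ZSMC_val` and the
# error message are placeholders, not the repo's exact code.
log_ZSMC_val = float("nan")  # e.g. the value returned by the session run this step

if not math.isfinite(log_ZSMC_val):
    raise RuntimeError("Stop Training: log_ZSMC became non-finite (NaN or inf).")
```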

wangzizhao commented 5 years ago

Hi @mobias17 !

Sorry for the late reply as for some reason I didn't receive any notification from GitHub. I just tried the notebook with the FHN dataset provided in the repo. It runs well and I just pushed the results for your reference. Besides, sorry for not updating the notebook with the PSVOwR flag change.

May I know the following details, so I can better diagnose the issue?

Also, the algorithm occasionally has numerical issues for some random seeds, so you can try a few different ones. But if that happens for all seeds, it is likely a tuning problem.
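
If it helps, one way to try a different seed in a TF 1.x setup is roughly the following (a sketch only; the repo may instead expose a seed through its config, and the names here are assumptions):

```python
import numpy as np
import tensorflow as tf

# Sketch of switching the random seed between runs; try several values if
# one seed runs into NaN. The repo may instead take a seed via its config.
SEED = 1234

np.random.seed(SEED)       # seeds numpy-based sampling/initialization
tf.set_random_seed(SEED)   # seeds TF 1.x graph-level randomness
```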

Thanks!

mobias17 commented 5 years ago

Hey @wangzizhao

Thank you for your reply. Just wanted to let you know that I am doing a few more runs on my side and will come back to you soon.

mobias17 commented 5 years ago

Hi @wangzizhao,

Thank you for publishing your latest results. I used them, together with the results in the previous commits, as a benchmark.

First, I downgraded TF on my system to 1.12.0, matching your results, and ran the old September commit vs. the October commit on the FHN dataset provided in the repo. As a side note, both were run on Ubuntu 18.04 with 8 CPUs, 30 GB RAM, and an 8 GB GPU.

Given the results you uploaded, the commit looks fine in general. Apparently, this defect applies only to my environment so far…

I will try to trace the issue, and I think I have identified a good starting point. If you have any other thoughts on what might cause it, I'd be happy to hear them.

One last question: is there any major commit in progress or planned for the near future?

Thanks.

mobias17 commented 5 years ago

Hi @wangzizhao,

To give an update on the tracing: I did manage to complete full runs without an error, and in the runs I made the NaN error did not occur often enough for me to find the source yet. I adjusted the code in a few spots so that, when I encounter it next time, I hope to obtain more information.
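
(For anyone following along, the kind of instrumentation meant here can be sketched in TF 1.x by wrapping suspect tensors with tf.check_numerics, so a NaN/inf is reported at the op where it first appears rather than only at the final objective; the tensor below is a placeholder, not the actual change made.)

```python
import tensorflow as tf

# Sketch: wrap an intermediate tensor (e.g. SMC log-weights) so the session
# raises with a descriptive message as soon as it turns NaN/inf, instead of
# only failing at the final log_ZSMC check. Placeholder values for illustration.
log_weights = tf.log(tf.constant([0.2, 0.3, 0.5]))
log_weights = tf.check_numerics(log_weights, "NaN/inf detected in SMC log-weights")

with tf.Session() as sess:
    print(sess.run(log_weights))
```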

wangzizhao commented 5 years ago

Hi @mobias17!

Sorry for the late reply; these past weeks I have had a lot of presentations to prepare. I am glad to hear that you finished some runs without an error.

In the October commit, we fixed a bug in the forward filtering: the particles were being resampled uniformly rather than proportionally to their weights. This fix may increase the algorithm's dependence on a good initialization and make it run into NaN more frequently.
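
To make that concrete, resampling proportional to the weights means drawing ancestor indices from the categorical distribution defined by the particle log-weights, roughly as in this TF 1.x sketch (shapes, names, and values are illustrative assumptions, not the repo's code):

```python
import tensorflow as tf

# Minimal sketch (not the repo's implementation) contrasting resampling that
# is proportional to the particle weights with the pre-fix uniform behaviour.
# Shapes, names, and values are assumptions for illustration only.
batch_size, n_particles = 1, 4
log_weights = tf.log(tf.constant([[0.1, 0.2, 0.3, 0.4]]))         # [batch, n_particles]
particles = tf.reshape(tf.range(4, dtype=tf.float32), [1, 4, 1])  # [batch, n_particles, dim]

# After the fix: ancestor indices drawn from the categorical distribution
# defined by the log-weights, so heavier particles survive more often.
ancestors = tf.multinomial(log_weights, n_particles)               # [batch, n_particles], int64
resampled = tf.batch_gather(particles, ancestors)

# Before the fix: ancestors effectively drawn uniformly, ignoring the weights.
uniform_ancestors = tf.random_uniform([batch_size, n_particles],
                                      maxval=n_particles, dtype=tf.int64)
uniform_resampled = tf.batch_gather(particles, uniform_ancestors)

with tf.Session() as sess:
    print(sess.run([resampled, uniform_resampled]))
```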

Empirically, I would also suggest reducing the learning rate for more complicated datasets or datasets with more time steps.
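
For example, with a standard TF 1.x optimizer that just means constructing it with a smaller step size (the flag or config key the repo actually uses is not shown here; this is only an illustration):

```python
import tensorflow as tf

# Sketch of the learning-rate suggestion: e.g. drop from 1e-3 to 1e-4.
# In practice this would be set via the training script's config/flags.
theta = tf.Variable(1.0)
loss = tf.square(theta)  # stand-in for the negative log_ZSMC objective
train_op = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(loss)
```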

Sorry for the trouble of having to try lots of runs.