Closed. EssamWisam closed this issue 5 months ago.
Hi @EssamWisam, thanks for your questions.
Yes, the randomness is a known issue, which we believe stems from a combination of floating-point precision error and the slightly leaky behaviour of TensorFlow more generally. A future update may resolve this issue, if we can translate our codebase to a different neural network API. In the meantime, we would recommend saving your final trained model/imputations, to aid with any replicability requirements.
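A minimal sketch of the "save your imputations" suggestion, assuming the model has already produced a list of completed DataFrames (the `imputations` variable and file names here are illustrative, not part of the MIDAS API):

```python
import os
import tempfile
import pandas as pd

# Hypothetical stand-in for the list of completed datasets a trained
# model would return; in practice this comes from your imputation run.
imputations = [
    pd.DataFrame({"x": [1.0, 2.0]}),
    pd.DataFrame({"x": [1.1, 2.1]}),
]

# Write each completed dataset to its own CSV so results can be reused
# verbatim, even if retraining is slightly nondeterministic.
out_dir = tempfile.mkdtemp()
paths = []
for i, df in enumerate(imputations):
    path = os.path.join(out_dir, f"imputation_{i}.csv")
    df.to_csv(path, index=False)
    paths.append(path)
```

Reloading the saved CSVs later (e.g. with `pd.read_csv`) sidesteps the run-to-run variation entirely.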
Scaling the input data is not necessary, but it can make a difference. The neural network will attempt to coerce the data into whatever scale it needs to minimise the reconstruction error; the unscaled Python example is evidence in favour of this. Pre-scaling your data, however, can stabilise training because, in essence, you are asking the network to do less with the same amount of training. Ultimately, the need for pre-scaling will depend on your context, and we'd recommend trying both strategies. For Python, you might consider conventional sklearn.preprocessing transformers to achieve this scaling prior to training the MIDAS model.
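A short sketch of the sklearn pre-scaling route, assuming made-up continuous columns (`age`, `income`); only the `MinMaxScaler` usage is the point, the data and names are placeholders:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data standing in for the continuous part of your dataset.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "age": rng.integers(18, 90, size=100).astype(float),
    "income": rng.normal(50_000.0, 15_000.0, size=100),
})

# Rescale continuous columns to [0, 1] before passing them to the model.
cont_cols = ["age", "income"]
scaler = MinMaxScaler()
scaled = data.copy()
scaled[cont_cols] = scaler.fit_transform(data[cont_cols])

# After imputation, scaler.inverse_transform maps values back to the
# original scale.
restored = scaler.inverse_transform(scaled[cont_cols])
```

Keeping the fitted `scaler` around is what lets you undo the transform on the imputed output.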
I'll leave this issue open for a week or so in case you wish to raise any other questions.
Thanks a lot. Clarifies everything.
Have tried running the Python example notebook and noticed that the final loss changes slightly from run to run (e.g., from 73446.1 to 73355.3) despite setting the same seed. Does this have to do with unaccounted-for randomness in the algorithm, or is it just due to rounding?

Another question: does it generally make a difference to scale the continuous data before inputting it to the algorithm? I assumed that the answer is no because it's done internally anyway; however, I noticed that in the R example the data was explicitly scaled but not in the Python example.