GrainLearning / grainLearning

A Bayesian uncertainty quantification toolbox for discrete and continuum numerical models of granular materials, developed by various projects of the University of Twente (NL), the Netherlands eScience Center (NL), University of Newcastle (AU), and Hiroshima University (JP).
https://grainlearning.readthedocs.io/
GNU General Public License v2.0

in Windows wandb doesn't generate latest-run #35

Open luisaforozco opened 1 year ago

luisaforozco commented 1 year ago

While merging the RNN module into GrainLearning, the CI/CD showed that all tests were passing on Linux and macOS but not on Windows. I debugged it on a Windows machine and found that the issue comes from wandb (see reported issue). The error occurs in the unit test test_rnn_model.py/test_train, at:

assert Path("wandb/latest-run/files/model-best.h5").exists()

Indeed, on Windows, wandb does not automatically create the symlink to latest-run.

Other options, provided by wandb, to access the files of the latest run include:

A dirty option is to manually search for the latest run folder, but this seems hard to generalize across platforms (unix and win32). Branching on a platform variable might be an option, but it comes at a cost: the code becomes more complex, and maintaining it is more error-prone.
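For reference, the manual search mentioned above could look roughly like the sketch below. It assumes that wandb names its run folders `run-<timestamp>-<id>` (or `offline-run-<timestamp>-<id>`) inside the `wandb` directory; the helper name `find_latest_run` is hypothetical and not part of wandb or GrainLearning.

```python
from pathlib import Path
from typing import Optional


def find_latest_run(wandb_dir: Path = Path("wandb")) -> Optional[Path]:
    """Return the most recently modified run folder in wandb_dir, or None."""
    if not wandb_dir.is_dir():
        return None
    # Assumption: run folders start with "run-" or "offline-run-".
    # The "latest-run" symlink itself does not match either prefix.
    candidates = [
        p for p in wandb_dir.iterdir()
        if p.is_dir() and p.name.startswith(("run-", "offline-run-"))
    ]
    # Pick the folder with the newest modification time.
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)
```

With this, the test could check `run_dir / "files" / "model-best.h5"` after confirming that `run_dir` is not `None`. It relies only on `pathlib`, so it needs no platform branch, but it does depend on wandb's folder naming staying stable.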

luisaforozco commented 1 year ago

I have added a decorator to test_train so that it is skipped on Windows (sys.platform=='win32'). While debugging this, I also found that Keras models created and saved on macOS cannot be loaded on Windows, whereas a model created and saved on Windows can be loaded on Windows. Thus, I deactivated the check for loading a full model, which is of course not ideal...
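A minimal sketch of such a skip, written with the standard library's `unittest.skipIf` so that it runs without extra dependencies. The test body is a placeholder, and the class name `TestRnnModel` is hypothetical; with pytest, `pytest.mark.skipif(condition, reason=...)` plays the same role.

```python
import sys
import unittest

# True only when running on Windows.
ON_WINDOWS = sys.platform == "win32"


class TestRnnModel(unittest.TestCase):
    @unittest.skipIf(ON_WINDOWS, "wandb does not create latest-run on Windows")
    def test_train(self):
        # Placeholder body: the real test trains the model and checks
        # that wandb/latest-run/files/model-best.h5 exists.
        self.assertTrue(True)
```

The skip reason shows up in the test report, which makes it easier to remember why the test is disabled on that platform.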

luisaforozco commented 1 year ago

I've created the branch test_windows_wandb_simlink to try a few things that the wandb maintainers suggested. Specifically: wandb.init(settings=wandb.Settings(symlink=True)), applied only on Windows, since on other platforms this setting is not necessary. The latest-run folder now seems to exist, but I get a new error:

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'wandb\\debug-cli.runneradmin.log'

This error is triggered when trying to remove the wandb folder. I tried adding if sys.platform=='win32': wandb.finish(), but I still get the error.

luisaforozco commented 1 year ago

After a lot of research on this issue, I have come to the conclusion that wandb does not properly close debug-cli.{user_name}.log until the Python process has finished. This is of course not a problem on unix, but on Windows it makes it impossible to delete the folder containing that file. This is particularly annoying in the unit tests because the teardown (deleting created folders and files) cannot be completed. A possible solution is to not delete the wandb folder on Windows, but this means that Windows users running the tests will be left with this stray wandb folder. On the GitHub runners that will not be a problem. I also tried wandb sync --clean-force, but that throws an error if the user is not logged in, and could pollute the wandb workspace.
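One way to express the "do not delete on Windows" option in the test teardown is sketched below. The helper name `remove_wandb_dir` and its `platform` parameter are hypothetical and only there for illustration. The idea is to attempt the deletion everywhere and tolerate a `PermissionError` on Windows only.

```python
import shutil
import sys
from pathlib import Path


def remove_wandb_dir(path: Path = Path("wandb"), platform: str = sys.platform) -> bool:
    """Try to delete path; return True if it no longer exists afterwards."""
    if not path.exists():
        return True
    try:
        shutil.rmtree(path)
    except PermissionError:
        # On Windows a file that is still open (e.g. the debug-cli log)
        # blocks deletion; leave the folder behind instead of failing.
        if platform != "win32":
            raise
    return not path.exists()
```

Note that `shutil.rmtree` may already have removed part of the folder before hitting the locked file, so a leftover wandb folder on Windows can be incomplete.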