WwZzz / easyFL

An experimental platform for federated learning.
Apache License 2.0
519 stars · 88 forks

Fail to run with torch.multiprocessing #15

Open dnkhanh45 opened 1 year ago

dnkhanh45 commented 1 year ago

With the argument num_threads > 1, I got this error: AttributeError: 'Server' object has no attribute 'delayed_communicate_with'. Can someone help me? Thank you very much!

WwZzz commented 1 year ago

Sorry, this is a bug that occurs when torch.multiprocessing and Python decorators are used together: the 'spawn' start method is incompatible with the decorators. I searched for a proper solution, and I ended up solving it in my new repo FLGo by implementing the functionality of 'delayed_communicate_with' in another decorator that is not called in the subprocesses. A quick workaround is to comment out the decorators on fedbase.BasicServer.communicate and fedbase.BasicClient.train if you do not need to simulate system heterogeneity. I will port the same change from FLGo to this repo and fix this bug as soon as possible. Thanks for your issue.
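
To make the failure mode concrete, here is a minimal, self-contained sketch (not easyFL's actual code) of why a method attached to a class at runtime disappears under the 'spawn' start method: the child process re-imports the module from scratch, so mutations made in the parent after import are never replayed there.

```python
# Minimal sketch, NOT easyFL's code: reproduces the same AttributeError
# by monkey-patching a method in the parent and calling it in a spawned child.
import torch.multiprocessing as mp

class Server:
    def communicate_with(self, client_id):
        return f"reply to client {client_id}"

def worker(server, client_id):
    # The spawned child re-imports this module; the __main__ guard below
    # never runs there, so the patched attribute does not exist in the child.
    print(server.delayed_communicate_with(client_id))

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    # Runtime patch: exists only in the parent process's class object.
    Server.delayed_communicate_with = Server.communicate_with
    p = mp.Process(target=worker, args=(Server(), 0))
    p.start()
    p.join()
    # Child raises:
    # AttributeError: 'Server' object has no attribute 'delayed_communicate_with'
```

Under 'fork' the child inherits the parent's memory, so such a patch would survive; 'spawn' starts a fresh interpreter, which is presumably why the bug only surfaces with torch.multiprocessing.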

dnkhanh45 commented 1 year ago

Thank you very much for your reply. I've tried your method, and also a variant that comments out all the decorators on fedbase.BasicServer.communicate, fedbase.BasicClient.train, and fedbase.BasicServer.communicate_with, but it still does not work.

dnkhanh45 commented 1 year ago

I am going to switch to your new repo FLGo. Thank you again for your projects; they've helped me a lot.

WwZzz commented 1 year ago

I've commented out the decorators on fedbase.BasicServer.communicate (i.e. @ss.with_dropout and @ss.with_clock), fedbase.BasicClient.train (i.e. only @ss.with_completeness), and fedbase.BasicServer.communicate_with (i.e. @ss.with_latency), while preserving the decorator @fmodule.with_completeness on fedbase.BasicClient.train. After doing this, I ran main.py with --num_threads 6, and it seems to work well in my environment. When I further increase num_threads, the error becomes CUDA OOM. Could you provide more details about the second bug? Thanks!

[screenshot]

The running option is --gpu 0 --num_threads 6 --server_with_cpu --logger simple_logger.
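
For readers landing on this issue, here is a sketch of where those decorators sit. Only the decorator and method names are taken from this thread; the class bodies are placeholders, not easyFL's actual source.

```python
# Illustrative placeholders for the classes in fedbase; only the decorator
# names come from this thread. Commented-out lines mark the workaround.
class BasicServer:
    # @ss.with_dropout    # commented out: incompatible with 'spawn'
    # @ss.with_clock      # commented out
    def communicate(self, selected_clients):
        ...

    # @ss.with_latency    # commented out
    def communicate_with(self, client_id):
        ...

class BasicClient:
    # @ss.with_completeness       # commented out
    # @fmodule.with_completeness  # KEPT in the real code, per the comment above
    def train(self, model):
        ...
```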

dnkhanh45 commented 1 year ago

I've tried your method; it finishes round 1 but gets stuck after that: [screenshot]

WwZzz commented 1 year ago

I cannot reproduce your bug, which is a little confusing. The same warning also appears on my machine but has no obvious impact on training. I wonder whether it is related to the GPU hardware. What happens if you run the same command on the CPU?
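
As a generic way to localize this kind of hang (my suggestion, not part of easyFL), Python's faulthandler can periodically dump every thread's stack trace, showing where the stuck process is blocked after round 1.

```python
# Generic hang-debugging aid (assumption: added near the top of main.py;
# each spawned worker would need its own call in the function it runs).
# Dumps all threads' stack traces to stderr every 120 s until cancelled.
import sys
import faulthandler

faulthandler.dump_traceback_later(120, repeat=True, file=sys.stderr)
```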

dnkhanh45 commented 1 year ago

[screenshot]

I've got the same error with the CPU: it cannot start training round 2.