facebookresearch / CrypTen

A framework for Privacy Preserving Machine Learning
MIT License
1.52k stars 274 forks

@mpc.run_multiprocess can not work correctly since CUDA initialization error occurs #511

Open SlInevitable2003 opened 1 month ago

SlInevitable2003 commented 1 month ago

I have installed the requirements as instructed and can successfully execute Tutorial 1, which suggests that the basic dependencies are installed. However, Tutorial 2 does not work correctly for me. When executing

import crypten
import crypten.mpc as mpc
import crypten.communicator as comm

@mpc.run_multiprocess(world_size=2)
def examine_arithmetic_shares():
    x_enc = crypten.cryptensor([1, 2, 3], ptype=crypten.mpc.arithmetic)

    rank = comm.get().get_rank()
    crypten.print(f"\nRank {rank}:\n {x_enc}\n", in_order=True)

x = examine_arithmetic_shares()

I get the following error:

Process Process-2:
Process Process-1:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/CrypTen/tutorials/crypten/mpc/context.py", line 29, in _launch
    crypten.init()
  File "/home/CrypTen/tutorials/crypten/mpc/context.py", line 29, in _launch
    crypten.init()
  File "/home/CrypTen/tutorials/crypten/__init__.py", line 77, in init
    _setup_prng()
  File "/home/CrypTen/tutorials/crypten/__init__.py", line 77, in init
    _setup_prng()
  File "/home/CrypTen/tutorials/crypten/__init__.py", line 202, in _setup_prng
    generators[key][device] = torch.Generator(device=device)
  File "/home/CrypTen/tutorials/crypten/__init__.py", line 202, in _setup_prng
    generators[key][device] = torch.Generator(device=device)
RuntimeError: CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
RuntimeError: CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
ERROR:root:One of the parties failed. Check past logs

It seems that the error is caused by initializing CUDA in multiple processes. Please help me with that!
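One way to test this hypothesis is to hide the GPU from the parent process before PyTorch is imported, so that the worker processes started by the decorator never try to create a CUDA generator. The following is a minimal sketch, not a confirmed fix. It assumes that CrypTen only sets up CUDA generators when `torch.cuda.is_available()` is true, which I have not verified against the CrypTen source. `run_cpu_only_check` is a hypothetical helper name, and the `try`/`except` guard exists only so the snippet loads on machines without CrypTen installed.

```python
import os

# Assumption: with no visible GPUs, torch.cuda.is_available() returns False,
# so crypten.init() would only create CPU generators in each worker.
# This must run before torch or crypten is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

try:
    import torch
    import crypten
    import crypten.mpc as mpc
    import crypten.communicator as comm
except ImportError:
    crypten = None  # allows the sketch to load without CrypTen installed


def run_cpu_only_check():
    """Hypothetical helper: rerun the Tutorial 2 snippet with GPUs hidden."""
    if crypten is None:
        return None

    # Sanity check that the environment variable took effect.
    assert not torch.cuda.is_available()

    @mpc.run_multiprocess(world_size=2)
    def examine_arithmetic_shares():
        x_enc = crypten.cryptensor([1, 2, 3], ptype=crypten.mpc.arithmetic)
        rank = comm.get().get_rank()
        crypten.print(f"\nRank {rank}:\n {x_enc}\n", in_order=True)

    return examine_arithmetic_shares()


if __name__ == "__main__":
    run_cpu_only_check()
```

If the snippet runs cleanly this way, that would point to CUDA initialization in the worker processes as the cause. It does not make the decorator work on the GPU.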

Yoshi234 commented 1 month ago

I have run into the same issue previously. I'm not sure exactly what is causing this problem though. Do you have multiple CUDA devices available on your system?

SlInevitable2003 commented 1 month ago

No, my computer only has one GPU. Is that the key reason?

Yoshi234 commented 1 month ago

I actually saw that someone else had a similar issue, #322. It seems to be related to how GPU support was implemented and tested. However, that issue is old, from around 2021. It may have been fixed by now, but I haven't been able to get CUDA working with the run_multiprocess decorator either. I'm running CrypTen in a Conda environment, but I don't know whether that is relevant.

SlInevitable2003 commented 1 month ago

Thanks a lot! I will check that issue carefully.

Yoshi234 commented 1 month ago

After looking at issue #305, it seems that someone else was able to get the MPC CIFAR experiment to run correctly on the GPU, and the bug that caused that issue has already been fixed. For some reason, the fix doesn't extend to the @mpc.run_multiprocess() decorator, though. I'm going to try that approach as well to get my code running on the GPU.