Lightning-AI / pytorch-lightning


LightningWork does not move to the GPU #14777

yuvals1 closed this issue 9 months ago

yuvals1 commented 2 years ago


Bug description

I am trying to run an app with a LightningWork of type ServeGradio that should run on my local GPU. I am passing `L.CloudCompute("gpu")` to the LightningWork (I also tried `"cuda"`), but it does not seem to move my model to the GPU. When I try to move the model to the GPU explicitly, I get the following error: `Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method`. When I do not try to move the model to the GPU, the process runs.

How to reproduce the bug

class VideoServeGradio(ServeGradio):

    inputs = gr.Video()
    outputs = "playable_video"

    def __init__(self, cloud_compute, *args, **kwargs):
        super().__init__(*args, cloud_compute=cloud_compute, **kwargs)
        print("cuda", torch.cuda.is_available()) # this prints True

    def run(self):
        super().run()

    def predict(self, video):
        self.model(video)
        inferred_video_path = "./artifacts/out_vids/nonamevid.mp4"  # this is the local path where the inferred video is saved
        return inferred_video_path

    def build_model(self):
        print("cuda:", torch.cuda.is_available()) # this prints True as well
        pipe = MyPipeline(face_geometry_path=None)
        pipe.to("cuda") # this results in an error
        return pipe

class Flow(L.LightningFlow):
    def __init__(self):
        super().__init__()
        print("cuda:::::", torch.cuda.is_available())
        self.serve_work = VideoServeGradio(cloud_compute=L.CloudCompute("gpu"))

    def run(self):
        self.serve_work.run()

    def configure_layout(self):
        tab_2 = {"name": "Interactive demo", "content": self.serve_work}
        return [tab_2]

app = L.LightningApp(Flow(), debug=True)

Error messages and logs


# Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

Important info


#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 1.10):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

tchaton commented 2 years ago

Hey @yuvals1.

Thanks for trying Lightning App.

The `Cannot re-initialize CUDA in forked subprocess` error means CUDA is being initialized before the serving subprocess is forked, so the model has to be moved to the GPU after the fork, inside `predict` rather than `build_model`.

Could you try this:

class VideoServeGradio(ServeGradio):

    inputs = gr.Video()
    outputs = "playable_video"

    def run(self):
        super().run()

    def predict(self, video):
        # Move the model to cuda in the predict method.
        model = self.model.cuda()
        video = video.cuda()
        output = model(video)
        return output.cpu().item()

    def build_model(self):
        return MyPipeline(face_geometry_path=None)

class Flow(L.LightningFlow):
    def __init__(self):
        super().__init__()
        self.serve_work = VideoServeGradio()

    def run(self):
        self.serve_work.run()

    def configure_layout(self):
        tab_2 = {"name": "Interactive demo", "content": self.serve_work}
        return [tab_2]

app = L.LightningApp(Flow(), debug=True)
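
As a small refinement (a sketch only, assuming `MyPipeline` is a regular `torch.nn.Module` and `video` is already a tensor), the GPU copy can be cached so the host-to-device transfer happens only on the first request:

    def predict(self, video):
        # predict runs inside the forked serving process; as long as CUDA
        # was never initialized before the fork, initializing it here works.
        # Cache the GPU copy so the transfer is paid only once.
        if getattr(self, "_cuda_model", None) is None:
            self._cuda_model = self.model.cuda()
        video = video.cuda()
        output = self._cuda_model(video)
        return output.cpu().item()
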
yuvals1 commented 2 years ago

Hey @tchaton, thanks for the response. I tried your suggestion, and unfortunately the process now cannot find CUDA for some reason: `CUDA driver initialization failed, you might not have a CUDA gpu`. Any ideas why?

tchaton commented 2 years ago

Hey @yuvals1. Some progress, and a different error.

Mind trying this?

    def predict(self, video):
        # Move the model to cuda in the predict method.
        torch.cuda.set_device(torch.device('cuda:0'))
        model = self.model.cuda()
        video = video.cuda()
        output = model(video)
        return output.cpu().item()
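
`torch.cuda.set_device` makes `cuda:0` the default device for subsequent CUDA allocations in the serving process.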

cc @awaelchli

awaelchli commented 2 years ago

Hi @yuvals1 Is PyTorch working fine on that system otherwise? Please check that this works:

python -c "import torch; torch.rand(2).to('cuda:0')" 

Because the error

CUDA driver initialization failed, you might not have a CUDA gpu.

would suggest that your system/display driver is perhaps outdated?
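
If the driver is the suspect, running `nvidia-smi` on that machine will show the installed driver version and whether the GPU is visible at all.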

As @tchaton said, the error

Cannot re-initialize CUDA in forked subprocess.

is from torch: it seems that `gradio.Interface().launch()`, which we use under the hood, forks a subprocess to serve the app. This is a limitation of torch, and thus all CUDA operations should be performed inside the `predict` function. Hmm, I'm not sure what we could do here.
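
For anyone who wants to see the underlying PyTorch behaviour in isolation, here is a minimal standalone sketch (no Lightning or Gradio involved): once the parent process has initialized CUDA, a forked child fails with the re-initialization error, while a spawned child starts fresh and works.

import torch
import torch.multiprocessing as mp

def worker():
    # The first CUDA call in the child initializes CUDA there.
    print(torch.rand(2, device="cuda:0"))

if __name__ == "__main__":
    torch.cuda.init()  # the parent touches CUDA first
    # With "fork", worker() raises: Cannot re-initialize CUDA in forked subprocess.
    # With "spawn", the child gets a fresh interpreter and succeeds.
    ctx = mp.get_context("spawn")
    p = ctx.Process(target=worker)
    p.start()
    p.join()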

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!