GoogleCloudPlatform / data-science-on-gcp

Source code accompanying book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017
Apache License 2.0

Ch9 - training deep neural network - how to attach GPU? #165

Closed jgammerman closed 1 year ago

jgammerman commented 1 year ago

@lakshmanok - regarding my earlier issue (#164), I've ended up manually exporting the data from BQ to cloud storage using the GUI.

The rest of the notebook is working fine, but now that I'm training the deep neural network it's awfully slow (I'm still on the first of the 10 epochs and it's not even halfway through after 10 minutes!).

I'm guessing that the problem is that I'm using CPUs rather than a GPU. On p.322 of the book you state "Making sure that the Vertex AI Workbench notebook that I’m working on has a GPU attached to it, I can now launch off the training job..." but, if I'm not mistaken, neither the textbook nor the notebook covers how to do this?

I've already set up my fully-managed notebook to enable an NVIDIA T4 GPU, but I believe that it won't be attached automatically without me doing something else.

[image]

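The Google Cloud docs (https://cloud.google.com/vertex-ai/docs/training/configure-compute#create_custom_job_gpus-console) refer to creating a separate CustomJob to achieve this. Is that what you did, or is there a quicker way?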

lakshmanok commented 1 year ago

When you launch the notebook instance, make sure to specify a machine type with a GPU. You can also stop the instance, add a GPU, and then restart it.

thanks, Lak

jgammerman commented 1 year ago

Still no quicker I'm afraid - one epoch is taking about 20 minutes...

My notebook instance now comes with a GPU:

[image]

And when starting the notebook I've selected TensorFlow 2 (Local) as my kernel; previously it was Python (local):

[image]

I can't see any other options for specifying that my GPU should be used...
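One way to check whether TensorFlow can see the attached GPU is to list the physical devices from inside the notebook. This is a minimal sketch, assuming a TensorFlow 2.x kernel; the helper name `visible_gpus` is hypothetical, and it returns an empty list when TensorFlow is not installed or no GPU is visible.

```python
def visible_gpus():
    """Return the names of the GPUs TensorFlow can see (empty list if none)."""
    try:
        import tensorflow as tf  # assumption: a TensorFlow 2.x kernel is selected
    except ImportError:
        return []
    # list_physical_devices reports the devices TensorFlow detected at startup
    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = visible_gpus()
if gpus:
    print("TensorFlow sees:", ", ".join(gpus))
else:
    print("No GPU is visible to TensorFlow from this kernel.")
```

If the list is empty even though the instance has a GPU, the kernel or driver setup is a more likely culprit than the training code itself.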

jgammerman commented 1 year ago

Also - I have a really dumb question but it's come up before in this book so I may as well ask it now...

I'd like to see the CPU/GPU usage of the VM that my notebook is running on. In other cloud platforms (e.g. Azure) you have to connect the notebook to a VM manually every time, which makes this easy to do.

But in GCP, everything seems to happen in the background and it's not clear how to inspect your VM. If I go to the VM Instances API in the console, it looks like I don't have any:

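[image: image] https://user-images.githubusercontent.com/8484188/218159743-09a7e68a-de0b-4fbf-ace3-b8a51d0248c9.png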

Please could you advise? Sorry if this is a stupid question but I'm guessing it's not just me who is confused!

lakshmanok commented 1 year ago

(1) If you didn't change the line that says DEVELOP=True in the notebook, each epoch should take less than a minute. By any chance, are the notebook's compute region and the bucket's region different?

(2) You can look at the notebook's GPU/CPU usage etc. by clicking on the notebook name (in the Vertex Workbench area) and selecting the "Monitoring" tab.

thanks Lak
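The region question in (1) can be checked by comparing the notebook's location with the bucket's location. The sketch below is illustrative only: the function names and the example values are hypothetical, and it assumes a regional bucket (a multi-region location such as "US" will not match any single region).

```python
def to_region(location):
    """Normalise a zone or region name to a region.

    'us-central1-a' -> 'us-central1'; 'us-central1' is returned unchanged.
    """
    location = location.lower()
    head, _, tail = location.rpartition("-")
    # A zone is a region followed by a single-letter suffix.
    if head and len(tail) == 1 and tail.isalpha():
        return head
    return location


def same_region(notebook_location, bucket_location):
    """True when a regional bucket sits in the notebook's region."""
    return to_region(notebook_location) == bucket_location.lower()


# Hypothetical values; substitute the ones shown for your notebook and bucket.
print(same_region("us-central1-a", "US-CENTRAL1"))   # True
print(same_region("us-central1-a", "EUROPE-WEST2"))  # False
```

When the two differ, every training batch is read across regions, which could plausibly account for slow epochs.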

jgammerman commented 1 year ago

1) Oh I changed DEVELOP=True to DEVELOP=False after successfully running 2 epochs very quickly, under a minute as you said. The flow of the Jupyter notebook is somewhat different to the textbook chapter so I thought that was what I was supposed to do - maybe not!

2) Unfortunately I can't see any Monitoring tab, only Logs:

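[image: image] https://user-images.githubusercontent.com/8484188/218164225-ad6f5059-dd10-4932-8684-687c1ac75053.png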

Thanks for these quick responses, by the way. I'll be sure to mention them in the glowing Amazon review I give of the book once I'm done with it!

lakshmanok commented 1 year ago

Yeah, you don't need to run on the full dataset. You can just try it out on a small sample. In later chapters, I'll have you copy over my model that was trained over the whole thing.
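As a rough illustration of how a development flag can keep a trial run small, the sketch below picks the number of training examples from a flag. Everything here is hypothetical (the function name, the default sample size, and the dataset size) and is not taken from the book's notebook; it only mirrors the idea behind the DEVELOP setting.

```python
DEVELOP = True  # hypothetical flag, mirroring the idea of the notebook's DEVELOP setting


def training_size(develop, full_examples, sample_examples=10_000):
    """Number of examples to train on: a small sample in develop mode, else all."""
    if develop:
        return min(sample_examples, full_examples)
    return full_examples


# Illustrative numbers only.
print(training_size(DEVELOP, full_examples=1_000_000))  # 10000
print(training_size(False, full_examples=1_000_000))    # 1000000
```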

Re: monitoring: you are using managed notebooks rather than the user-managed notebooks that I was using: https://cloud.google.com/vertex-ai/docs/workbench/managed/introduction

Part of the control you give up when you ask Vertex AI to manage the notebook lifecycle is that it runs the notebook in a tenant project, so your ability to monitor it is limited. Think of managed notebooks as being like Google Colab.

Lak

jgammerman commented 1 year ago

I see! Thank you Lak.