Liam's coop term - GitHub Issues

liamc2042 commented 3 years ago

Had an issue where it wasn't saving the figures/graphs in the designated folder. Found a fix that seems to work though.

liamc2042 commented 3 years ago

It seems my WiFi card doesn't support Linux (or have drivers that support Linux), so I'm going to be blocked on getting Ubuntu and Docker started up. I was thinking of just getting a supported USB network adapter that I can plug in whenever I use Ubuntu.

RVSagar commented 3 years ago

Ah okay, no worries, this is with a dual-boot and not with a VM I presume (in a VM you could do something like Network 'bridged' mode)? This can be an issue at times with Linux/Ubuntu, a separate WiFi dongle is probably best if no drivers are available.

liamc2042 commented 3 years ago

Yeah, I was able to install and get the dual boot running. I'll look at some dongles on Amazon and make sure they have proper driver support.

liamc2042 commented 3 years ago

General overview from last week about what I learned with the CIFAR10 tutorial. Learned what batch size and epochs are in relation to the number of iterations the model takes. Was introduced to a baseline model, the VGG model, and how it stacks convolutional layers with 3x3 filters followed by a max pooling layer. Then learned some ways to improve a baseline model using methods such as dropout regularization, weight decay and data augmentation. I then got to save a model, test it on a data set and then test it using a single photo.
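
As a rough illustration of how batch size and epochs relate to iterations: one iteration processes one batch, so the number of iterations per epoch is the dataset size divided by the batch size, rounded up. The sketch below assumes the 50,000 training images of CIFAR-10; the batch size of 64 and the 100 epochs are made-up values for illustration, not settings taken from the tutorial.

```python
import math


def iterations_per_epoch(num_samples, batch_size):
    """Weight updates in one full pass over the data (the last batch may be partial)."""
    return math.ceil(num_samples / batch_size)


def total_iterations(num_samples, batch_size, epochs):
    """Weight updates over the whole training run."""
    return iterations_per_epoch(num_samples, batch_size) * epochs


# Assumed values, for illustration only.
per_epoch = iterations_per_epoch(50000, 64)
total = total_iterations(50000, 64, 100)
print(per_epoch, total)
```

Halving the batch size roughly doubles the iterations per epoch, which is why the two settings are usually tuned together.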

liamc2042 commented 3 years ago

Milestone: Got Ubuntu connected to the internet, installed docker along with the Nvidia things and was able to launch the environment and get the sim running with the car moving forward as well.

RVSagar commented 3 years ago

Nice! That's good to hear, I guess the environment is set up well to start some of our core tasks soon.

liamc2042 commented 3 years ago

Seems the first part of the Unit 4 section of the Udemy course simulation environment doesn't work (or at least I can't find one that works). Not much I can do other than just continue without this section.

liamc2042 commented 3 years ago

Been trying to run test_learning.py from the Docker terminal using rosrun road_data_generation test_learning.py, but the terminal can't find test_learning.py, and if I navigate to the folder and do ./test_learning.py it says permission is denied. I haven't made any modifications to test_learning.py yet as I just wanted to see how it ran before making my own changes.

RVSagar commented 3 years ago

Oh yeah, this is common with Python scripts, just do "chmod +x test_learning.py" in the terminal (to make it runnable), and then proceed with "./test_learning.py" as you were trying.
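
For reference, the permission change can also be made from Python. This is a minimal sketch of roughly what chmod +x does (closer to chmod a+x, since it ignores the umask); the helper name make_executable is hypothetical, and it runs on a throwaway temp file rather than the real test_learning.py. The missing execute bit is also a likely reason rosrun could not find the script, since rosrun only looks for executable files.

```python
import os
import stat
import tempfile


def make_executable(path):
    """Add execute permission for user, group and others, similar to `chmod a+x`."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


# Demonstrate on a temporary file (hypothetical stand-in for the script).
fd, demo_path = tempfile.mkstemp(suffix=".py")
os.close(fd)

before = bool(os.stat(demo_path).st_mode & stat.S_IXUSR)
make_executable(demo_path)
after = bool(os.stat(demo_path).st_mode & stat.S_IXUSR)
print(before, after)

os.remove(demo_path)
```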

liamc2042 commented 3 years ago

Having an issue with Tensorflow and Keras. I couldn't run test_learning.py because I was missing some packages like Tensorflow and Keras. So I did the pip install for both, but pip installed Tensorflow version 2.1 and Keras requires 2.2 or higher. I tried specifying a version for Tensorflow but that didn't work, and doing pip install Tensorflow --upgrade didn't change the version either.

RVSagar commented 3 years ago

Is this inside the Docker container? Can you try the steps outlined in this issue to install the dependencies: https://github.com/RVSagar/uw-auto-rc-car/issues/13 (this seemed to work for me a month ago). I've been meaning to integrate them into the actual Dockerfile. As a note, if you exit out of the container, everything you've installed will be gone (the container doesn't save these things unless explicitly told to), so you can get a clean slate this way. Alternatively, if you have Tensorflow/Keras set up locally, you can use that. You don't need the Docker container for this initial work in generating images/testing network architectures.

liamc2042 commented 3 years ago

That seems to have worked. I'll use that for now!

RVSagar commented 3 years ago

I added an updated Dockerfile that includes all the Tensorflow/Keras stuff and also uses the GPU inside the container. It's available in this commit acdfe6627a6161da808d54c751d281c6ef1f09b7

You can git merge the master into your own branch and you should have the file.

To build it, you just need to type make tf in the terminal; the build should finish successfully after 10-15 minutes. Then you can run ./start_docker.sh latest yes as usual, and the new container with Tensorflow will be used.

liamc2042 commented 3 years ago

I've been getting this error today which seems to be stopping me from running the test_learning.py: Internal: ptxas exited with non-zero error code 65280, output: ptxas fatal : Value 'sm_86' is not defined for option 'gpu-name'. I didn't change anything from yesterday and I've been starting docker the same way as well so I'm unsure what may have happened.

RVSagar commented 3 years ago

So the training was working before (on the GPU)? I looked here: and it looks like for Ampere GPUs (3000 series NVIDIA cards), you can't run CUDA 10? The Tensorflow dockerfile is built with CUDA 10.1.

If this is indeed the issue, you could try replacing FROM nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04 with FROM nvidia/11.2.0-cudnn8-devel-ubuntu18.04 in Dockerfile.Tensorflow and try rebuilding the image and testing.

liamc2042 commented 3 years ago

Yeah, it was working before, which is weird to me. I'll add that change to the file.

liamc2042 commented 3 years ago

That didn't work. It says that the repository either doesn't exist or that I don't have access to it and it requires docker login. I'll see what I can find online about it.

RVSagar commented 3 years ago

Sorry, I think the command should be FROM nvidia/cuda:11.2.0-cudnn8-devel-ubuntu18.04. I'm rebuilding on my computer now and I'll report back if it works.
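
To see why the first image name was rejected, it can help to split both references into repository and tag. The sketch below is a simplified parser written for illustration (the function name split_image_reference is hypothetical, and it ignores digests such as @sha256:...); it is not how Docker itself parses references.

```python
def split_image_reference(ref):
    """Split a Docker image reference into (repository, tag).

    Simplified: a tag, if present, follows a colon in the last path
    component. Digests (@sha256:...) are not handled.
    """
    last_slash = ref.rfind("/")
    colon = ref.rfind(":")
    if colon > last_slash:
        return ref[:colon], ref[colon + 1:]
    return ref, None


wrong = split_image_reference("nvidia/11.2.0-cudnn8-devel-ubuntu18.04")
right = split_image_reference("nvidia/cuda:11.2.0-cudnn8-devel-ubuntu18.04")
print(wrong)
print(right)
```

In the first form the whole string is read as a repository name with no tag, so Docker looks for a repository that doesn't exist; in the corrected form the repository is nvidia/cuda and the version string is the tag.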

liamc2042 commented 3 years ago

I was able to build it on my computer but it comes out with a new error about not being able to "dlopen some GPU libraries". I noticed some of the commands in the Tensorflow Dockerfile have calls related to the cuda version, specifically under apt-get update, would these need to be changed to the version I'm trying to use?

RVSagar commented 3 years ago

Ah right, I forgot about the caveat that for specific Tensorflow versions, you need to use specific cuda/cudnn versions.

So for Tensorflow 2.1.0, which gets installed in the Dockerfile, we need to use cudnn 7 and cuda 10.1. These presumably won't work with the Ampere 3000 series GPUs. It's unfortunate because we're using ROS Melodic and Python 2.7; I think we'll try to update in the future so we have Python 3 support.

I guess the best bet now is to work locally for doing all the training on your GPU. i.e., maybe install Anaconda or some Conda environment and work with Python 3, the latest Tensorflow/keras and those should support your GPU.
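
The conflict described above can be written down as a small check. This is a sketch under stated assumptions: the Tensorflow 2.1.0 / CUDA 10.1 pairing comes from this thread, the minimum CUDA releases per GPU architecture are listed only for the two entries relevant here (not a complete table), and all names in the code are hypothetical.

```python
# CUDA version each Tensorflow release is built against (from this thread).
TF_REQUIRED_CUDA = {"2.1.0": "10.1"}

# Minimum CUDA toolkit release that knows each GPU architecture.
# Only the entries relevant here; not a complete table.
MIN_CUDA_FOR_ARCH = {
    "sm_75": (10, 0),  # Turing, e.g. RTX 20 series
    "sm_86": (11, 1),  # Ampere consumer cards, e.g. RTX 30 series
}


def parse_version(text):
    """Turn '10.1' into (10, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def cuda_supports(cuda_version, arch):
    """True if the given CUDA release can compile for the architecture."""
    return parse_version(cuda_version) >= MIN_CUDA_FOR_ARCH[arch]


def tf_works_on(tf_version, arch):
    """True if the CUDA release required by this Tensorflow supports the GPU."""
    return cuda_supports(TF_REQUIRED_CUDA[tf_version], arch)


print(tf_works_on("2.1.0", "sm_86"))
print(cuda_supports("11.2.0", "sm_86"))
```

Switching the base image to CUDA 11.2 satisfies the GPU but not Tensorflow 2.1.0, which still expects the CUDA 10.1 libraries; that matches the dlopen error seen after the rebuild.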

liamc2042 commented 3 years ago

Ah I see. I still have Anaconda on my local Windows PC, so I can just push what I have done and then see about getting it to work there.

liamc2042 commented 3 years ago

So I've been trying a couple of different structures for the CNN. I have one that I made based on the CIFAR tutorial as well as one that a previous coop made based on the Nvidia structure, and both are stuck at about 16.33 percent accuracy. Some of the additions, like data augmentation, did improve the accuracy (going from 12 to 16 percent), while others, like LeakyReLU and Dropout, seemed to have zero impact on it. I also tried varying values for the region of interest and the general size of the photos, but again, neither really seemed to change the accuracy. I'm going to add a plot similar to the CIFAR one just to get a better visualization of what's happening, but I was also wondering if I should try increasing the data set size? Right now it's 2500.
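
When accuracy stays pinned at nearly the same value across architectures, one thing that may be worth ruling out is the network predicting a single class for every image. The sketch below computes that "always guess the most common label" baseline for comparison. It assumes the task is classification over discrete labels; the label names and counts are invented for illustration and are not the project's data.

```python
from collections import Counter


def majority_class_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Hypothetical label distribution, for illustration only.
example_labels = ["left"] * 2 + ["straight"] * 5 + ["right"] * 3
label, baseline = majority_class_baseline(example_labels)
print(label, baseline)
```

If the measured accuracy sits close to this baseline on the real labels, the model may not be learning from the images at all, which would point toward the data or labels rather than the architecture.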

RVSagar commented 3 years ago

Okay nice, thanks for trying out these different approaches. Yeah if you can make some plots we can get a better idea of how the learning is progressing and maybe spot any issues. You could also try processing the images further (e.g., calling https://github.com/RVSagar/uw-auto-rc-car/blob/master/catkin_ws/src/auto_rc_car_demos/src/auto_rc_car_demos/basic_camera.py#L162 on the images before passing to the network). I expected that the network would be able to learn well from the raw images but perhaps some more preprocessing can help. And yeah, give that a shot, try doubling to 5000. Also, just keep a copy of the models you have from these networks (the .model files). Tomorrow I'll go over the process of testing on the car simulator (even though the accuracy is lower than we want now, we can go over this so you can see how your models react while the car is driving around).

RVSagar / uw-auto-rc-car

Liam's coop term #15