
Modern-Computer-Vision-with-PyTorch, Second Edition

This is the code repository for Modern-Computer-Vision-with-PyTorch, Second Edition, published by Packt.

A practical roadmap from deep learning fundamentals to advanced applications and Generative AI

The authors of this book are Kishore Ayyadevara and Yeshwanth Reddy.

About the book

Whether you are a beginner or are looking to progress in your computer vision career, this book guides you through the fundamentals of neural networks (NNs) and PyTorch, and shows you how to implement state-of-the-art architectures for real-world tasks.

The second edition of Modern Computer Vision with PyTorch is fully updated to explain and provide practical examples of the latest multimodal models, such as CLIP and Stable Diffusion.

You’ll discover best practices for working with images, tweaking hyperparameters, and moving models into production. As you progress, you'll implement various use cases for facial keypoint recognition, multi-object detection, segmentation, and human pose detection. This book provides a solid foundation in image generation as you explore different GAN architectures. You’ll leverage transformer-based architectures like ViT, TrOCR, BLIP2, and LayoutLM to perform various real-world tasks and build a diffusion model from scratch. Additionally, you’ll utilize foundation models' capabilities to perform zero-shot object detection and image segmentation. Finally, you’ll learn best practices for deploying a model to production.

By the end of this deep learning book, you'll confidently leverage modern NN architectures to solve real-world computer vision problems.

Running on a cloud platform or in your environment

To run these notebooks on a cloud platform, just click on one of the badges in the table below, or run them in your own environment.

| Chapter | Colab |
| --- | --- |
| Chapter 1 | Open In Colab badges (one per notebook; 5 notebooks) |
| Chapter 2 | Open In Colab badges (one per notebook; 11 notebooks) |
| Chapter 3 | Open In Colab badges (one per notebook; 14 notebooks) |
| Chapter 4 | Open In Colab badges (one per notebook; 8 notebooks) |

Key Takeaways

Outline and Chapter Summary

This book provides a hands-on approach to solving over 30 prominent real-world computer vision problems using PyTorch 2.x on actual datasets. Here you’ll learn to build a neural network from scratch and optimize hyperparameters, perform image classification, multi-object detection, segmentation, and more. You'll also explore facial expression manipulation and combining CV with NLP and RL techniques, build generative AI applications, and take your model to production on AWS. By the end of this book, you'll master modern NN architectures and confidently solve real-world CV problems.

Chapter 01, Artificial Neural Network Fundamentals

In this chapter, we will create a very simple architecture on a simple dataset and mainly focus on how the various building blocks (feedforward, backpropagation, and learning rate) of an ANN help in adjusting the weights so that the network learns to predict the expected outputs from given inputs. We will first learn, mathematically, what a neural network is, and then build one from scratch to have a solid foundation. Then we will learn about each component responsible for training the neural network and code them as well. Overall, we will cover the following topics:
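To give a flavor of what this chapter builds, here is a minimal sketch (ours, not the book's exact code) of one feedforward and backpropagation step for a one-hidden-layer network trained with a mean squared error loss:

```python
import numpy as np

def feed_forward(x, w1, w2):
    hidden = 1 / (1 + np.exp(-(x @ w1)))     # sigmoid activation
    return hidden, hidden @ w2               # hidden activations, output

x = np.array([[1.0, 2.0]])                   # one input sample
y = np.array([[1.0]])                        # expected output
w1 = np.random.randn(2, 3) * 0.1             # input -> hidden weights
w2 = np.random.randn(3, 1) * 0.1             # hidden -> output weights
lr = 0.01                                    # learning rate

hidden, pred = feed_forward(x, w1, w2)
grad_pred = 2 * (pred - y)                   # dLoss/dPred for MSE loss
grad_w2 = hidden.T @ grad_pred               # gradient w.r.t. w2
grad_hidden = (grad_pred @ w2.T) * hidden * (1 - hidden)
grad_w1 = x.T @ grad_hidden                  # gradient w.r.t. w1
w2 -= lr * grad_w2                           # weight updates
w1 -= lr * grad_w1
```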

Chapter 02, PyTorch Fundamentals

In this chapter, we will dive into the foundations of building a neural network using PyTorch, which we will leverage multiple times in subsequent chapters when we learn about various use cases in image analysis. We will start by learning about the core data type that PyTorch works on – tensor objects. We will then dive deep into the various operations that can be performed on tensor objects and how to leverage them when building a neural network model on top of a toy dataset (so that we strengthen our understanding before we gradually look at more realistic datasets, starting with the next chapter). This will allow us to gain an intuition of how to build neural network models using PyTorch to map input and output values. Finally, we will learn about implementing custom loss functions so that we can customize them based on the use case we are solving. Specifically, this chapter will cover the following topics:
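As an illustration of these ideas, the following is a small, self-contained sketch (illustrative, not the book's code) of tensor objects, a toy model, and a hand-rolled custom loss:

```python
import torch
import torch.nn as nn

x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])   # toy inputs
y = torch.tensor([[3.0], [7.0]])             # toy targets (sum of inputs)

model = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.01)

def my_mse(pred, target):                    # a custom loss function
    return ((pred - target) ** 2).mean()

for _ in range(100):
    opt.zero_grad()
    loss = my_mse(model(x), y)
    loss.backward()                          # autograd computes gradients
    opt.step()                               # update parameters
```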

Chapter 03, Building a Deep Neural Network with PyTorch

In this chapter, we will shift gears and learn how to perform image classification using neural networks. Essentially, we will learn how to represent images and tweak the hyperparameters of a neural network to understand their impact. For the sake of not introducing too much complexity and confusion, we only covered the fundamental aspects of neural networks in the previous chapter. However, there are many more inputs that we tweak in a network while training it. Typically, these inputs are known as hyperparameters. In contrast to the parameters in a neural network (which are learned during training), hyperparameters are provided by the person who builds the network. Changing different aspects of each hyperparameter is likely to affect the accuracy or speed of training a neural network. Furthermore, a few additional techniques such as scaling, batch normalization, and regularization help in improving the performance of a neural network. We will learn about these concepts throughout this chapter.
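By way of illustration, here is a hedged sketch of the kinds of hyperparameter knobs involved (batch size, learning rate, batch normalization); the values are arbitrary placeholders:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

batch_size, lr = 32, 1e-3                    # hyperparameters, chosen by us
x, y = torch.rand(1000, 28 * 28), torch.randint(0, 10, (1000,))
loader = DataLoader(TensorDataset(x, y), batch_size=batch_size, shuffle=True)

model = nn.Sequential(
    nn.Linear(28 * 28, 128),
    nn.BatchNorm1d(128),                     # batch normalization layer
    nn.ReLU(),
    nn.Linear(128, 10),
)
opt = torch.optim.Adam(model.parameters(), lr=lr)
loss_fn = nn.CrossEntropyLoss()

for xb, yb in loader:                        # one epoch over mini-batches
    opt.zero_grad()
    loss_fn(model(xb), yb).backward()
    opt.step()
```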

In this chapter, we will cover the following topics:

Chapter 04, Introducing Convolutional Neural Networks

In this chapter, we will learn about where traditional deep neural networks fall short. We'll then learn about the inner workings of convolutional neural networks (CNNs) using a toy example, before understanding some of their major hyperparameters, including strides, pooling, and filters. Next, we will leverage CNNs, along with various data augmentation techniques, to address the accuracy limitations of traditional deep neural networks. Following this, we will learn about what the outcome of the feature learning process in a CNN looks like. Finally, we'll put our learning together to solve a use case: classifying an image by stating whether it contains a dog or a cat. By doing this, we'll be able to understand how prediction accuracy varies with the amount of data available for training. By the end of this chapter, you will have a deep understanding of CNNs, which form the backbone of multiple model architectures used for various tasks. The following topics will be covered in this chapter:
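For a taste of the building blocks named above, here is a minimal, illustrative CNN (not the book's architecture) that wires together filters, strides, and pooling into a dog-vs-cat style binary classifier:

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1),  # 16 learned filters
    nn.ReLU(),
    nn.MaxPool2d(2),                        # pooling halves the spatial size
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), # stride-2 convolution
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(32, 1),                       # one logit: dog vs. cat
)

logit = cnn(torch.rand(4, 3, 224, 224))     # batch of 4 RGB images
print(logit.shape)                          # torch.Size([4, 1])
```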

Chapter 05, Transfer Learning for Image Classification

In this chapter, we will learn about two different families of transfer learning architectures – variants of the Visual Geometry Group (VGG) architecture and variants of the residual network (ResNet) architecture. Along with understanding the architectures, we will study their application in two use cases. The first is age and gender classification, where we will learn about optimizing over both cross-entropy and mean absolute error losses at the same time to estimate the age and predict the gender of a person from an image. The second is facial keypoint detection (detecting keypoints such as the eyes, eyebrows, and chin contour, given an image of a face as input), where we will learn about leveraging neural networks to generate multiple (136, instead of 1) continuous outputs in a single prediction. Finally, we will learn about a new library that considerably reduces code complexity across the remaining chapters. In summary, the following topics are covered in the chapter:
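As an illustrative sketch of the multi-task idea (our simplification, not the book's exact model), a pretrained ResNet backbone can feed two heads, one regressing age and one classifying gender:

```python
import torch
import torch.nn as nn
from torchvision import models

class AgeGenderModel(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights="IMAGENET1K_V1")  # pretrained weights
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.age_head = nn.Linear(512, 1)        # continuous age estimate
        self.gender_head = nn.Linear(512, 1)     # gender logit

    def forward(self, x):
        feat = self.features(x).flatten(1)       # shared 512-dim features
        return self.age_head(feat), self.gender_head(feat)

model = AgeGenderModel()
age, gender = model(torch.rand(2, 3, 224, 224))
```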

Chapter 06, Practical Aspects of Image Classification

This chapter will further solidify our understanding of CNNs and the various practical aspects to be considered when leveraging them in real-world applications. We will start by understanding the reasons why CNNs predict the classes that they do by using class activation maps (CAMs). Following this, we will learn about the various data augmentations that can be done to improve the accuracy of a model. Finally, we will learn about the various instances where models could go wrong in the real world and highlight the aspects that should be taken care of in such scenarios to avoid pitfalls. The following topics will be covered in this chapter:
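To make the CAM idea concrete, here is a rough, illustrative sketch (not the book's implementation): the last convolutional feature maps are weighted by the classifier weights of the predicted class:

```python
import torch
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
features = {}
model.layer4.register_forward_hook(
    lambda m, i, o: features.update(out=o))   # capture last conv block output

x = torch.rand(1, 3, 224, 224)                # stand-in for a real image
logits = model(x)
cls = logits.argmax(1).item()                 # predicted class
fc_w = model.fc.weight[cls]                   # classifier weights for that class
cam = torch.einsum("c,chw->hw", fc_w, features["out"][0])
cam = (cam - cam.min()) / (cam.max() - cam.min())  # normalize to [0, 1]
```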

Chapter 07, Basics of Object Detection

In this chapter and the next, we will learn about some of the techniques for performing object detection. We will start by learning the fundamentals – labeling ground-truth bounding boxes using a tool named ybat, extracting region proposals using the selectivesearch method, and defining the accuracy of bounding box predictions by using the intersection over union (IoU) and mean average precision metrics. After this, we will learn about two region proposal-based networks – R-CNN and Fast R-CNN – by first learning about their working details and then implementing them on a dataset that contains images belonging to trucks and buses. The following topics will be covered in this chapter:
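As a concrete reference, here is a minimal IoU computation for two boxes in [x1, y1, x2, y2] format (an illustrative helper, not the book's code):

```python
def iou(box_a, box_b):
    # corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)  # intersection / union

print(iou([0, 0, 100, 100], [50, 50, 150, 150]))     # ~0.143
```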

Chapter 08, Advanced Object Detection

In this chapter, we will learn about different modern techniques, such as Faster R-CNN, YOLO, and single-shot detector (SSD), that overcome slow inference time by employing a single model to make predictions for both the class of object and the bounding box in a single shot. We will start by learning about anchor boxes and then proceed to learn how each of the techniques works and how to implement them to detect objects in an image. We will cover the following topics in this chapter:
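To make anchor boxes concrete, here is a small illustrative sketch (the scales and ratios are placeholder defaults, not any particular detector's) that generates nine anchors of varying scale and aspect ratio around one location:

```python
import itertools

def anchors_at(cx, cy, scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    boxes = []
    for s, r in itertools.product(scales, ratios):
        w, h = s * r ** 0.5, s / r ** 0.5    # area s*s, aspect ratio w/h = r
        boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return boxes                             # 9 anchors per location

print(len(anchors_at(112, 112)))             # 9
```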

Chapter 09, Image Segmentation

In this chapter, we will go one step further by not only drawing a bounding box around an object but also by identifying the exact pixels that contain the object. In addition to that, by the end of this chapter, we will be able to single out instances/objects that belong to the same class. We will also learn about semantic segmentation and instance segmentation by looking at the U-Net and Mask R-CNN architectures. Specifically, we will cover the following topics:
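As a heavily simplified illustration of the U-Net idea (ours, not the book's model), an encoder downsamples, a decoder upsamples, and a skip connection concatenates features at matching scales:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.mid = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.out = nn.Conv2d(32, n_classes, 1)   # per-pixel class scores

    def forward(self, x):
        e = self.enc(x)                          # full-resolution features
        m = self.mid(self.down(e))               # downsampled features
        u = self.up(m)                           # upsample back
        return self.out(torch.cat([u, e], 1))    # skip connection

mask = TinyUNet()(torch.rand(1, 3, 64, 64))      # shape (1, 2, 64, 64)
```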

Chapter 10, Applications of Object Detection and Segmentation

In this chapter, we will take our learning a step further – we will work on more realistic scenarios and learn about frameworks/architectures that are more optimized to solve detection and segmentation problems. We will start by leveraging the Detectron2 framework to train and detect custom objects present in an image. We will also predict the pose of humans present in an image using a pre-trained model. Furthermore, we will learn how to count the number of people in a crowd in an image and then learn about leveraging segmentation techniques to perform image colorization. Next, we will learn about a modified version of YOLO to predict 3D bounding boxes around objects by using point clouds obtained from a LIDAR sensor. Finally, we will learn about recognizing actions from a video. By the end of this chapter, you will have learned about the following:
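As a hedged sketch of Detectron2 inference (the config name follows the public Detectron2 model zoo and may change between releases; the input image path is hypothetical):

```python
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5   # confidence threshold

predictor = DefaultPredictor(cfg)
outputs = predictor(cv2.imread("image.jpg"))  # hypothetical input image
print(outputs["instances"].pred_classes)      # detected class indices
```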

Chapter 11, Autoencoders and Image Manipulation

In this chapter, we will learn about representing an image in a lower dimension using autoencoders and then leveraging the lower-dimensional representation of an image to generate new images by using variational autoencoders. Learning how to represent images in a lower number of dimensions helps us manipulate (modify) the images to a considerable degree. We will also learn about generating novel images that are based on the content and style of two different images. We will then explore how to modify images in such a way that the image is visually unaltered; however, the class corresponding to the image is changed from one to another. Finally, we will learn about generating deepfakes: given a source image of person A, we generate a target image of person B with a similar facial expression as that of person A. Overall, we will go through the following topics in this chapter:
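For a concrete starting point, here is a minimal illustrative autoencoder (not the book's architecture) that squeezes 28x28 images through a low-dimensional bottleneck and reconstructs them:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(),
            nn.Linear(128, latent_dim))              # lower-dimensional code
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 28 * 28), nn.Sigmoid())

    def forward(self, x):
        z = self.encoder(x)                          # compress
        return self.decoder(z).view(-1, 1, 28, 28)   # reconstruct

recon = AutoEncoder()(torch.rand(8, 1, 28, 28))
```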

Chapter 12, Image Generation Using GANs

In this chapter, we will start by learning about the idea behind what makes GANs work, before building one from scratch. GANs are a vast field that is expanding as we write this book. This chapter will lay the foundation of GANs by covering three variants; we will learn about more advanced GANs and their applications in the next chapter. In this chapter, we will explore the following topics:
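To make the adversarial idea concrete, here is an illustrative single training step (our simplification, with small MLPs standing in for real generator/discriminator architectures): the discriminator learns to separate real from fake, and the generator learns to fool it:

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.rand(32, 784)                     # stand-in for real images
z = torch.randn(32, 64)                        # random latent vectors

opt_d.zero_grad()                              # 1) discriminator step
d_loss = bce(D(real), torch.ones(32, 1)) + \
         bce(D(G(z).detach()), torch.zeros(32, 1))
d_loss.backward(); opt_d.step()

opt_g.zero_grad()                              # 2) generator step
g_loss = bce(D(G(z)), torch.ones(32, 1))       # "pretend fakes are real"
g_loss.backward(); opt_g.step()
```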

Chapter 13, Advanced GANs to Manipulate Images

In this chapter, we will learn about leveraging GANs to manipulate images. We will learn about two variations of generating images using GANs – paired and unpaired methods. With the paired method, we provide input and output pair combinations and generate images based on an input image, which we will learn about with the Pix2Pix GAN. With the unpaired method, we specify the input and the output but do not provide one-to-one correspondence between them; instead, we expect the GAN to learn the structure of the two classes and convert an image from one class to another, which we will learn about when we discuss CycleGAN. Another class of unpaired image manipulation involves generating images from a latent space of random vectors and seeing how the images change as the latent vector values change, which we will cover in the Leveraging StyleGAN on custom images section. Finally, we will learn about leveraging a pre-trained GAN – the Super-Resolution Generative Adversarial Network (SRGAN) – which can turn a low-resolution image into a high-resolution one. Specifically, we will learn about the following topics:
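As a concrete illustration of the cycle-consistency signal behind unpaired translation (with single-layer placeholders standing in for real generators): translating A to B and back to A should reconstruct the original image:

```python
import torch
import torch.nn as nn

G_AB = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))  # placeholder generator A->B
G_BA = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))  # placeholder generator B->A
l1 = nn.L1Loss()

real_A = torch.rand(1, 3, 128, 128)            # unpaired image from domain A
fake_B = G_AB(real_A)                          # translate A -> B
cycled_A = G_BA(fake_B)                        # translate back B -> A
cycle_loss = l1(cycled_A, real_A)              # enforce reconstruction
```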

Chapter 14, Combining Computer Vision and Reinforcement Learning

In this chapter, we will learn how to combine reinforcement learning-based techniques (primarily, deep Q-learning) with computer vision-based techniques. This is especially useful in scenarios where the learning environment is complex and we cannot gather data for all the cases. In such scenarios, we want the model to learn by itself in a simulated environment that resembles reality as closely as possible. Such models come in handy for self-driving cars, robotics, bots in games (real as well as digital), and the field of self-supervised learning in general. We will start by learning about the basics of reinforcement learning and the terminology associated with calculating the value (Q-value) of taking an action in a given state. Then, we will learn about filling a Q-table, which helps to identify the value associated with various actions in a given state. We will also learn about identifying the Q-values of various actions in scenarios where building a Q-table is infeasible due to a high number of possible states; we'll do this using a Deep Q-Network (DQN). This is where we will understand how to leverage neural networks in combination with reinforcement learning. Then, we will learn about scenarios where the DQN model by itself does not work and address them by using a DQN alongside the fixed targets model. Here, we will play the video game Pong by leveraging a CNN in conjunction with reinforcement learning. Finally, we will leverage what we've learned to build an agent that can drive a car autonomously in a simulated environment – CARLA. In summary, in this chapter, we will cover the following topics:
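To make the Q-value update concrete, here is a minimal illustrative Bellman update on a Q-table (the state/action indices and hyperparameters are placeholders):

```python
import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))            # the Q-table
alpha, gamma = 0.1, 0.99                       # learning rate, discount factor

state, action, reward, next_state = 0, 2, 1.0, 5   # one observed transition
Q[state, action] += alpha * (
    reward + gamma * Q[next_state].max() - Q[state, action])
```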

Chapter 15, Combining Computer Vision and NLP Techniques

In this chapter, we will switch gears and learn about how a convolutional neural network (CNN) can be used in conjunction with algorithms in the broad family of transformers, which (as of the time of writing this book) are heavily used in natural language processing (NLP), to develop solutions that leverage both computer vision and NLP. To understand how to combine CNNs and transformers, we will first learn about how vision transformers (ViTs) work and how they help in performing image classification. After that, we will learn about leveraging transformers to perform transcription of handwritten images using Transformer-based optical character recognition (TrOCR). Next, we will learn about combining transformers and OCR to perform question answering on document images using a technique named LayoutLM. Finally, we will learn about performing visual question answering using a transformer architecture named Bootstrapping Language-Image Pre-training (BLIP2). By the end of this chapter, you will have learned about the following topics:
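As an illustrative sketch of ViT-based image classification using the Hugging Face transformers library (the checkpoint name is one public example; the image path is hypothetical):

```python
import torch
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

image = Image.open("cat.jpg")                  # hypothetical input image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # class scores over ImageNet labels
print(model.config.id2label[logits.argmax(-1).item()])
```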

Chapter 16, Foundation Models in Computer Vision

In this chapter, we will learn about foundation models and how to utilize their capabilities for tasks such as zero-shot object detection and image segmentation. In particular, we will learn about:

Chapter 17, Applications of Stable Diffusion

In this chapter, we will learn about the diffusion model training process and code some of the applications of Stable Diffusion. In particular, we will cover the following topics:
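As an illustrative sketch of text-to-image generation with the diffusers library (the checkpoint is one commonly used public example; a CUDA GPU is assumed):

```python
import torch
from diffusers import StableDiffusionPipeline

# load a pre-trained Stable Diffusion pipeline in half precision
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
image = pipe("a photograph of an astronaut riding a horse").images[0]
image.save("astronaut.png")
```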

Chapter 18, Moving a Model to Production

In this chapter, we will deploy a simple application, progressively improve its latency while modifying the model's parameters/architecture, and build a mechanism to identify input drift. The following topics will be covered in this chapter:
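As an illustrative sketch of one deployment pattern (a FastAPI endpoint wrapping a torchvision model; the route and names are ours, not the book's exact app):

```python
import io
import torch
from fastapi import FastAPI, File, UploadFile
from PIL import Image
from torchvision import models, transforms

app = FastAPI()
model = models.resnet18(weights="IMAGENET1K_V1").eval()  # placeholder model
preprocess = transforms.Compose([
    transforms.Resize((224, 224)), transforms.ToTensor()])

@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    image = Image.open(io.BytesIO(await file.read())).convert("RGB")
    with torch.no_grad():
        logits = model(preprocess(image).unsqueeze(0))
    return {"class_id": int(logits.argmax(1))}           # predicted class index
```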

If you feel this book is for you, get your copy today!

Software and Hardware List

With the following software and hardware, you can run all the code files present in the book.

| Chapter | Software required | OS required |
| --- | --- | --- |
| 1-18 | Minimum 8 GB RAM, Intel i5 processor or better | Windows, Mac OS X, and Linux (any) |
| | NVIDIA 8+ GB graphics card – GTX 1070 or better | |
| | Minimum 50 Mbps internet speed | |
| | Python 3.6 and above | |
| | PyTorch 1.7 | |
| | Google Colab (can run in any browser) | |

Know more on the Discord server

You can get more engaged and follow the latest updates and discussions in the community on the Discord server: Discord

Download a free PDF

If you have already purchased a print or Kindle version of this book, you can get a DRM-free PDF version at no cost. Simply click on the link to claim your free PDF: Free-Ebook

We also provide a PDF file with color images of the screenshots/diagrams used in this book: GraphicBundle

Get to Know the Authors

V Kishore Ayyadevara leads a team focused on using AI to solve problems in the healthcare space. He has more than 10 years' experience in the field of data science with prominent technology companies. In his current role, he is responsible for developing a variety of cutting-edge analytical solutions that have an impact at scale while building strong technical teams. Kishore has filed 8 patents at the intersection of machine learning, healthcare, and operations. Prior to this book, he authored four books in the fields of machine learning and deep learning. Kishore got his MBA from IIM Calcutta and his engineering degree from Osmania University.

Yeshwanth Reddy is a senior data scientist with a strong focus on the research and implementation of cutting-edge technologies to solve problems in the health and computer vision domains. He has filed four patents in the field of OCR. He also has two years of teaching experience, during which he delivered sessions to thousands of students in the fields of statistics, machine learning, AI, and natural language processing. He completed his MTech and BTech at IIT Madras.

Other Related Books