
Modern-Computer-Vision-with-PyTorch, Second Edition

This is the code repository for Modern-Computer-Vision-with-PyTorch, Second Edition, published by Packt.

A practical roadmap from deep learning fundamentals to advanced applications and Generative AI

The authors of this book are Kishore Ayyadevara and Yeshwanth Reddy.

About the book

Whether you are a beginner or are looking to progress in your computer vision career, this book guides you through the fundamentals of neural networks (NNs) and PyTorch, and shows you how to implement state-of-the-art architectures for real-world tasks.

The second edition of Modern Computer Vision with PyTorch is fully updated to explain and provide practical examples of the latest multimodal models, such as CLIP and Stable Diffusion.

You’ll discover best practices for working with images, tweaking hyperparameters, and moving models into production. As you progress, you'll implement various use cases for facial keypoint recognition, multi-object detection, segmentation, and human pose detection. This book provides a solid foundation in image generation as you explore different GAN architectures. You’ll leverage transformer-based architectures like ViT, TrOCR, BLIP2, and LayoutLM to perform various real-world tasks and build a diffusion model from scratch. Additionally, you’ll utilize foundation models' capabilities to perform zero-shot object detection and image segmentation. Finally, you’ll learn best practices for deploying a model to production.

By the end of this deep learning book, you'll confidently leverage modern NN architectures to solve real-world computer vision problems.

Running on a cloud platform or in your environment

To run these notebooks on a cloud platform, just click on one of the badges in the table below, or run them in your own environment.

| Chapter | Colab |
| --- | --- |
| Chapter 1 | Open In Colab badges (one per notebook; 5 notebooks) |
| Chapter 2 | Open In Colab badges (one per notebook; 11 notebooks) |
| Chapter 3 | Open In Colab badges (one per notebook; 14 notebooks) |
| Chapter 4 | Open In Colab badges (one per notebook; 8 notebooks) |

Key Takeaways

Outline and Chapter Summary

This book provides a hands-on approach to solving over 30 prominent real-world computer vision problems using PyTorch 2.x on actual datasets. Here you’ll learn to build a neural network from scratch and optimize hyperparameters, perform image classification, multi-object detection, segmentation, and more. You'll also explore facial expression manipulation and combining CV with NLP and RL techniques, build generative AI applications, and take your model to production on AWS. By the end of this book, you'll master modern NN architectures and confidently solve real-world CV problems.

Chapter 01, Artificial Neural Network Fundamentals

In this chapter, we will create a very simple architecture on a simple dataset and mainly focus on how the various building blocks (feedforward, backpropagation, and learning rate) of an ANN help in adjusting the weights so that the network learns to predict the expected outputs from given inputs. We will first learn, mathematically, what a neural network is, and then build one from scratch to have a solid foundation. Then we will learn about each component responsible for training the neural network and code them as well. Overall, we will cover the following topics:
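To give a flavor of what this chapter builds, here is a minimal sketch (ours, not the book's exact code) of one feedforward and backpropagation step for a one-hidden-layer network trained with a mean squared error loss:

```python
import numpy as np

def feed_forward(x, w1, w2):
    hidden = 1 / (1 + np.exp(-(x @ w1)))     # sigmoid activation
    return hidden, hidden @ w2               # hidden activations, output

x = np.array([[1.0, 2.0]])                   # one input sample
y = np.array([[1.0]])                        # expected output
w1 = np.random.randn(2, 3) * 0.1             # input -> hidden weights
w2 = np.random.randn(3, 1) * 0.1             # hidden -> output weights
lr = 0.01                                    # learning rate

hidden, pred = feed_forward(x, w1, w2)
grad_pred = 2 * (pred - y)                   # dLoss/dPred for MSE loss
grad_w2 = hidden.T @ grad_pred               # gradient w.r.t. w2
grad_hidden = (grad_pred @ w2.T) * hidden * (1 - hidden)
grad_w1 = x.T @ grad_hidden                  # gradient w.r.t. w1
w2 -= lr * grad_w2                           # weight updates
w1 -= lr * grad_w1
```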

Chapter 02, PyTorch Fundamentals

In this chapter, we will dive into the foundations of building a neural network using PyTorch, which we will leverage multiple times in subsequent chapters when we learn about various use cases in image analysis. We will start by learning about the core data type that PyTorch works on – tensor objects. We will then dive deep into the various operations that can be performed on tensor objects and how to leverage them when building a neural network model on top of a toy dataset (so that we strengthen our understanding before we gradually look at more realistic datasets, starting with the next chapter). This will allow us to gain an intuition of how to build neural network models using PyTorch to map input and output values. Finally, we will learn about implementing custom loss functions so that we can customize them based on the use case we are solving. Specifically, this chapter will cover the following topics:
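As an illustration of these ideas, the following is a small, self-contained sketch (illustrative, not the book's code) of tensor objects, a toy model, and a hand-rolled custom loss:

```python
import torch
import torch.nn as nn

x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])   # toy inputs
y = torch.tensor([[3.0], [7.0]])             # toy targets (sum of inputs)

model = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.01)

def my_mse(pred, target):                    # a custom loss function
    return ((pred - target) ** 2).mean()

for _ in range(100):
    opt.zero_grad()
    loss = my_mse(model(x), y)
    loss.backward()                          # autograd computes gradients
    opt.step()                               # update parameters
```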

Chapter 03, Building a Deep Neural Network with PyTorch

In this chapter, we will shift gears and learn how to perform image classification using neural networks. Essentially, we will learn how to represent images and tweak the hyperparameters of a neural network to understand their impact. For the sake of not introducing too much complexity and confusion, we only covered the fundamental aspects of neural networks in the previous chapter. However, there are many more inputs that we tweak in a network while training it. Typically, these inputs are known as hyperparameters. In contrast to the parameters in a neural network (which are learned during training), hyperparameters are provided by the person who builds the network. Changing different aspects of each hyperparameter is likely to affect the accuracy or speed of training a neural network. Furthermore, a few additional techniques such as scaling, batch normalization, and regularization help in improving the performance of a neural network. We will learn about these concepts throughout this chapter.
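By way of illustration, here is a hedged sketch of the kinds of hyperparameter knobs involved (batch size, learning rate, batch normalization); the values are arbitrary placeholders:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

batch_size, lr = 32, 1e-3                    # hyperparameters, chosen by us
x, y = torch.rand(1000, 28 * 28), torch.randint(0, 10, (1000,))
loader = DataLoader(TensorDataset(x, y), batch_size=batch_size, shuffle=True)

model = nn.Sequential(
    nn.Linear(28 * 28, 128),
    nn.BatchNorm1d(128),                     # batch normalization layer
    nn.ReLU(),
    nn.Linear(128, 10),
)
opt = torch.optim.Adam(model.parameters(), lr=lr)
loss_fn = nn.CrossEntropyLoss()

for xb, yb in loader:                        # one epoch over mini-batches
    opt.zero_grad()
    loss_fn(model(xb), yb).backward()
    opt.step()
```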

In this chapter, we will cover the following topics:

Chapter 04, Introducing Convolutional Neural Networks

In this chapter, we will learn about where traditional deep neural networks fall short. We'll then learn about the inner workings of convolutional neural networks (CNNs) using a toy example, before understanding some of their major hyperparameters, including strides, pooling, and filters. Next, we will leverage CNNs, along with various data augmentation techniques, to address the accuracy limitations of traditional deep neural networks. Following this, we will learn about what the outcome of the feature learning process in a CNN looks like. Finally, we'll put our learning together to solve a use case: classifying an image by stating whether it contains a dog or a cat. By doing this, we'll be able to understand how prediction accuracy varies with the amount of data available for training. By the end of this chapter, you will have a deep understanding of CNNs, which form the backbone of multiple model architectures used for various tasks. The following topics will be covered in this chapter:
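For a taste of the building blocks named above, here is a minimal, illustrative CNN (not the book's architecture) that wires together filters, strides, and pooling into a dog-vs-cat style binary classifier:

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1),  # 16 learned filters
    nn.ReLU(),
    nn.MaxPool2d(2),                        # pooling halves the spatial size
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), # stride-2 convolution
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(32, 1),                       # one logit: dog vs. cat
)

logit = cnn(torch.rand(4, 3, 224, 224))     # batch of 4 RGB images
print(logit.shape)                          # torch.Size([4, 1])
```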

Chapter 05, Transfer Learning for Image Classification

In this chapter, we will learn about two different families of transfer learning architectures – variants of the Visual Geometry Group (VGG) architecture and variants of the residual network (ResNet) architecture. Along with understanding the architectures, we will study their application in two use cases. The first is age and gender classification, where we will learn about optimizing over both cross-entropy and mean absolute error losses at the same time to estimate the age and predict the gender of a person from an image. The second is facial keypoint detection (detecting keypoints such as the eyes, eyebrows, and chin contour, given an image of a face as input), where we will learn about leveraging neural networks to generate multiple (136, instead of 1) continuous outputs in a single prediction. Finally, we will learn about a new library that considerably reduces code complexity across the remaining chapters. In summary, the following topics are covered in the chapter:
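As an illustrative sketch of the multi-task idea (our simplification, not the book's exact model), a pretrained ResNet backbone can feed two heads, one regressing age and one classifying gender:

```python
import torch
import torch.nn as nn
from torchvision import models

class AgeGenderModel(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights="IMAGENET1K_V1")  # pretrained weights
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.age_head = nn.Linear(512, 1)        # continuous age estimate
        self.gender_head = nn.Linear(512, 1)     # gender logit

    def forward(self, x):
        feat = self.features(x).flatten(1)       # shared 512-dim features
        return self.age_head(feat), self.gender_head(feat)

model = AgeGenderModel()
age, gender = model(torch.rand(2, 3, 224, 224))
```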

Chapter 06, Practical Aspects of Image Classification

This chapter will further solidify our understanding of CNNs and the various practical aspects to be considered when leveraging them in real-world applications. We will start by understanding the reasons why CNNs predict the classes that they do by using class activation maps (CAMs). Following this, we will learn about the various data augmentations that can be done to improve the accuracy of a model. Finally, we will learn about the various instances where models could go wrong in the real world and highlight the aspects that should be taken care of in such scenarios to avoid pitfalls. The following topics will be covered in this chapter:
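To make the CAM idea concrete, here is a rough, illustrative sketch (not the book's implementation): the last convolutional feature maps are weighted by the classifier weights of the predicted class:

```python
import torch
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
features = {}
model.layer4.register_forward_hook(
    lambda m, i, o: features.update(out=o))   # capture last conv block output

x = torch.rand(1, 3, 224, 224)                # stand-in for a real image
logits = model(x)
cls = logits.argmax(1).item()                 # predicted class
fc_w = model.fc.weight[cls]                   # classifier weights for that class
cam = torch.einsum("c,chw->hw", fc_w, features["out"][0])
cam = (cam - cam.min()) / (cam.max() - cam.min())  # normalize to [0, 1]
```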

Chapter 07, Basics of Object Detection

In this chapter and the next, we will learn about some of the techniques for performing object detection. We will start by learning the fundamentals – labeling ground-truth bounding boxes using a tool named ybat, extracting region proposals using the selectivesearch method, and defining the accuracy of bounding box predictions by using the intersection over union (IoU) and mean average precision metrics. After this, we will learn about two region proposal-based networks – R-CNN and Fast R-CNN – by first learning about their working details and then implementing them on a dataset that contains images belonging to trucks and buses. The following topics will be covered in this chapter:
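As a concrete reference, here is a minimal IoU computation for two boxes in [x1, y1, x2, y2] format (an illustrative helper, not the book's code):

```python
def iou(box_a, box_b):
    # corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)  # intersection / union

print(iou([0, 0, 100, 100], [50, 50, 150, 150]))     # ~0.143
```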

Chapter 08, Advanced Object Detection

In this chapter, we will learn about different modern techniques, such as Faster R-CNN, YOLO, and single-shot detector (SSD), that overcome slow inference time by employing a single model to make predictions for both the class of object and the bounding box in a single shot. We will start by learning about anchor boxes and then proceed to learn how each of the techniques works and how to implement them to detect objects in an image. We will cover the following topics in this chapter:
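To make anchor boxes concrete, here is a small illustrative sketch (the scales and ratios are placeholder defaults, not any particular detector's) that generates nine anchors of varying scale and aspect ratio around one location:

```python
import itertools

def anchors_at(cx, cy, scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    boxes = []
    for s, r in itertools.product(scales, ratios):
        w, h = s * r ** 0.5, s / r ** 0.5    # area s*s, aspect ratio w/h = r
        boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return boxes                             # 9 anchors per location

print(len(anchors_at(112, 112)))             # 9
```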

Chapter 09, Image Segmentation

In this chapter, we will go one step further by not only drawing a bounding box around an object but also by identifying the exact pixels that contain the object. In addition to that, by the end of this chapter, we will be able to single out instances/objects that belong to the same class. We will also learn about semantic segmentation and instance segmentation by looking at the U-Net and Mask R-CNN architectures. Specifically, we will cover the following topics:
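As a heavily simplified illustration of the U-Net idea (ours, not the book's model), an encoder downsamples, a decoder upsamples, and a skip connection concatenates features at matching scales:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.mid = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.out = nn.Conv2d(32, n_classes, 1)   # per-pixel class scores

    def forward(self, x):
        e = self.enc(x)                          # full-resolution features
        m = self.mid(self.down(e))               # downsampled features
        u = self.up(m)                           # upsample back
        return self.out(torch.cat([u, e], 1))    # skip connection

mask = TinyUNet()(torch.rand(1, 3, 64, 64))      # shape (1, 2, 64, 64)
```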

Chapter 10, Applications of Object Detection and Segmentation

In this chapter, we will take our learning a step further – we will work on more realistic scenarios and learn about frameworks/architectures that are more optimized to solve detection and segmentation problems. We will start by leveraging the Detectron2 framework to train and detect custom objects present in an image. We will also predict the pose of humans present in an image using a pre-trained model. Furthermore, we will learn how to count the number of people in a crowd in an image and then learn about leveraging segmentation techniques to perform image colorization. Next, we will learn about a modified version of YOLO to predict 3D bounding boxes around objects by using point clouds obtained from a LIDAR sensor. Finally, we will learn about recognizing actions from a video. By the end of this chapter, you will have learned about the following:
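As a hedged sketch of Detectron2 inference (the config name follows the public Detectron2 model zoo and may change between releases; the input image path is hypothetical):

```python
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5   # confidence threshold

predictor = DefaultPredictor(cfg)
outputs = predictor(cv2.imread("image.jpg"))  # hypothetical input image
print(outputs["instances"].pred_classes)      # detected class indices
```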

Chapter 11, Autoencoders and Image Manipulation

In this chapter, we will learn about representing an image in a lower dimension using autoencoders and then leveraging the lower-dimensional representation of an image to generate new images by using variational autoencoders. Learning how to represent images in a lower number of dimensions helps us manipulate (modify) the images to a considerable degree. We will also learn about generating novel images that are based on the content and style of two different images. We will then explore how to modify images in such a way that the image is visually unaltered; however, the class corresponding to the image is changed from one to another. Finally, we will learn about generating deepfakes: given a source image of person A, we generate a target image of person B with a similar facial expression as that of person A. Overall, we will go through the following topics in this chapter:
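For a concrete starting point, here is a minimal illustrative autoencoder (not the book's architecture) that squeezes 28x28 images through a low-dimensional bottleneck and reconstructs them:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(),
            nn.Linear(128, latent_dim))              # lower-dimensional code
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 28 * 28), nn.Sigmoid())

    def forward(self, x):
        z = self.encoder(x)                          # compress
        return self.decoder(z).view(-1, 1, 28, 28)   # reconstruct

recon = AutoEncoder()(torch.rand(8, 1, 28, 28))
```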

Chapter 12, Image Generation Using GANs

In this chapter, we will start by learning about the idea behind what makes GANs work, before building one from scratch. GANs are a vast field that is expanding as we write this book. This chapter will lay the foundation of GANs by covering three variants; we will learn about more advanced GANs and their applications in the next chapter. In this chapter, we will explore the following topics:
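To make the adversarial idea concrete, here is an illustrative single training step (our simplification, with small MLPs standing in for real generator/discriminator architectures): the discriminator learns to separate real from fake, and the generator learns to fool it:

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.rand(32, 784)                     # stand-in for real images
z = torch.randn(32, 64)                        # random latent vectors

opt_d.zero_grad()                              # 1) discriminator step
d_loss = bce(D(real), torch.ones(32, 1)) + \
         bce(D(G(z).detach()), torch.zeros(32, 1))
d_loss.backward(); opt_d.step()

opt_g.zero_grad()                              # 2) generator step
g_loss = bce(D(G(z)), torch.ones(32, 1))       # "pretend fakes are real"
g_loss.backward(); opt_g.step()
```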

Chapter 13, Advanced GANs to Manipulate Images

In this chapter, we will learn about leveraging GANs to manipulate images. We will learn about two variations of generating images using GANs – paired and unpaired methods. With the paired method, we provide input and output pair combinations and generate images based on an input image, which we will learn about with the Pix2Pix GAN. With the unpaired method, we specify the input and the output but do not provide one-to-one correspondence between them; instead, we expect the GAN to learn the structure of the two classes and convert an image from one class to another, which we will learn about when we discuss CycleGAN. Another class of unpaired image manipulation involves generating images from a latent space of random vectors and seeing how the images change as the latent vector values change, which we will cover in the Leveraging StyleGAN on custom images section. Finally, we will learn about leveraging a pre-trained GAN – the Super-Resolution Generative Adversarial Network (SRGAN) – which can turn a low-resolution image into a high-resolution one. Specifically, we will learn about the following topics:
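As a concrete illustration of the cycle-consistency signal behind unpaired translation (with single-layer placeholders standing in for real generators): translating A to B and back to A should reconstruct the original image:

```python
import torch
import torch.nn as nn

G_AB = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))  # placeholder generator A->B
G_BA = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))  # placeholder generator B->A
l1 = nn.L1Loss()

real_A = torch.rand(1, 3, 128, 128)            # unpaired image from domain A
fake_B = G_AB(real_A)                          # translate A -> B
cycled_A = G_BA(fake_B)                        # translate back B -> A
cycle_loss = l1(cycled_A, real_A)              # enforce reconstruction
```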

Chapter 14, Combining Computer Vision and Reinforcement Learning

In this chapter, we will learn how to combine reinforcement learning-based techniques (primarily, deep Q-learning) with computer vision-based techniques. This is especially useful in scenarios where the learning environment is complex and we cannot gather data for all the cases. In such scenarios, we want the model to learn by itself in a simulated environment that resembles reality as closely as possible. Such models come in handy for self-driving cars, robotics, bots in games (real as well as digital), and the field of self-supervised learning in general. We will start by learning about the basics of reinforcement learning and the terminology associated with calculating the value (Q-value) of taking an action in a given state. Then, we will learn about filling a Q-table, which helps to identify the value associated with various actions in a given state. We will also learn about identifying the Q-values of various actions in scenarios where building a Q-table is infeasible due to a high number of possible states; we'll do this using a Deep Q-Network (DQN). This is where we will understand how to leverage neural networks in combination with reinforcement learning. Then, we will learn about scenarios where the DQN model by itself does not work and address them by using a DQN alongside the fixed targets model. Here, we will play the video game Pong by leveraging a CNN in conjunction with reinforcement learning. Finally, we will leverage what we've learned to build an agent that can drive a car autonomously in a simulated environment – CARLA. In summary, in this chapter, we will cover the following topics:
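To make the Q-value update concrete, here is a minimal illustrative Bellman update on a Q-table (the state/action indices and hyperparameters are placeholders):

```python
import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))            # the Q-table
alpha, gamma = 0.1, 0.99                       # learning rate, discount factor

state, action, reward, next_state = 0, 2, 1.0, 5   # one observed transition
Q[state, action] += alpha * (
    reward + gamma * Q[next_state].max() - Q[state, action])
```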

Chapter 15, Combining Computer Vision and NLP Techniques

In this chapter, we will switch gears and learn about how a convolutional neural network (CNN) can be used in conjunction with algorithms in the broad family of transformers, which (as of the time of writing this book) are heavily used in natural language processing (NLP), to develop solutions that leverage both computer vision and NLP. To understand how to combine CNNs and transformers, we will first learn about how vision transformers (ViTs) work and how they help in performing image classification. After that, we will learn about leveraging transformers to perform transcription of handwritten images using Transformer-based optical character recognition (TrOCR). Next, we will learn about combining transformers and OCR to perform question answering on document images using a technique named LayoutLM. Finally, we will learn about performing visual question answering using a transformer architecture named Bootstrapping Language-Image Pre-training (BLIP2). By the end of this chapter, you will have learned about the following topics:
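As an illustrative sketch of ViT-based image classification using the Hugging Face transformers library (the checkpoint name is one public example; the image path is hypothetical):

```python
import torch
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

image = Image.open("cat.jpg")                  # hypothetical input image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # class scores over ImageNet labels
print(model.config.id2label[logits.argmax(-1).item()])
```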

Chapter 16, Foundation Models in Computer Vision

In this chapter, we will learn about foundation models and how to utilize their capabilities for tasks such as zero-shot object detection and image segmentation. In particular, we will learn about:

Chapter 17, Applications of Stable Diffusion

In this chapter, we will learn about the diffusion model training process and code some of the applications of Stable Diffusion. In particular, we will cover the following topics:
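As an illustrative sketch of text-to-image generation with the diffusers library (the checkpoint is one commonly used public example; a CUDA GPU is assumed):

```python
import torch
from diffusers import StableDiffusionPipeline

# load a pre-trained Stable Diffusion pipeline in half precision
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
image = pipe("a photograph of an astronaut riding a horse").images[0]
image.save("astronaut.png")
```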

Chapter 18, Moving a Model to Production

In this chapter, we will deploy a simple application, progressively improve its latency while modifying the model's parameters/architecture, and build a mechanism to identify input drift. The following topics will be covered in this chapter:
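As an illustrative sketch of one deployment pattern (a FastAPI endpoint wrapping a torchvision model; the route and names are ours, not the book's exact app):

```python
import io
import torch
from fastapi import FastAPI, File, UploadFile
from PIL import Image
from torchvision import models, transforms

app = FastAPI()
model = models.resnet18(weights="IMAGENET1K_V1").eval()  # placeholder model
preprocess = transforms.Compose([
    transforms.Resize((224, 224)), transforms.ToTensor()])

@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    image = Image.open(io.BytesIO(await file.read())).convert("RGB")
    with torch.no_grad():
        logits = model(preprocess(image).unsqueeze(0))
    return {"class_id": int(logits.argmax(1))}           # predicted class index
```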

If you feel this book is for you, get your copy today!

Software and Hardware List

With the following software and hardware, you can run all the code files present in the book.

| Chapter | Software required | OS required |
| --- | --- | --- |
| 1-18 | Minimum 8 GB RAM, Intel i5 processor or better | Windows, Mac OS X, and Linux (any) |
| | NVIDIA 8+ GB graphics card – GTX 1070 or better | |
| | Minimum 50 Mbps internet speed | |
| | Python 3.6 and above | |
| | PyTorch 1.7 | |
| | Google Colab (can run in any browser) | |

Know more on the Discord server

You can get more engaged and follow the latest updates and discussions in the community on the Discord server: Discord

Download a free PDF

If you have already purchased a print or Kindle version of this book, you can get a DRM-free PDF version at no cost. Simply click on the link to claim your free PDF: Free-Ebook

We also provide a PDF file with color images of the screenshots/diagrams used in this book: GraphicBundle

Get to Know the Authors

V Kishore Ayyadevara leads a team focused on using AI to solve problems in the healthcare space. He has more than 10 years' experience in the field of data science with prominent technology companies. In his current role, he is responsible for developing a variety of cutting-edge analytical solutions that have an impact at scale while building strong technical teams. Kishore has filed 8 patents at the intersection of machine learning, healthcare, and operations. Prior to this book, he authored four books in the fields of machine learning and deep learning. Kishore got his MBA from IIM Calcutta and his engineering degree from Osmania University.

Yeshwanth Reddy is a senior data scientist with a strong focus on the research and implementation of cutting-edge technologies to solve problems in the health and computer vision domains. He has filed four patents in the field of OCR. He also has two years of teaching experience, during which he delivered sessions to thousands of students in the fields of statistics, machine learning, AI, and natural language processing. He completed his MTech and BTech at IIT Madras.

Other Related Books