cj-mills / christianjmills

My personal blog
https://christianjmills.com/

posts/pytorch-train-image-classifier-timm-hf-tutorial/ #39

Open utterances-bot opened 1 year ago

utterances-bot commented 1 year ago

Christian Mills - Hands-On Hand Gesture Recognition: Fine-Tuning Image Classifiers with PyTorch and the timm library for Beginners

Learn how to fine-tune image classification models with PyTorch and the timm library by creating a hand gesture recognizer in this easy-to-follow guide for beginners.

https://christianjmills.com/posts/pytorch-train-image-classifier-timm-hf-tutorial/

botanikus commented 1 year ago

Hey Christian! Fantastic tutorial and very insightful. I'm eager to try it out. Quick question: For integrating the checkpoint with my timm model setup, would I just reference the checkpoint file path and set pretrained=False? Will there be an update regarding this: "I’ll cover how to load the model checkpoint we saved earlier and use it for inference in a future tutorial."

cj-mills commented 1 year ago

Hi @botanikus,

Sorry, I got sidetracked with other projects before making the inference tutorial. I added a notebook to this tutorial's GitHub repository showing how to perform inference:
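In the meantime, the core pattern is: recreate the model architecture (e.g. with `timm.create_model(..., pretrained=False)`), then load the saved state dict. A minimal sketch of that round trip, using a toy `nn.Linear` in place of the timm model (the file path and layer sizes are placeholders):

```python
import os
import tempfile

import torch
from torch import nn

# Toy stand-in for the fine-tuned model; with timm you would instead use
# timm.create_model('resnet18d', pretrained=False, num_classes=num_classes).
model = nn.Linear(4, 2)

# Save the trained weights (path is a placeholder).
ckpt_path = os.path.join(tempfile.mkdtemp(), "checkpoint.pth")
torch.save(model.state_dict(), ckpt_path)

# Later: rebuild the same architecture, load the weights, switch to eval mode.
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(ckpt_path, map_location="cpu"))
restored.eval()

# The restored model produces identical outputs to the original.
x = torch.randn(1, 4)
with torch.no_grad():
    print(torch.allclose(model(x), restored(x)))
```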

Andrew-music commented 9 months ago

Hi! This is a fantastic tutorial! It's one of the best tutorials I've seen in years. I really like how you walk through every step, and check the results each time to be sure everything worked as intended. Just great! Your coding style is super clean and clear, and perfect for a tutorial.

I'm on a Mac Studio, so I'm using the version of PyTorch built for the Metal version of GPU acceleration for M1 chips, rather than CUDA. Walking through your tutorial step by step, all is well until I start train_loop(). Then I get an AssertionError: Torch not compiled with CUDA enabled. This happens inside of torch - somewhere along the line it got to thinking I had CUDA. Here's the error message (Jupyter notebook didn't let me copy the text, so I've typed it by hand - I hope I didn't make any typos!).

File ~/anaconda3/envs/torch/lib/python3.11/site-packages/torch/cuda/__init__.py:289, in _lazy_init()
    284      raise RuntimeError(
    285         "Cannot re-initialize CUDA in forked subprocess. To use CUDA with "
    286        "multiprocessing, you must use the 'spawn' start method"
    287      )
    288 if not hasattr(torch._C, "_cuda_getDeviceCount"):
--> 289     raise AssertionError("Torch not compiled with CUDA enabled")

Can you help me get this sorted? Thank you!

Andrew-music commented 9 months ago

PS Torch is running fine for everything else I've been asking of it, including some training.

cj-mills commented 9 months ago

Hi @Andrew-music,

Sorry about that. I don't have a Mac to verify everything works on that platform. Try making the following changes to the code for initializing the DataLoaders.

# Set the number of worker processes for loading data. This should be the number of CPUs available.
num_workers = multiprocessing.cpu_count()

# Define parameters for DataLoader
data_loader_params = {
    'batch_size': bs,  # Batch size for data loading
    'num_workers': num_workers,  # Number of subprocesses to use for data loading
    # 'persistent_workers': True,  
    # 'pin_memory': True,  
    # 'pin_memory_device': device,  
}

# Create DataLoader for training data. Data is shuffled for every epoch.
train_dataloader = DataLoader(train_dataset, **data_loader_params, shuffle=True)

# Create DataLoader for validation data. Shuffling is not necessary for validation data.
valid_dataloader = DataLoader(valid_dataset, **data_loader_params)

# Print the number of batches in the training and validation DataLoaders
print(f'Number of batches in train DataLoader: {len(train_dataloader)}')
print(f'Number of batches in validation DataLoader: {len(valid_dataloader)}')

If that resolves the issue on your Mac, I'll add a section in the tutorial.
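For machines without CUDA, one device-agnostic option (an assumption, not code from the original tutorial) is to pick the backend at runtime instead of hard-coding it:

```python
import torch

# Prefer CUDA, then Apple's Metal backend (MPS), then fall back to CPU.
# torch.backends.mps requires PyTorch 1.12 or later.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

print(device)
```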

Andrew-music commented 9 months ago

Thank you for such a quick and helpful reply! It works now - here's what I did.

I replaced the code with your suggestion, and it gave me this error (again, hand-transcribed):

AttributeError: Can't get attribute 'ImageDataset' on <module '__main__' (built-in)>

RuntimeError: DataLoader worker (pid 32060) exited unexpectedly with exit code 1. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace.

So I set num_workers=0, reset the kernel, and ran everything again. This time the error was:

File ~/anaconda3/envs/torch/lib/python3.11/site-packages/torch/amp/autocast_mode.py:241, in autocast.__init__(self, device_type, dtype, enabled, cache_enabled)
    239     self.fast_dtype = self.custom_device_mod.get_autocast_dtype()
    240 else:
--> 241     raise RuntimeError(
    242         f"User specified an unsupported autocast device_type '{self.device}'"

I guessed that the problem might then be autocast, so I took out the with autocast(device) from run_epoch() and tried again. This seemed to work, so I kept this change and set num_workers back to the number of CPUs.

But that failed, giving me the DataLoader worker (pid32260) exited unexpectedly... error I got before.

So I set num_workers=0 again, and it started training.

So it seems that I needed both changes, in addition to your new code: set num_workers=0, and comment out the with autocast directive. I got roughly the same accuracy/loss numbers you got, but it runs about 6 times slower than your example, taking about 22 minutes/epoch (it's using only 1 of the 20 available cores). I'm grateful that this runs on the Mac at all, but ignoring all that horsepower is unfortunate.
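The autocast workaround described above can also be made conditional rather than removed entirely. A sketch (assuming the tutorial's device string; this is a generic pattern, not the tutorial's code): use autocast only on backends that support it and substitute a no-op context manager elsewhere.

```python
import torch
from contextlib import nullcontext

device = "cpu"  # would be "cuda" or "mps" on other machines

# Use autocast only where supported; otherwise fall back to a no-op context.
amp_context = torch.autocast(device_type="cuda") if device == "cuda" else nullcontext()

with amp_context:
    logits = torch.ones(2, 3) @ torch.ones(3, 2)

print(logits.shape)
```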

Everything from that point to end of your tutorial worked fine.

Out of curiosity, why do you re-train the entire network rather than freezing all but the last few layers, or even freezing the whole thing and training a few new layers near the end?

cj-mills commented 9 months ago

@Andrew-music,

I replaced the code with your suggestion, and it gave me this error (again, hand-transcribed):

AttributeError: Can't get attribute 'ImageDataset' on <module '__main__' (built-in)>

RuntimeError: DataLoader worker (pid 32060) exited unexpectedly with exit code 1. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace.

Ah, you might need to use the same approach on the Mac that enables multiprocessing on Windows: macOS also starts DataLoader workers with the 'spawn' method, so classes defined only in a notebook can't be found when the workers re-import __main__.
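A sketch of that workaround (module and class contents are placeholders; the real ImageDataset from the tutorial would go in the file instead): move the Dataset class into a .py file and import it, so spawn-based workers can reconstruct it.

```python
# windows_utils.py (hypothetical file name): keep the Dataset class in an
# importable module so spawn-based DataLoader workers can unpickle it.
from torch.utils.data import Dataset


class ImageDataset(Dataset):
    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]


# In the notebook: from windows_utils import ImageDataset
ds = ImageDataset([0, 1, 2])
print(len(ds))
```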

Out of curiosity, why do you re-train the entire network rather than freezing all but the last few layers, or even freezing the whole thing and training a few new layers near the end?

While making the tutorial, I tested the training performance of a staged unfreezing approach, starting by only training the new model head and gradually unfreezing the pre-trained weights. However, simply unfreezing the entire model resulted in better performance, at least for smaller models like the ResNet18-D used in the tutorial.

While showing the steps for the staged unfreezing approach could have educational value, the tutorial was already pretty long.
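For reference, freezing everything but the head looks roughly like this (a toy nn.Sequential stands in for the timm model; with timm, the head's parameters would come from the model's classifier layer):

```python
import torch
from torch import nn

# Toy model standing in for a pretrained backbone plus a new classifier head.
model = nn.Sequential(
    nn.Linear(8, 8),  # "backbone" (pretend these weights are pretrained)
    nn.ReLU(),
    nn.Linear(8, 2),  # new "head" for the target classes
)

# Freeze everything, then unfreeze only the head.
for p in model.parameters():
    p.requires_grad = False
for p in model[-1].parameters():
    p.requires_grad = True

# Only the head's 8*2 + 2 = 18 parameters remain trainable.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(trainable)
```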