facebookresearch / HierVL

[CVPR 2023] HierVL: Learning Hierarchical Video-Language Embeddings

Missing Files #4

Closed Tejas-Haritsa-vk closed 11 months ago

Tejas-Haritsa-vk commented 12 months ago

Hi, while trying to use the repo I ran into an error about missing ViT pretrained weights, which led me to discover that the entire `pretrained` folder is missing. Kindly update the repo accordingly.

The relevant code is at line 58 in model/model.py:

```python
if arch_config == 'base_patch16_224':
    vit_model = timm.models.vision_transformer.vit_base_patch16_224(pretrained=pretrained)
    vit_model = torch.load("pretrained/jx_vit_base_p16_224-80ecf9dd.pth", map_location="cpu")  # <--- this one
```
thechargedneutron commented 12 months ago

Hi @Tejas-Haritsa-vk ,

The pretrained folder contains all pretrained models from other works. The one you are looking for can be downloaded from here.

Let me know if you need anything more.

Tejas-Haritsa-vk commented 12 months ago

Hi @thechargedneutron, thanks for the file, it's working now. I also noticed that line 89 of model.py has

```python
self.head_ht100m_linear_probe = nn.Sequential(nn.Linear(projection_dim, 100))
```

which is not used in `forward`, but it is called in `_train_epoch` (trainer_howto100m_classification.py, line 134) to produce `video_predictions`, and the loss is computed on these predictions alone rather than on the full model that produces `video_embeddings`. How does this work?

Edit: Is that only meant to act as a non-learnable projection layer?

thechargedneutron commented 12 months ago

Yes, `head_ht100m_linear_probe` is not used anywhere else in the code; it is only for HowTo100M linear probing, and in trainer_howto100m_classification.py it is learnable. Its input is the output of the video backbone. I set only this layer as trainable, but you can fine-tune the whole model if you unfreeze the backbone.
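For anyone reading later, here is a minimal sketch of that linear-probing pattern (the `backbone`, `projection_dim`, and `num_classes` names are placeholders for illustration, not the repo's exact API):

```python
import torch
import torch.nn as nn

projection_dim, num_classes = 256, 100

# Hypothetical frozen video backbone standing in for the real one.
backbone = nn.Linear(512, projection_dim)
for p in backbone.parameters():
    p.requires_grad = False  # freeze: only the probe head will learn

# Analogous to head_ht100m_linear_probe: a single learnable linear layer.
head = nn.Sequential(nn.Linear(projection_dim, num_classes))

optimizer = torch.optim.SGD(head.parameters(), lr=1e-3)  # optimize the head only
criterion = nn.CrossEntropyLoss()

features = torch.randn(4, 512)                 # stand-in for video inputs
labels = torch.randint(0, num_classes, (4,))

with torch.no_grad():                          # frozen backbone forward pass
    embeddings = backbone(features)
logits = head(embeddings)                      # these play the role of video_predictions
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
```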

Tejas-Haritsa-vk commented 11 months ago

Thank you @thechargedneutron, that was really helpful. Also, I'm not sure if I'm doing anything wrong here, but when I try to train on HowTo100M using the provided train script (train_howto100m), it runs on only a single GPU. Can you please help me resolve this?

Edit: when I use world_size > 1 with rank=0, DDP hangs (distributed_howto100m). It gets stuck after printing: `Use GPU: n for training`.

thechargedneutron commented 11 months ago

Are you using the SLURM code provided? I used SLURM to achieve DDP. If you don't have SLURM, you can modify the code accordingly to get the desired DDP setup. The hang happens because the main process is waiting to communicate with the other processes, and it times out after a while (the default is 30 minutes).

One suggestion is to use some simpler ddp code and then try to see how that differs from the code provided in this repo. Something like https://github.com/ShigekiKarita/pytorch-distributed-slurm-example might be helpful if you are starting with SLURM.
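As a rough illustration of the SLURM route (a sketch only, assuming standard SLURM environment variables; this is not code from this repo):

```python
import os
from datetime import timedelta

import torch
import torch.distributed as dist

# Under a SLURM job (e.g. srun --ntasks=8 --gpus-per-task=1 python train.py),
# each task can read its rank and world size from the SLURM environment.
rank = int(os.environ["SLURM_PROCID"])
world_size = int(os.environ["SLURM_NTASKS"])
local_rank = int(os.environ["SLURM_LOCALID"])

# MASTER_ADDR / MASTER_PORT must point at one reachable node; here we assume
# the job script exports them before launching the tasks.
dist.init_process_group(
    backend="nccl",
    init_method="env://",
    rank=rank,
    world_size=world_size,
    timeout=timedelta(minutes=30),  # matches the default timeout mentioned above
)
torch.cuda.set_device(local_rank)
```

If any rank never reaches this call, or the ranks disagree on world_size, the others block at init and eventually time out, which is exactly the hang described above.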

Tejas-Haritsa-vk commented 11 months ago

I'm using the distributed_howto100m.py script provided inside the run folder. I don't remember seeing or running a SLURM script.

P.S. I'm quite new to PyTorch as I usually work with TF, so a somewhat detailed guide would really help me out.

Thanks again for your quick responses.

thechargedneutron commented 11 months ago

Instead of using distributed_howto100m.py, can you try train_howto100m.py? That is better suited for a single-node job. I think the code is modular enough that you can replace train_howto100m.py or distributed_howto100m.py with any other wrapper that works on your system; just replace the dataloaders from the sample code with those used in this repo (see the sketch below). You can ask here if that is not working out.
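To make that concrete, here is a minimal sketch of what swapping a dataset into a DDP training loop looks like (the dataset here is a toy stand-in, not the repo's HowTo100M loader, and it assumes `dist.init_process_group` has already been called for this rank):

```python
import torch
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def build_loader(rank, world_size, batch_size=2):
    # Toy dataset; swap in the repo's HowTo100M dataset class here.
    dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 5, (64,)))
    # DistributedSampler gives each rank a disjoint shard of the data.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)

def train(rank, world_size, model, epochs=2):
    # Assumes dist.init_process_group(...) already ran in this process.
    loader = build_loader(rank, world_size)
    # Wrapping in DDP is what synchronizes gradients across GPUs.
    ddp_model = DDP(model.to(rank), device_ids=[rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)
    for epoch in range(epochs):
        loader.sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            optimizer.zero_grad()
            loss = F.cross_entropy(ddp_model(x.to(rank)), y.to(rank))
            loss.backward()
            optimizer.step()
```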

Tejas-Haritsa-vk commented 11 months ago

I have tried using train_howto100m.py as well, but it also gets stuck for world_size > 1. I also noticed that for batch_size > 1 it throws a CUDA out-of-memory error by a small margin, even on a Tesla V100 (32 GB). Can you please help me out with this one? I'm not sure how to proceed; I have tried making changes as suggested on the PyTorch forums for both issues, but had no luck with either the "nccl" or "gloo" backend.

P.S. I'm not sure if this is important, but I noticed that the SLURM example you shared uses `model = torch.nn.parallel.DistributedDataParallel(model)`, which is not used in train_howto100m.py. Would that have an impact on using all GPUs?

Edit: I have narrowed the issue down to dist.init_process_group(), which hangs with world_size > 1, even when run by itself in a separate script. Can you please help me resolve this? So far I have tried switching the backend to gloo and the init_method to localhost and env://, but none of these have worked.

However, I am able to run this code without any issue:

```python
import os
import tempfile
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
import torch.optim as optim

from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12345'

    # initialize the process group
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))

def demo_basic(rank, world_size):
    print(f"Running basic DDP example on rank {rank}.")
    setup(rank, world_size)

    # create model and move it to GPU with id rank
    model = ToyModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(rank)
    loss_fn(outputs, labels).backward()
    optimizer.step()

    cleanup()

def run_demo(demo_fn, world_size):
    mp.spawn(demo_fn,
             args=(world_size,),
             nprocs=world_size,
             join=True)

def demo_checkpoint(rank, world_size):
    print(f"Running DDP checkpoint example on rank {rank}.")
    setup(rank, world_size)

    model = ToyModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    CHECKPOINT_PATH = tempfile.gettempdir() + "/model.checkpoint"
    if rank == 0:
        # All processes should see same parameters as they all start from same
        # random parameters and gradients are synchronized in backward passes.
        # Therefore, saving it in one process is sufficient.
        torch.save(ddp_model.state_dict(), CHECKPOINT_PATH)

    # Use a barrier() to make sure that process 1 loads the model after process
    # 0 saves it.
    dist.barrier()
    # configure map_location properly
    map_location = {'cuda:%d' % 0: 'cuda:%d' % rank}
    ddp_model.load_state_dict(
        torch.load(CHECKPOINT_PATH, map_location=map_location))

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(rank)
    loss_fn = nn.MSELoss()
    loss_fn(outputs, labels).backward()
    optimizer.step()

    # Use a barrier() to make sure that all processes have finished reading the
    # checkpoint
    dist.barrier()

    if rank == 0:
        os.remove(CHECKPOINT_PATH)

    cleanup()

class ToyMpModel(nn.Module):
    def __init__(self, dev0, dev1):
        super(ToyMpModel, self).__init__()
        self.dev0 = dev0
        self.dev1 = dev1
        self.net1 = torch.nn.Linear(10, 10).to(dev0)
        self.relu = torch.nn.ReLU()
        self.net2 = torch.nn.Linear(10, 5).to(dev1)

    def forward(self, x):
        x = x.to(self.dev0)
        x = self.relu(self.net1(x))
        x = x.to(self.dev1)
        return self.net2(x)

def demo_model_parallel(rank, world_size):
    print(f"Running DDP with model parallel example on rank {rank}.")
    setup(rank, world_size)

    # setup mp_model and devices for this process
    dev0 = rank * 2
    dev1 = rank * 2 + 1
    mp_model = ToyMpModel(dev0, dev1)
    ddp_mp_model = DDP(mp_model)

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_mp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    # outputs will be on dev1
    outputs = ddp_mp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(dev1)
    loss_fn(outputs, labels).backward()
    optimizer.step()

    cleanup()

if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    if n_gpus < 8:
        print(f"Requires at least 8 GPUs to run, but got {n_gpus}.")
    else:
        run_demo(demo_basic, 8)
        run_demo(demo_checkpoint, 8)
        run_demo(demo_model_parallel, 4)
```

Tejas-Haritsa-vk commented 11 months ago

I was able to solve this to an extent by modifying the code from https://pytorch.org/tutorials/intermediate/ddp_tutorial.html#basic-use-case and https://github.com/ShigekiKarita/pytorch-distributed-slurm-example. I will be closing this now.

@thechargedneutron Thank you for all your help.