jeonga0303 opened this issue 5 months ago
How to convert Nc1 (the pretrained head's number of classes)?
I tried changing the configuration files in the following order. Training is in progress, but the dataset is large, so I will share the results later.
Download the pretrained `.pth` file (`internimage_b_1k_224.pth`).
`config.py`:

```python
_C.DATA.IMG_SIZE = 224
_C.MODEL.PRETRAINED = 'internimage_b_1k_224.pth'
_C.MODEL.NUM_CLASSES = 4
```
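As a side note, the `_C` keys above come from a yacs-style config (as used in Swin-style repos). The sketch below only illustrates how the same overrides live in a `CfgNode` and how a `--cfg` YAML or command-line options get merged over the defaults; the node layout here is an assumption for illustration, not copied from the repository.

```python
# Illustrative only: a yacs CfgNode with the overrides above (layout assumed).
from yacs.config import CfgNode as CN

_C = CN()
_C.DATA = CN()
_C.DATA.IMG_SIZE = 224
_C.MODEL = CN()
_C.MODEL.PRETRAINED = 'internimage_b_1k_224.pth'
_C.MODEL.NUM_CLASSES = 4  # custom dataset with 4 classes

# A --cfg YAML / command-line options are merged over these defaults, e.g.:
cfg = _C.clone()
cfg.merge_from_list(['MODEL.NUM_CLASSES', 4, 'DATA.IMG_SIZE', 224])
cfg.freeze()
print(cfg.MODEL.NUM_CLASSES)  # -> 4
```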
Handling the head-size mismatch when loading the pretrained weights:

```python
if 'head.bias' in state_dict:
    head_bias_pretrained = state_dict['head.bias']
    Nc1 = head_bias_pretrained.shape[0]   # number of classes in the checkpoint
    Nc2 = model.head.bias.shape[0]        # number of classes in the current model
    logger.info(f'{Nc1}, {Nc2}')
    if Nc1 != Nc2:
        # class counts differ: zero-init the model's head and drop the pretrained head weights
        model.head.weight = torch.nn.Parameter(torch.zeros_like(model.head.weight))
        model.head.bias = torch.nn.Parameter(torch.zeros_like(model.head.bias))
        state_dict.pop('head.weight', None)
        state_dict.pop('head.bias', None)
```
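One caveat about the snippet above (an assumption about ordering, not something confirmed from logs): if this code runs after the optimizer and the DDP wrapper have already been built, assigning brand-new `torch.nn.Parameter` objects to `model.head` leaves the optimizer pointing at the old tensors, so the new head is never updated. A minimal sketch of an in-place alternative follows; `state_dict`, `model`, and `logger` are assumed to come from the surrounding pretrained-weight loading function.

```python
import torch

# Sketch only: re-initialize the existing head Parameters in place instead of
# replacing them (assumes state_dict, model, logger from the surrounding code).
if 'head.bias' in state_dict:
    Nc1 = state_dict['head.bias'].shape[0]   # classes in the checkpoint
    Nc2 = model.head.bias.shape[0]           # classes in the current model (e.g. 4)
    logger.info(f'{Nc1}, {Nc2}')
    if Nc1 != Nc2:
        # keep the same Parameter objects; only overwrite their values
        torch.nn.init.constant_(model.head.weight, 0.0)
        torch.nn.init.constant_(model.head.bias, 0.0)
        # drop the mismatched head so load_state_dict(..., strict=False) skips it
        state_dict.pop('head.weight', None)
        state_dict.pop('head.bias', None)
```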
The sampler's `__iter__` (customized):

```python
def __iter__(self):
    # deterministically shuffle based on epoch
    g = torch.Generator()
    g.manual_seed(self.epoch)
    t = torch.Generator()
    t.manual_seed(0)

    # base permutation from a fixed seed, then keep only the indices assigned to this part
    indices = torch.randperm(len(self.dataset), generator=t).tolist()
    indices = [i for i in indices if i % self.num_parts == self.rank]

    # add extra samples to make it evenly divisible
    while len(indices) < self.total_size_parts:
        indices += indices[:(self.total_size_parts - len(indices))]
    indices = indices[:self.total_size_parts]
    assert len(indices) == self.total_size_parts, \
        f'Length of indices ({len(indices)}) does not match total_size_parts ({self.total_size_parts})'

    # subsample for this rank, then shuffle with the epoch-seeded generator
    indices = indices[self.rank // self.num_parts:self.total_size_parts:self.num_replicas // self.num_parts]
    index = torch.randperm(len(indices), generator=g).tolist()
    indices = list(np.array(indices)[index])
    assert len(indices) == self.num_samples, \
        f'Length of indices ({len(indices)}) does not match num_samples ({self.num_samples})'

    return iter(indices)
```
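One property of the epoch-seeded generator `g` above: unless the training loop calls `set_epoch()` before each epoch, `self.epoch` never changes and every epoch iterates the data in the same order. The toy below illustrates that contract using PyTorch's built-in `DistributedSampler` as a stand-in (not the repository's sampler).

```python
import torch
from torch.utils.data import TensorDataset, DistributedSampler

# Toy illustration: the shuffle only changes when set_epoch() is called.
dataset = TensorDataset(torch.arange(16))
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True)

for epoch in range(3):
    sampler.set_epoch(epoch)           # updates the epoch used to seed the shuffle
    print(epoch, list(iter(sampler)))  # a different index order each epoch
```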
```bash
python -m torch.distributed.launch --nproc_per_node 2 --master_port 12345 main.py --cfg configs/without_lr_decay/internimage_b_1k_224_custom.yaml --data-path [data-path] --pretrained internimage_b_1k_224.pth --batch-size 120
```
My GPUs are 2× A100.
```bash
python -m torch.distributed.launch --nproc_per_node 1 --master_port 12345 main.py --cfg configs/without_lr_decay/internimage_b_1k_224_custom.yaml --batch-size 256 --accumulation-steps 4 --pretrained internimage_b_1k_224.pth --data-path [data-path] --local-rank 1 --output work_dirs
```
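For reference, the two launches imply very different effective batch sizes, assuming `--batch-size` is per GPU and `--accumulation-steps` multiplies it (the usual convention in Swin-style training scripts; worth checking, since the learning rate is typically scaled from the total batch size).

```python
# Rough arithmetic sketch; the flag semantics are the assumptions stated above.
def effective_batch(nproc_per_node, batch_size, accumulation_steps=1):
    return nproc_per_node * batch_size * accumulation_steps

print(effective_batch(2, 120))      # first launch:  2 * 120     = 240
print(effective_batch(1, 256, 4))   # second launch: 1 * 256 * 4 = 1024
```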
2024.06.11: the training script runs successfully (image classification fine-tuning).
[bug]
Training does not seem to make progress: the loss stays the same at every step. May I know the reason?
I customized `config.py` as shown above. How should I fine-tune the classification model?
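One thing worth ruling out (a hypothesis, not a confirmed diagnosis): if the head-replacement snippet above runs after the optimizer has been built, the zero-initialized head is never part of the optimizer and the loss stays flat at roughly ln(num_classes). The toy below reproduces that pattern with a plain `nn.Linear` standing in for `model.head`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 4)                                   # stand-in for model.head
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # built *before* the swap

# Mimic the head replacement from the snippet earlier in this issue.
model.weight = nn.Parameter(torch.zeros_like(model.weight))
model.bias = nn.Parameter(torch.zeros_like(model.bias))

x, y = torch.randn(64, 8), torch.randint(0, 4, (64,))
for step in range(5):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()                                      # updates only the old tensors
    print(f'step {step}: loss = {loss.item():.4f}')       # stays at ~1.3863 == ln(4)
```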