VamosC / CLIP4STR

An implementation of "CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model".
Apache License 2.0

During training, are the images resized to 224*224 or to 128*32? #5

Closed Echhoo closed 9 months ago

Echhoo commented 10 months ago

Is the patch_size 16*16?

VamosC commented 10 months ago

model:
  img_size: [ 224, 224 ]    # [ height, width ]
  patch_size: [ 16, 16 ]    # [ height, width ]

The image size is 224x224 and the patch size is 16x16.
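
For intuition, a quick sanity check of how many patch tokens these settings produce (my own sketch, not code from the repo):

    img_h, img_w = 224, 224
    patch_h, patch_w = 16, 16
    num_patches = (img_h // patch_h) * (img_w // patch_w)
    print(num_patches)  # 14 * 14 = 196 patch tokens; CLIP's ViT additionally prepends a class token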

Echhoo commented 10 months ago

I would also like to ask: in your CLIP model, what is the purpose of scale = width ** -0.5? Why introduce this scale, and why is it width ** -0.5? Many thanks.

VamosC commented 10 months ago

https://github.com/openai/CLIP/blob/a1d071733d7111c9c014f024669f959182114e33/clip/model.py#L215

This is just one way to initialize the embedding values of a Transformer model. I have not looked into exactly why it is done this way.
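
For context, a simplified paraphrase of the initialization pattern at that line (a sketch in the spirit of CLIP's VisionTransformer.__init__, not copied verbatim):

    import torch
    import torch.nn as nn

    width = 768                    # transformer embedding dimension (ViT-B/16)
    scale = width ** -0.5          # 1 / sqrt(width)
    # Scaling the random init by 1/sqrt(width) keeps the initial embedding norms small
    # and roughly independent of model width, similar in spirit to Xavier-style init.
    class_embedding = nn.Parameter(scale * torch.randn(width))
    positional_embedding = nn.Parameter(scale * torch.randn(197, width))  # 196 patches + 1 class token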

Echhoo commented 10 months ago

if self.refine_iters:
    # For iterative refinement, we always use a 'cloze' mask.
    # We can derive it from the AR forward mask by unmasking the token context to the right.
    query_mask[torch.triu(torch.ones(num_steps, num_steps, dtype=torch.bool, device=self._device), 2)] = 0
    bos = torch.full((bs, 1), self.bos_id, dtype=torch.long, device=self._device)
    for i in range(self.refine_iters):
        print("_______________refine_iter______________", self.refine_iters)
        # Prior context is the previous output.
        tgt_in = torch.cat([bos, logits[:, :-1].argmax(-1)], dim=1)  # argmax over the character probabilities, excluding the last (EOS) position
        tgt_padding_mask = ((tgt_in == self.eos_id).cumsum(-1) > 0)  # mask tokens beyond the first EOS token.
        logits, visual_vec = self.visual_decode(tgt_in, memory,
                                                tgt_query=vis_pos_queries, tgt_query_mask=query_mask[:, :tgt_in.shape[1]],
                                                content_mask=content_mask, tgt_padding_mask=tgt_padding_mask,)
        if self.use_language_model:
            tgt_in = torch.cat([bos, cross_logits[:, :-1].argmax(-1)], dim=1)
            tgt_padding_mask = ((tgt_in == self.eos_id).cumsum(-1) > 0)
            cross_logits, cross_vec = self.cross_decode(logits, tgt_in, memory,
                                                        tgt_query=crs_pos_queries, tgt_query_mask=query_mask[:, :tgt_in.shape[1]],
                                                        content_mask=content_mask, tgt_padding_mask=tgt_padding_mask,)

In system.py, is this part of forward simulating the random masks used in PARSeq? If not, what is its purpose? Thanks.

VamosC commented 10 months ago

It is not a random mask; it is iterative refinement: the previous step's prediction tgt_in is fed back as context and decoded one or more additional times to improve prediction accuracy.
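
The idea, as a runnable toy sketch (illustrative only, with a stand-in decoder; not the repo's code):

    import torch

    # Stand-in for the real cross-attention decoder conditioned on image features.
    def toy_decoder(prev_tokens, memory, vocab_size=10):
        bs, seq_len = prev_tokens.shape
        return torch.randn(bs, seq_len, vocab_size)

    memory = torch.randn(2, 196, 512)        # image features (batch, patches, dim)
    tokens = torch.randint(0, 10, (2, 25))   # first-pass autoregressive prediction

    refine_iters = 1
    for _ in range(refine_iters):
        # the previous prediction is used as context and re-decoded in parallel,
        # under a cloze mask (each position attends to all other positions)
        logits = toy_decoder(tokens, memory)
        tokens = logits.argmax(-1)           # refined prediction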

Echhoo commented 10 months ago

Got it, thank you. A follow-up question: in training_step there is this snippet:

# After the second iteration (i.e. done with canonical and reverse orderings),
# remove the [EOS] tokens for the succeeding perms
if i == 1:
    tgt_out = torch.where(tgt_out == self.eos_id, self.pad_id, tgt_out)
    n = (tgt_out != self.pad_id).sum().item()

Why are the EOS labels removed only once the permutation loop reaches i == 1? What is the intent here?
Also, when printing the model outputs I found that the forward function is only called from the dataloader side; during the training epochs only the steps in training_step are executed. So in the VL4STR class, what is the relationship between training_step and forward?
Many thanks for your answers.
VamosC commented 10 months ago

Good question. I have not thought carefully about this one either.

You can take a look at PARSeq and CLIP4STR: https://arxiv.org/pdf/2207.06966.pdf Table 1 and https://arxiv.org/pdf/2305.14014.pdf Table 1. The output of the [E] token always depends on all of the preceding tokens, so I think its loss is only counted twice (for the canonical and reverse orderings) to keep this repeated term from diluting the contribution of the other losses. That is the only reason I can think of.
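
To see the effect concretely, here is a runnable toy illustration (my own sketch, not repo code) of how relabelling [EOS] as padding after the first two permutations changes what the loss counts:

    import torch

    eos_id, pad_id = 0, 99
    tgt_out = torch.tensor([[5, 3, 7, eos_id, pad_id]])   # one label: 3 characters + [EOS] + padding

    n = (tgt_out != pad_id).sum().item()
    print(n)  # 4 -> for perms i = 0 and i = 1, the [EOS] position is supervised

    # after the second permutation, [EOS] is relabelled as padding and ignored by the loss
    tgt_out = torch.where(tgt_out == eos_id, pad_id, tgt_out)
    n = (tgt_out != pad_id).sum().item()
    print(n)  # 3 -> later perms only supervise the character positions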

The forward function is called at inference time; see the base class https://github.com/VamosC/CLIP4STR/blob/main/strhub/models/base.py. training_step covers many different masks, whereas forward only uses the autoregressive mask plus the cloze mask for iterative refinement. The training and forward procedures are essentially consistent.
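
As a structural illustration of the training_step / forward split in a LightningModule (a generic toy, not this repo's code): Trainer.fit() runs training_step, while calling the model directly (model(x)) dispatches to forward.

    import torch.nn as nn
    import pytorch_lightning as pl

    class ToySTR(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = nn.Linear(8, 4)

        def forward(self, x):                       # inference path (AR decoding + refinement in VL4STR)
            return self.layer(x)

        def training_step(self, batch, batch_idx):  # training path (the permutation masks in VL4STR)
            x, y = batch
            logits = self.layer(x)                  # training uses its own decoding logic, not forward()
            return nn.functional.cross_entropy(logits, y)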

Echhoo commented 10 months ago

Thank you very much for your reply!

VamosC commented 9 months ago

Close due to inactivity. Feel free to re-open.