Bartzi / see

Code for the AAAI 2018 publication "SEE: Towards Semi-Supervised End-to-End Scene Text Recognition"
GNU General Public License v3.0

dataset for fsns experiment #98

Open runfengxu opened 3 years ago

runfengxu commented 3 years ago

When I convert the image data from TFRecord format to JPG format, I find that each JPG file is actually 4 square images concatenated together. The FileBasedDataset does nothing about that, and I don't see the FSNSLocalizationNet doing separate localization for these 4 images. How should I understand this?

    if self.uses_original_data:
        # handle each individual view as increase in batch size
        batch_size, num_channels, height, width = images.shape
        images = F.reshape(images, (batch_size, num_channels, height, 4, -1))
        images = F.transpose(images, (0, 3, 1, 2, 4))
        images = F.reshape(images, (batch_size * 4, num_channels, height, width // 4))

Does it treat the 4 different images as an additional dimension for the localization?

Bartzi commented 3 years ago

Yes, FSNS is organized in such a way that one sample actually consists of 4 images. The code snippet you refer to handles this case. If the flag uses_original_data is set to True, the incoming batch with a shape of (batch_size, 3, 150, 600) (height 150 pixels and width 600 pixels) is reorganized into a batch with the shape (batch_size * 4, 3, 150, 150), which is (4, 3, 150, 150) for a single sample. We basically convert each image into 4 images and handle them independently. Later, they are fused together again.
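
To make the reshaping concrete, here is a minimal sketch that reproduces the three steps of the snippet with NumPy instead of Chainer's F.reshape and F.transpose. The function names split_views and regroup_views and the num_views parameter are made up for this illustration and are not part of the repository. The regrouping step at the end only shows that the 4 views of a sample stay adjacent in the batch; it is an assumption for illustration and not how the network actually fuses the views.

```python
import numpy as np


def split_views(images, num_views=4):
    """Turn (B, C, H, W) images holding num_views side-by-side views
    into a batch of shape (B * num_views, C, H, W // num_views)."""
    batch_size, num_channels, height, width = images.shape
    # split the width axis into (view index, width of a single view)
    images = images.reshape(batch_size, num_channels, height, num_views, -1)
    # move the view axis next to the batch axis: (B, views, C, H, W // views)
    images = images.transpose(0, 3, 1, 2, 4)
    # merge batch and view axes, so view k of sample i lands at index i * num_views + k
    return images.reshape(batch_size * num_views, num_channels, height, width // num_views)


def regroup_views(views, num_views=4):
    """Hypothetical helper: group the views of each sample again,
    giving shape (B, num_views, C, H, W // num_views)."""
    return views.reshape(-1, num_views, *views.shape[1:])


batch = np.random.rand(2, 3, 150, 600).astype(np.float32)
views = split_views(batch)
print(views.shape)  # (8, 3, 150, 150)

# the second view of the first sample is the slice from column 150 to 300
assert np.array_equal(views[1], batch[0, :, :, 150:300])

grouped = regroup_views(views)
print(grouped.shape)  # (2, 4, 3, 150, 150)
```

Because the view axis is moved directly behind the batch axis before the final reshape, the 4 views of one sample occupy 4 consecutive batch entries, which is what allows them to be regrouped per sample later.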