NVlabs / RADIO

Official repository for "AM-RADIO: Reduce All Domains Into One"

What is the rtx-translate adaptor? #40

Closed javiabellan closed 4 months ago

javiabellan commented 4 months ago

My question is: what is rtx-translate, and how is it useful?

Steps to reproduce:

import torch

radioModel = torch.hub.load('NVlabs/RADIO', 'radio_model', version='radio_v2', progress=True, adaptor_names=["clip", "openai_clip", 'dino_v2', "sam", "rtx-translate"])

inp = torch.rand(1, 3, 256, 256)
out = radioModel(inp)

for out_name,(summary, features) in out.items():
    print(f"{out_name:<10}\t{summary.shape}\t{features.shape}")

This prints:

backbone    torch.Size([1, 2560])   torch.Size([1, 256, 1280])
clip        torch.Size([1, 1024])   torch.Size([1, 256, 1280])
openai_clip torch.Size([1, 768])    torch.Size([1, 256, 1024])
dino_v2     torch.Size([1, 1536])   torch.Size([1, 256, 1536])
sam         torch.Size([1, 1280])   torch.Size([1, 256, 1280])
rtx-translate   torch.Size([1, 128])    torch.Size([1, 256, 2048])

UPDATE

Looking at the adaptor config, I can see some OCR datasets:

{'type': 'rtx_translate',
'name': 'rtx-translate',
'model': 'quality',
'feature_distillation': True,
'fd_normalize': False,
'fd_loss_fn': 'MSE',
'input_size': 1024,
'use_summary': False,
'fd_ohem': True,
'amp': True,
'data_dir': [
    ['/lustre/fsw/portfolios/llmservice/projects/llmservice_nlp_fm/datasets/ocr/publaynet/webdataset', 0.4], 
    ['/lustre/fsw/portfolios/llmservice/projects/llmservice_nlp_fm/datasets/ocr/staging/arxiv/hocr', 0.4], 
    ['/lustre/fsw/portfolios/llmservice/projects/llmservice_nlp_fm/datasets/ocr/scene-text/scene-text/text_ocr/webdataset', 0.15],
    ['/lustre/fsw/portfolios/llmservice/projects/llmservice_nlp_fm/datasets/ocr/scene-text/scene-text/hiertext/webdataset', 0.05]
],
'batch_size': 2,
'sample_rate': 2,
'summary_loss_weight': 1e-05,
'fd_loss_weight': 0.13,
'vitdet_prob': 0.99,
'vitdet_window_sizes': [8, 16, 16],
'student_resolution': 1024,
'fd_upsample_factor': 4} 

New question: how can I use the rtx_translate features to do OCR?

mranzinger commented 4 months ago

Hello,

So it's an internal OCR model that we have, and RADIO was learning to match intermediate features. We don't have an integration API for it, but the idea is that it helps the backbone to explicitly model text features. We have unpublished results suggesting that indeed, at high resolution (>= 1024), RADIOv2 does capture really strong text features. The way you'd use it is by connecting the backbone to some other OCR system and training at least the non-backbone part of that model. It should be compatible with the usual suspects, such as Faster-RCNN, or even by using a transformer decoder to read out text in a manner similar to Pix2Struct.
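To make the "transformer decoder reading out text" idea concrete, here is a minimal sketch of a Pix2Struct-style readout head on top of frozen RADIO backbone features. Everything here is illustrative: `OCRReadoutHead`, the vocabulary size, and the query length are assumptions, and the random `features` tensor merely stands in for the `(B, num_patches, 1280)` backbone output shown in the question above.

```python
import torch
import torch.nn as nn

class OCRReadoutHead(nn.Module):
    """Hypothetical trainable head: learned character queries cross-attend
    to frozen backbone features and emit per-position character logits."""

    def __init__(self, feat_dim=1280, d_model=512, vocab_size=100, max_len=32):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)       # map backbone dim -> decoder dim
        self.query = nn.Embedding(max_len, d_model)    # one learned query per output char
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, features):
        # features: (B, num_patches, feat_dim) from the frozen backbone
        memory = self.proj(features)
        batch = features.shape[0]
        tgt = self.query.weight.unsqueeze(0).expand(batch, -1, -1)
        dec = self.decoder(tgt, memory)                # queries attend to patch features
        return self.head(dec)                          # (B, max_len, vocab_size)

# Stand-in for RADIO backbone features: out["backbone"][1] in the snippet above
features = torch.rand(1, 256, 1280)
logits = OCRReadoutHead()(features)
print(logits.shape)  # torch.Size([1, 32, 100])
```

Only this head would be trained (e.g. with a CTC or cross-entropy loss over characters); the backbone stays frozen, which is the "train at least the non-backbone part" suggestion above.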

javiabellan commented 4 months ago

Ok, thanks for the response. So the 2048-dim features from the rtx-translate head were only for OCR training purposes, therefore I cannot use that head for inference, right?

mranzinger commented 4 months ago

Yeah, I don't think you'll get very much use out of them. The backbone features should have pretty strong OCR priors though, so if you're looking to do that sort of thing, give that a try.
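The suggested workflow, then, is to freeze the backbone and train only a downstream head on its features. A minimal sketch of that training pattern, where `DummyBackbone` is a hypothetical stand-in for the real RADIO model (which would be loaded via torch.hub as in the question above) and the linear head is purely illustrative:

```python
import torch
import torch.nn as nn

class DummyBackbone(nn.Module):
    """Stand-in for the RADIO backbone: returns (summary, spatial features)
    with RADIO-like shapes. Replace with the torch.hub model in practice."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(1280, 1280)

    def forward(self, x):
        batch = x.shape[0]
        summary = torch.rand(batch, 2560)
        features = self.proj(torch.rand(batch, 256, 1280))
        return summary, features

backbone = DummyBackbone().eval()
for p in backbone.parameters():
    p.requires_grad_(False)            # keep the backbone frozen

head = nn.Linear(1280, 100)            # trainable per-patch readout (illustrative)
opt = torch.optim.AdamW(head.parameters(), lr=1e-4)

inp = torch.rand(1, 3, 1024, 1024)     # resolution >= 1024, per the advice above
with torch.no_grad():
    _, features = backbone(inp)
logits = head(features)
print(logits.shape)  # torch.Size([1, 256, 100])
```

The optimizer only sees the head's parameters, so a backward pass and `opt.step()` would leave the backbone untouched.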