Hello, thank you for sharing this excellent work. Currently, the model computes the similarity between text tokens and image features and selects the top-1 match as the class. If I input a single sentence as the text prompt (similar to Grounding DINO), will the model still work correctly? If so, how should it be modified?
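To make the question concrete, here is a minimal sketch of how I understand the current top-1 classification. It is not the repository's actual code: the function name `classify_top1`, the use of cosine similarity, and the toy feature values are all my own assumptions for illustration.

```python
import numpy as np

def classify_top1(region_feats, text_embeds):
    """Assign each image region the index of its most similar text embedding.

    Hypothetical helper, not taken from the repository.
    """
    # L2-normalize both sides so the dot product is a cosine similarity
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    sim = r @ t.T  # shape: (num_regions, num_texts)
    # Top-1 class index and its similarity score for every region
    return sim.argmax(axis=1), sim.max(axis=1)

# Toy data (made-up values): 3 region features, 2 class-name embeddings
regions = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
texts = np.array([[1.0, 0.0], [0.0, 1.0]])
labels, scores = classify_top1(regions, texts)
```

In this sketch, if only one sentence embedding is passed, the argmax always returns index 0 for every region, so I assume the top-1 step would have to be replaced by something like a score threshold. Is that the right direction?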