Oringa opened this issue 1 year ago
What dataset did you use for your thermal data? Did you use LLVIP in the paper?
Done?
Where exactly did you add the function `load_and_transform_thermal_data`? I am facing a different issue, though this might help. My error is: `Given groups=1, weight of size [768, 1, 16, 16], expected input[3, 3, 224, 224] to have 1 channels, but got 3 channels instead`
Thanks in advance!
Also, I think there is a typo on line 2: replace `image_paths` with `thermal_paths`.
Hi, I'd like to recommend our work, LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment. We open-source all training and validation code.
LanguageBind can be disassembled into different branches to handle different tasks.
```python
print("Video x Audio: \n", torch.softmax(embeddings['video'] @ embeddings['audio'].T, dim=-1).detach().cpu().numpy())
print("Video x Thermal: \n", torch.softmax(embeddings['video'] @ embeddings['thermal'].T, dim=-1).detach().cpu().numpy())
print("Image x Thermal: \n", torch.softmax(embeddings['image'] @ embeddings['thermal'].T, dim=-1).detach().cpu().numpy())
print("Image x Depth: \n", torch.softmax(embeddings['image'] @ embeddings['depth'].T, dim=-1).detach().cpu().numpy())
```
@LinB203 I have tried your work, but running inference.py multiple times produces inconsistent outputs on each run. I suspect there may be an error somewhere; please verify this issue.
Following issue #14, I created a small example for thermal embedding. While Vision x Text and Thermal x Text work properly, Vision x Thermal does not seem to yield the correct result.
And the results are: