DachunKai / EvTexture

[ICML 2024] EvTexture: Event-driven Texture Enhancement for Video Super-Resolution
https://dachunkai.github.io/evtexture.github.io/
Apache License 2.0

What shape should data be? #22

Open GlKz13 opened 1 month ago

GlKz13 commented 1 month ago

Hello! Thank you for your model! Could you clarify one more thing for me? The docstring of your forward method says:

"""Forward function of EvTexture

    Args:
        imgs: Input frames with shape (b, n, c, h, w). b is batch size. n is the number of frames, and c equals 3 (RGB channels).
        voxels_f: forward event voxel grids with shape (b, n-1, Bins, h, w). n-1 is intervals between n frames.
        voxels_b: backward event voxel grids with shape (b, n-1, Bins, h, w).

    Output:
        out_l: output frames with shape (b, n, c, 4h, 4w)
    """

Can you explain how I should organize my data in, for example, calendar.h5 to feed the model? In calendar.h5 there are "images" ([H, W]) and "voxels" ([Bins, H, W]). I took 2 images and stacked them (`torch.stack([image1, image2])`), then took the voxels between these 2 images (that is, one forward voxel grid and one backward voxel grid), and finally unsqueezed everything to add the batch dimension (the "b" in the forward function). That gives these shapes: images `[1, 2, 3, H, W]`, voxels `[1, 1, 5, H, W]`. Then I called the model: `forward(images, voxels_f, voxels_b)`.

I did get an upscaled image, but with awful quality, so what did I do wrong? I used the test data published in this repo. I understand that I probably got the shapes wrong or organized the data incorrectly, but how exactly should the h5 files be used with the forward method? I want to know how to call forward manually. Thank you!

GlKz13 commented 1 month ago

Here is my code, by the way:

```python
import h5py as h5
import numpy as np
import torch

# EvTexture model class from this repo (adjust the import to your local setup)
from basicsr.archs.evtexture_arch import EvTexture

with h5.File("preproccessed/events/Vid4_h5/LRx4/test/calendar.h5", "r") as h:
    print("All frames:", len(list(h["images"])))
    print(h.keys())
    print(list(h["voxels_b"].keys()))
    print(list(h["images"]))

    # take 2 consecutive images
    image1 = np.array(h["images"]["000000"])
    image2 = np.array(h["images"]["000001"])

    # take the event voxel grids between them
    vf = np.array(h["voxels_f"]["000000"])
    vb = np.array(h["voxels_b"]["000000"])

device = "cuda"

# HWC -> CHW, stack to get n = 2, then add the batch dimension -> (1, 2, 3, H, W)
image1 = torch.tensor(image1, dtype=torch.float32).permute(2, 0, 1)
image2 = torch.tensor(image2, dtype=torch.float32).permute(2, 0, 1)
images = torch.stack([image1, image2]).unsqueeze(0)

# add the interval (n-1 = 1) and batch dimensions -> (1, 1, Bins, H, W)
vf = torch.tensor(vf, dtype=torch.float32).unsqueeze(0).unsqueeze(0)
vb = torch.tensor(vb, dtype=torch.float32).unsqueeze(0).unsqueeze(0)

model = EvTexture()
model_path = "experiments/pretrained_models/EvTexture_Vimeo90K_BIx4.pth"
weights = torch.load(model_path, map_location=device)
model.load_state_dict(weights["params"])

model = model.to(device)
images = images.to(device)
vf = vf.to(device)
vb = vb.to(device)

model.eval()
with torch.inference_mode():
    res = model(images, vf, vb)

# res shape: (1, 2, 3, 576, 704)
```
DachunKai commented 1 month ago

Thank you for your interesting question about using only two frames as input and obtaining high-resolution output frames. Based on the shapes you've mentioned, they seem correct.

However, I have a question: have you successfully run the test script `./scripts/dist_test.sh [num_gpus] options/test/EvTexture/test_EvTexture_Vid4_BIx4.yml` and obtained the results posted in the release?

I can suggest a simple way for you to quickly test this. You just need to modify the `meta_info_file` referenced in the config file (link), specifically `basicsr/data/meta_info/meta_info_Vid4_h5_test.txt`, replacing its content with `calendar.h5 2`. After that, run the test script with `options/test/EvTexture/test_EvTexture_Vid4_BIx4.yml`, which will test only the first two images of the calendar clip and output the results.
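Concretely, the two steps look roughly like this (single GPU assumed; the `2` is what limits the test to the first two frames):

```bash
# Restrict the Vid4 test to the first two frames of calendar.h5
echo "calendar.h5 2" > basicsr/data/meta_info/meta_info_Vid4_h5_test.txt

# Run the standard test with the Vid4 BIx4 config (num_gpus = 1 here)
./scripts/dist_test.sh 1 options/test/EvTexture/test_EvTexture_Vid4_BIx4.yml
```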

I tested this and received the following results:

For 000000.png, the PSNR is 23.64, and for 000001.png it is approximately 23.60. The PSNR results in our release for the calendar frames 000000/000001 are 25.26/25.40, respectively.

I believe that inference with only two frames leads to a lower PSNR than using the entire video, because our model employs a recurrent structure and two frames provide limited information, resulting in slightly poorer outcomes.
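If you still want to call forward manually, you can build the inputs from the whole clip instead of just two frames. A rough, untested sketch (it assumes the h5 keys sort in frame order and that the full clip fits in GPU memory):

```python
import h5py as h5
import numpy as np
import torch

with h5.File("preproccessed/events/Vid4_h5/LRx4/test/calendar.h5", "r") as f:
    # all n frames, HWC -> CHW, stacked and batched -> (1, n, 3, h, w)
    img_keys = sorted(f["images"].keys())
    images = torch.stack([
        torch.tensor(np.array(f["images"][k]), dtype=torch.float32).permute(2, 0, 1)
        for k in img_keys
    ]).unsqueeze(0)

    # all n-1 voxel grids in both directions -> (1, n-1, Bins, h, w)
    vox_keys = sorted(f["voxels_f"].keys())
    voxels_f = torch.stack([
        torch.tensor(np.array(f["voxels_f"][k]), dtype=torch.float32) for k in vox_keys
    ]).unsqueeze(0)
    voxels_b = torch.stack([
        torch.tensor(np.array(f["voxels_b"][k]), dtype=torch.float32) for k in vox_keys
    ]).unsqueeze(0)

# res = model(images, voxels_f, voxels_b)  # -> (1, n, 3, 4h, 4w)
```

These tensors can then be moved to the GPU and passed to the model exactly as in your two-frame script.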

Hope this helps!

GlKz13 commented 1 month ago

Thank you, I'll try!