alexjunholee / EventVLAD

codebase for the 2021 IROS paper "EventVLAD: Visual Place Recognition with Reconstructed Edges from Event Cameras"

How to run complete pipeline? #3

Closed Tobias-Fischer closed 2 years ago

Tobias-Fischer commented 2 years ago

Dear @alexjunholee, many thanks for quickly resolving #1 and #2, that's awesome!

I am still confused about what your VPR pipeline looks like. How do I obtain the NetVLAD features given an event stream? It seems like https://github.com/alexjunholee/EventVLAD/blob/main/create_denoised_samples.py takes intensity images as input, but how do I obtain those given an event stream?

My aim is to have the typical VPR pipeline where I extract the features for the reference traverse, and then compare the query features of a query event stream with the reference features to find the closest match.

Many thanks again, Tobi

alexjunholee commented 2 years ago

Dear @Tobias-Fischer, Thanks for paying attention to our work again!

The provided sample code assumes you've already preprocessed the event stream into consecutive event frames.

These are produced by accumulating events over a temporal window and then filtering by event count. The procedure is as follows:

  1. Count the events in the fixed time interval (5ms).
  2. If the number of events is too small (<5% of total pixels), skip the frame (the car is probably stopped).
  3. If the number of events is too large (>10% of total pixels), shrink the temporal window so that it contains events at 10% of the pixels.
  4. Construct an event frame from the adjusted window.
  5. Repeat until the event stream is empty.
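
Below is a minimal sketch of this accumulation loop. It is not the authors' preprocessing code: it assumes the event stream is given as sorted, parallel NumPy arrays of timestamps and pixel coordinates, and all names are placeholders.

import numpy as np

def accumulate_event_frames(ts, xs, ys, width=512, height=512,
                            base_window=5e-3, min_frac=0.05, max_frac=0.10):
    """Hypothetical event-frame accumulation following the steps above.

    ts: event timestamps in seconds (sorted); xs, ys: pixel coordinates.
    """
    n_pixels = width * height
    frames = []
    i = 0
    while i < len(ts):
        # (1) count the events inside the fixed time window
        j = int(np.searchsorted(ts, ts[i] + base_window))
        n_events = j - i
        if n_events < min_frac * n_pixels:
            # (2) too few events (vehicle probably stopped): skip this frame
            i = j
            continue
        if n_events > max_frac * n_pixels:
            # (3) too many events: shrink the window to ~10% of the pixels
            j = i + int(max_frac * n_pixels)
        # (4) construct the event frame from the adjusted window
        frame = np.zeros((height, width), dtype=np.float32)
        np.add.at(frame, (ys[i:j], xs[i:j]), 1.0)
        frames.append(frame)
        i = j
    # (5) repeat until the event stream is empty
    return frames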

You need to feed three consecutive event frames into the event denoiser we provided, in the same way we create a denoised image in create_denoised_samples.py. The remaining procedure is identical to https://github.com/Nanne/pytorch-NetVlad: you can feed the denoised image into the Imagenet_vgg encoder with the pretrained weights we provided.
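
For orientation, here is a rough, hypothetical sketch of the inference flow just described. `denoiser`, `encoder`, and `netvlad` are placeholders for the modules from this repo and pytorch-NetVlad, and the denoiser input layout (three consecutive frames stacked as channels) is an assumption.

import torch

# frame_prev, frame_curr, frame_next: three consecutive event frames (H, W) as tensors
frames = torch.stack([frame_prev, frame_curr, frame_next]).unsqueeze(0)  # (1, 3, H, W)
with torch.no_grad():
    edges = denoiser(frames)                          # denoised edge image
    edges = edges.repeat(1, 3, 1, 1)                  # duplicate to 3 channels for the VGG encoder (see later in this thread)
    encoding = encoder(edges)                         # Imagenet_vgg feature encoding
    descriptor = netvlad(encoding[:, :, None, None])  # NetVLAD place descriptor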

We could have provided the code for this preprocessing, but (a) the code is old and actually quite disorganized, and (b) it is not appropriate for an event stream, because it precomputes the whole event stream into images.

Thanks to your interest, I've discovered that I had left out some of the preprocessing details, so thanks again!

I hope this helped! Best, Alex

Tobias-Fischer commented 2 years ago

Hi there,

Many thanks for this!

I have written some basic code (super slow and non-optimised, just for testing) following your description (see attached) - could you please have a look at whether that looks okay? build_event_frames_event_vlad.py

If I feed in three subsequent event frames to your create_denoised_samples.py, I get the following result - does that look how you would expect it to look?

[attached image: Figure_1]

Do I understand correctly that I then feed the denoised image (middle column) to your NetVLAD layer? Is the mask used for anything?

Many thanks, Tobi

alexjunholee commented 2 years ago

Hi Tobias,

Thanks for your continued attention to our work! I've checked your results, and I suspect the network is not running properly. The masked, denoised image on the right is the input to the NetVLAD layer, but it is apparently failing in your test case.

I assume the error comes from the different noise characteristics of each event sensor, especially because our model was trained only with an event simulator in this work. In successful cases, the output should look like the example below. [attached image]

Here are some things you may try in order to use our denoising module:

  1. Just ignore the output of the masking layer; it does not improve the denoising result very much but takes a long time to train. You can use the image in the middle column if it produces a viable result.
  2. The 5% and 10% thresholds were chosen for an imaginary event sensor with a resolution of 512*512 pixels. I guess your image resolution is smaller than that, so using larger event counts may be more appropriate for your test case.
  3. Maybe try fine-tuning the denoiser weights? I've updated the loss.py file to contain the loss function I used during training. The noise models of events in day and night will differ, but I guess fine-tuning the model with only RGB edge (day) - event (day) pairs would help.

Thanks! Alex

Tobias-Fischer commented 2 years ago

Hi Alex,

Many thanks for your quick response! May I ask which settings you used for the Brisbane-Event-VPR dataset (346x260 resolution)? This would help a lot. Also, if you have the denoised images (or NetVLAD features) for the Brisbane-Event-VPR dataset somewhere and could share them, this would be highly appreciated! I am trying to replicate Fig 6 of your paper.

Thanks, Tobi

alexjunholee commented 2 years ago

Hi Tobi,

I could not find the old files and weights used for evaluation, but I did find the settings. The temporal window was 66ms, with no upper limit on the number of events, and a frame is skipped if fewer than 1% of pixels have events.
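
Plugged into the hypothetical accumulation sketch from earlier in this thread, those settings would look roughly like this (the function and the ts/xs/ys arrays are the same placeholders as before):

# Brisbane-Event-VPR (346x260): 66 ms window, no upper limit on events,
# and skip frames where fewer than 1% of pixels fire
frames = accumulate_event_frames(ts, xs, ys, width=346, height=260,
                                 base_window=66e-3, min_frac=0.01,
                                 max_frac=float("inf"))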

Fortunately, I still have the bag files of the Brisbane dataset, so I will try regenerating the event image files, fine-tune the denoiser on the Brisbane dataset, and share the results ASAP.

Thanks! Alex

Tobias-Fischer commented 2 years ago

Many thanks for that, this would be highly appreciated!

alexjunholee commented 2 years ago

Hi Tobi,

I've uploaded an additional weight file for the Brisbane dataset. It is based on the Carla weights but fine-tuned with the Morning sequence. You can download it with this link! The masking layer is only trainable in simulation, so it was not trained here. I guess you can just skip the masking part!

With the settings above (66ms / no upper limit / 1% reject), you will see results like the ones below.

Sample in Daytime: [attached image]

Sample in Midnight: [attached image]

Thanks! Alex

Tobias-Fischer commented 2 years ago

Hi Alex,

Many thanks for fine-tuning the weights and providing them! Unfortunately, I am still having trouble obtaining the same results that you do.

I am assuming it has something to do with the removal of hot pixels, or some other technical detail.

For these three subsequent images (attached: 529 650, 529 716, 529 782):

I am obtaining the following result: [attached image: Figure_1]

Would it be possible to share how you extract the event frames from the bag files, or even the denoised images? I can provide you with storage space if that is an issue.

Many thanks, Tobi

Tobias-Fischer commented 2 years ago

Here, for reference, is the output for the same place you used in your daytime sample. Both the noisy image and the denoised image look very different from yours, but I can't figure out why :(.

[attached image: 038 676]

Best, Tobi

alexjunholee commented 2 years ago

Hi Tobi,

[attached image: 114 730462] This is t = 114.730462s of the daytime sequence. Maybe it's just the visualization, but the event images you've provided don't look like they contain many events. With the provided settings, the event image should look like the one above. Therefore I've also uploaded the preprocessing code (which is a slightly hacky Matlab script).

I've also tested the event image in the sample above with the script create_denoised_samples.py: [attached image]

Did this help? To be clear, all the reconstructed edges shown here are produced without applying masks.

Best, Alex

Tobias-Fischer commented 2 years ago

Thanks Alex - that has helped! I appreciate that.

I was able to run the create_denoised_samples.py script and save the edge images. I am now trying to run the NetVLAD feature extraction part. However, the edge images are single channel (i.e. grayscale): https://github.com/alexjunholee/EventVLAD/blob/a92cc827b7b0b98a8c43567c775919f795937933/create_denoised_samples.py#L140 while your EventVLAD.py script expects 3-channel colour images: https://github.com/alexjunholee/EventVLAD/blob/a92cc827b7b0b98a8c43567c775919f795937933/networks/EventVLAD.py#L10 This naturally leads to:

RuntimeError: output with shape [1, 256, 256] doesn't match the broadcast shape [3, 256, 256]

Could you please let me know which transform I need to use to pass the edge images to your EventVLAD layer?

On another note, the edge images also seem fairly dark overall; is that correct? Example below: [attached image: 067 716]

Tobias-Fischer commented 2 years ago

I also noticed that imageSize is set to 224x224 while the denoising code outputs 256x256 images.

alexjunholee commented 2 years ago

Hi Tobi,

In my case the edge images were also dark overall, since the images in the samples above are normalized. Performance did not seem to be affected by this darkish appearance.

And sorry for leaving that part out: you can duplicate the original single-channel image into three channels. This comes from the original VGG structure, which takes three channels as input. The image should also be resized to 224x224. I've used the following line for that: img = cv2.resize(img, dsize=(224,224), interpolation = cv2.INTER_LINEAR)
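
A minimal sketch of that transform, assuming the denoised edge image is stored as a single-channel file (the filename is a placeholder):

import cv2
import numpy as np

# Read the single-channel edge image, resize to the 224x224 VGG input size,
# and duplicate it across three channels as described above.
edge = cv2.imread("denoised_edge.png", cv2.IMREAD_GRAYSCALE)
edge = cv2.resize(edge, dsize=(224, 224), interpolation=cv2.INTER_LINEAR)
img = np.stack([edge, edge, edge], axis=-1)  # (224, 224, 3)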

Thanks again for your efforts to reproduce our results. If you have any further queries, please feel free to ask.

Best, Alex

Tobias-Fischer commented 2 years ago

Thanks Alex for your response! Could you also please share how you normalize/transform the images? Do you use cv2 to read the image? Do you then subtract a mean (if so, which one) and divide by a standard deviation (if so, which one)?

It would be great if you could provide the full code from reading the images to the forward pass - getting some details wrong could result in wrong behaviour which would be hard to spot.

alexjunholee commented 2 years ago

Hi Tobi, the images were not normalized in our experiment. I found that I had implemented a normalization layer with mean = 0.500 and std = 0.250, but did not apply it.

The data loader returns an image as follows:

def __getitem__(self, index):
    # read the edge image with OpenCV (flag 1 = 3-channel colour)
    img = cv2.imread(self.qImages[index], 1)
    img = img.squeeze()
    # resize to the 224x224 input expected by the VGG encoder
    img = cv2.resize(img, dsize=(224, 224), interpolation=cv2.INTER_LINEAR)
    # scale to [0, 1] and convert to a CHW float tensor
    img = torch.from_numpy(img.astype('float')/255.)
    img = img.permute(2, 0, 1)
    return img

The tensor is then passed to the network, just as in the NetVLAD pipeline.

for iteration, data in enumerate(data_loader['test']):
    input = data.to(device).float()
    # VGG encoder produces the image encoding
    image_encoding = model.module.encoder(input)
    # NetVLAD pooling expects a 4-D tensor, hence the two added singleton axes
    vlad_encoding = model.module.pool(image_encoding[:,:,np.newaxis,np.newaxis])

    qFeat[iteration,] = vlad_encoding.detach().squeeze().cpu().numpy()

The other parts (faiss, the prediction matrix, etc.) are as follows and will already be familiar to you.

# Build an L2 index over the database (reference) descriptors
faiss_index = faiss.IndexFlatL2(pool_size)
faiss_index.add(dbFeat)
recall_n = [1,5,10]
faiss.cvar.distance_compute_blas_threshold = 100000
# Retrieve the top-N matches for each query descriptor
_, predictions = faiss_index.search(qFeat, max(recall_n))

# Distances and predictions against the full reference set
distmat, fullpred = faiss_index.search(qFeat, len(fullset))

Tobias-Fischer commented 2 years ago

Thanks again. For your EventVLAD class, there is no encoder/pool added via add_module. So the code above unfortunately does not work. It would be great if you could clarify, or provide a full code snippet that can be run.

Best, Tobi

alexjunholee commented 2 years ago

Hi Tobi,

Thanks for your kind explanation, and sorry I missed that part. You can load and initialize the network with the following code:

# Build the encoder (from networks/EventVLAD.py); opt.pretrained can be None since
# the weights are loaded from the checkpoint below.
encoder = Imagenet_vgg(opt.pretrained)
model = nn.Module()
model.add_module('encoder', encoder)
# encoder_dim is the channel dimension of the encoder output (512 for VGG16 in pytorch-NetVlad)
net_vlad = NetVLAD(num_clusters=opt.num_clusters, dim=encoder_dim, alpha=1.0)
# Initialize the VLAD clusters from precomputed centroids/descriptors
initcache = "your/path/to/centroids.hdf5"
with h5py.File(initcache, mode='r') as h5:
    clsts = h5.get("centroids")[...]
    traindescs = h5.get("descriptors")[...]
    net_vlad._init_params(clsts, traindescs)
    del clsts, traindescs
model.add_module('pool', net_vlad)

if torch.cuda.device_count() > 1:
    model.encoder = nn.DataParallel(model.encoder)
    model.pool = nn.DataParallel(model.pool)

# Load the provided EventVLAD checkpoint
checkpoint = torch.load("your/path/to/checkpoints.pth.tar", map_location=lambda storage, loc: storage)
model.load_state_dict(checkpoint['state_dict'], strict=False)
model = model.to(device)

Best, Alex

Tobias-Fischer commented 2 years ago

Thanks! Just to confirm, the Imagenet_vgg is from your EventVLAD class? What is opt.pretrained? And do I need the initcache part?

alexjunholee commented 2 years ago

Hi Tobi,

Yes, Imagenet_vgg is imported from EventVLAD.py, and since we load the weights via load_state_dict, you can just set opt.pretrained to None. The centroids are created by the get_clusters function as in here, since our implementation is based on the pytorch-NetVlad repo. Thanks!

Best, Alex

Tobias-Fischer commented 2 years ago

Thanks - could you please provide the centroids as well? They are not provided in the other repo.

alexjunholee commented 2 years ago

Hi Tobi,

I have the Carla-trained centroids, but I guess they wouldn't help, as the centroid distributions depend on the dataset and the trained weights. So it would be best to create them with the new fine-tuned VGG16 weights and their training set.

Best, Alex

Tobias-Fischer commented 2 years ago

Ok - I guess for just inference I don't need the centroids though, do I?

alexjunholee commented 2 years ago

For inference, you'll still need to transform the database (match reference) images into clusters; these are then used to extract the residuals for the query VLAD vectors.
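
For context, here is a bare-bones sketch of what the centroids are used for in VLAD-style pooling. It uses hard assignment for simplicity, whereas NetVLAD learns a soft assignment; it is illustrative only, not the repo's implementation.

import numpy as np

def vlad_pool(descriptors, centroids):
    """descriptors: (N, D) local features; centroids: (K, D) cluster centres."""
    K, D = centroids.shape
    vlad = np.zeros((K, D), dtype=np.float32)
    # assign each local descriptor to its nearest centroid
    dists = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = dists.argmin(axis=1)
    # accumulate the residuals (descriptor - centroid) per cluster
    for k in range(K):
        if np.any(assign == k):
            vlad[k] = (descriptors[assign == k] - centroids[k]).sum(axis=0)
    vlad = vlad.reshape(-1)
    return vlad / (np.linalg.norm(vlad) + 1e-12)  # L2-normalised VLAD vector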

Tobias-Fischer commented 2 years ago

Hi Alex,

Unfortunately this still does not work. I get the following error:

  File "/Users/fischert/mambaforge/envs/salient-event-vpr/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Module:
    Unexpected key(s) in state_dict: "pool.lastfc.weight". 

Indeed there is no reference to lastfc in https://github.com/Nanne/pytorch-NetVlad. Could you please provide the full code that you used for the NetVLAD pooling layer?

Tobias-Fischer commented 2 years ago

Hi Alex,

I guess another issue is that the fully connected layers need to be removed from the backbone - something similar to what is done in: https://github.com/Nanne/pytorch-NetVlad/blob/8f7c37ba7a79a499dd0430ce3d3d5df40ea80581/main.py#L392-L393
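
As a point of reference (not necessarily identical to what either repo does), a common way to keep only the convolutional backbone of torchvision's VGG16 and drop the fully connected classifier head is:

import torchvision.models as models

vgg = models.vgg16()     # weights come from a checkpoint later, so no pretrained weights here
backbone = vgg.features  # convolutional layers only; vgg.classifier (the FC head) is not used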

I would really appreciate if you could provide a full working example - a very brief code snippet that loads a handful of images and extracts the features for these images. I don't mind if the code is messy etc. If you do not want to share it here publicly, please drop me an email to tobias.fischer@qut.edu.au

Tobias-Fischer commented 2 years ago

I also just realised that you probably made some changes in https://github.com/Nanne/pytorch-NetVlad/blob/8f7c37ba7a79a499dd0430ce3d3d5df40ea80581/netvlad.py#L8-L12

Their __init__ does not accept an alpha, but your code snippet above passes one. It would be great if you could provide the modified NetVLAD class with a complete working example :)

alexjunholee commented 2 years ago

Hi Tobi,

The alpha parameter is not used in the network; I forgot to remove it from the initialization. And as you've observed, a lastfc layer was added to the model. Sorry for missing these and causing a disruption.

Thanks for your continued effort to reproduce the results. I will follow up with you by email on this issue from here. Thanks!

Tobias-Fischer commented 2 years ago

Thanks Alex, I'm looking forward to your email!

Tobias-Fischer commented 2 years ago

Your email got blocked from my uni account - could you please resend to info@tobiasfischer.info ?

alexjunholee commented 2 years ago

Sure!

Tobias-Fischer commented 2 years ago

Note that https://github.com/alexjunholee/EventVLAD/blob/main/networks/netvlad.py has now been updated, and that the other trick is to use vlad = model.pool(image_encoding[:,:,np.newaxis,np.newaxis]) as opposed to vlad = model.pool(image_encoding).