Closed maskedmeerkat closed 4 years ago
Thank you very much, we are glad you are finding our codebase useful. We are gradually porting internal functionality to the public repository, other dataset loaders are on the list, but I am not sure when they will be added. About your questions:
I hope this helps!
Thanks for the fast reply. Okay, those are some good pointers that clarify all of my questions and give me some things to look into.
I will try it out.
Hi Vitor,
I looked into your suggestions. I think the currently published version cannot handle sparse depth maps. You can see this first of all in the results, where the depth looks okay along some horizontal lines in the image but goes to max range everywhere else (why it goes to max range rather than min range is also strange to me).
I then looked into the code and found that the "sparse" in "sparse-L1" is ignored in this function.
Could you check whether sparse-L1 is really supported in the published version, and if not, could you give me some pointers on how to adapt the code?
Moreover, have you experimented with the average spatial distance between context and target images? I am wondering whether I can use the NuScenes "samples" (their keyframes) as context, which are spatially far apart, or whether I have to use the intermediate sweeps. The sweeps don't come with pose information but are spatially much closer.
Thank you for your efforts.
The "sparse" part is taken into consideration here:
by masking out all the pixels in both predicted and ground-truth depth maps where the ground-truth depth is 0.0, i.e. keeping only pixels with ground-truth depth > 0.0. One thing I should mention is that the supervised loss is actually applied on the inverse depth maps, not on the depth maps. One of our next updates will address that and include the option of doing one or the other, but in the meantime you might want to "invert the inverse depth maps" back and see if that helps.
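The masking described above can be sketched as follows. This is an illustrative stand-in, not the repository's actual `SupervisedLoss` implementation; the function name and the use of `torch.nn.functional.l1_loss` are assumptions:

```python
import torch

def sparse_l1_on_inverse_depth(pred_inv_depth, gt_depth):
    """Sketch of a sparse L1 loss computed in inverse-depth space.

    pred_inv_depth and gt_depth are [B,1,H,W] tensors; pixels with
    gt_depth == 0.0 are treated as invalid and masked out, mirroring
    the masking described above.
    """
    mask = (gt_depth > 0.0).detach()            # keep only valid pixels
    gt_inv_depth = torch.zeros_like(gt_depth)
    gt_inv_depth[mask] = 1.0 / gt_depth[mask]   # invalid pixels stay 0.0
    return torch.nn.functional.l1_loss(pred_inv_depth[mask], gt_inv_depth[mask])
```

Boolean indexing with `mask` flattens both tensors to the valid pixels only, so the loss never sees the zero-filled holes in the sparse map.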
We have experimented with different contexts on KITTI, and there is definitely a limit to how far apart context images can be before training breaks. They also cannot be too close, otherwise there is not enough motion, so for each dataset there is a "sweet spot" where self-supervised training works. Also, you don't need pose to train, so you can try intermediate sweeps, both adjacent and with strides; some skipping might get you to the right baseline.
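Striding over intermediate sweeps to tune the baseline could be sketched like this (a hypothetical helper, not part of the repository):

```python
def context_indices(target_idx, stride=1, num_context=1):
    """Pick backward/forward context frame indices around a target frame.

    stride > 1 skips intermediate sweeps, increasing the effective
    baseline between target and context frames; num_context controls
    how many frames are taken on each side.
    """
    backward = [target_idx - stride * k for k in range(1, num_context + 1)]
    forward = [target_idx + stride * k for k in range(1, num_context + 1)]
    return backward, forward
```

Sweeping over a few `stride` values during data-loader setup is one way to search for the "sweet spot" mentioned above.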
First of all, thanks for your reply.
So you suggest I move the depth2inv(...) transform from here
https://github.com/TRI-ML/packnet-sfm/blob/f824ffceba46ae1c621e1bf22a35634d8b39207c/packnet_sfm/models/SemiSupModel.py#L102-L104
inside this loss function
https://github.com/TRI-ML/packnet-sfm/blob/f824ffceba46ae1c621e1bf22a35634d8b39207c/packnet_sfm/losses/supervised_loss.py#L123
and change the loss function to something like this:
```python
def calculate_loss(self, inv_depths, gt_depths):
    """
    Calculate the supervised loss.

    Parameters
    ----------
    inv_depths : list of torch.Tensor [B,1,H,W]
        List of predicted inverse depth maps
    gt_depths : list of torch.Tensor [B,1,H,W]
        List of GROUND-TRUTH DEPTH MAPS

    Returns
    -------
    loss : torch.Tensor [1]
        Average supervised loss for all scales
    """
    # COMPUTE INVERSE DEPTH MAPS HERE
    gt_inv_depths = depth2inv(gt_depths)
    # If using a sparse loss, mask invalid pixels for all scales
    if self.supervised_method.startswith('sparse'):
        for i in range(self.n):
            # USE GT DEPTH MAPS HERE INSTEAD OF INV DEPTH MAPS
            mask = (gt_depths[i] > 0.).detach()
            inv_depths[i] = inv_depths[i][mask]
            gt_inv_depths[i] = gt_inv_depths[i][mask]
    # Return per-scale average loss
    return sum([self.loss_func(inv_depths[i], gt_inv_depths[i])
                for i in range(self.n)]) / self.n
```
Did you manage to get it working? I don't think it makes any difference where you do the inversion (at the model level or at the loss level); it should work the same way. When we apply the depth2inv inversion, the invalid depth pixels are kept as 0.0, so you can still mask them out. Alternatively, you can apply inv2depth to the predicted inverse depth maps and keep the ground-truth depth maps untouched, so the loss is applied directly on depth maps.
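The alternative suggested above, converting predictions back to depth and leaving the ground truth untouched, could look roughly like this. The reciprocal with clamping is a stand-in for the repository's inv2depth; all names here are illustrative:

```python
import torch

def supervised_loss_in_depth_space(pred_inv_depth, gt_depth):
    """Sketch: convert predicted inverse depth back to depth and apply
    a sparse L1 loss directly in depth space. gt_depth is untouched;
    0.0 still marks invalid pixels and is masked out."""
    pred_depth = 1.0 / pred_inv_depth.clamp(min=1e-6)  # inv2depth stand-in
    mask = (gt_depth > 0.0).detach()
    return torch.nn.functional.l1_loss(pred_depth[mask], gt_depth[mask])
```

Since the ground truth never goes through an inversion, its zero-valued holes remain exactly 0.0 and the mask stays reliable.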
Yeah, I also found that the default values are indeed kept untouched, and hence changed everything back ^^ I also found a mistake of mine, after which the network currently seems to train properly. I can post the results in a couple of days. Currently, training on all NuScenes samples takes about 7 h per epoch on one GPU.
Without supervision, it still fails, so I want to try different context distances, or maybe even two backward and forward contexts. Maybe that will stabilize it.
I will keep posting my findings for other people trying to use NuScenes.
@maskedmeerkat Hi, I'm also thinking of using the NuScenes dataset for training. Can you please share your working NuScenes dataloader?
Thanks
Currently working on using different contexts. When it's working, I'll post an update here.
Hi @maskedmeerkat,
did you manage to get rid of the horizontal lines after all? I am encountering a similar problem training my own model on NuScenes.
Any help is appreciated :)
Hi,
first of all, I want to join the others in congratulating you on your great work and for even providing this well-documented code! Thanks.
I am currently trying to write a data loader for the NuScenes dataset. The data split files are formatted as follows:
sample_token | backward_context_png | forward_context_png
and the sample is then constructed using this routine.
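Parsing one line of a split file in the format sketched above could look like this. This is a hypothetical helper based only on the field layout described in the post, not an official NuScenes or packnet-sfm API:

```python
def parse_split_line(line):
    """Parse one split-file line of the form
    'sample_token | backward_context_png | forward_context_png'."""
    sample_token, backward_png, forward_png = [f.strip() for f in line.split('|')]
    return {
        'sample_token': sample_token,
        'backward_context': backward_png,
        'forward_context': forward_png,
    }
```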
However, my training results look as follows
One problem I encountered, though I am not certain whether it's related, is that I had to change the cfg.checkpoint.monitor entry from "loss" to "abs_rel", since the "loss" entry is not initialized when training is False.
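In a YAML config override, the workaround described above would look roughly like this (the key layout is a sketch based on the cfg.checkpoint.monitor path mentioned, not copied from the repository's default config):

```yaml
checkpoint:
    monitor: 'abs_rel'   # instead of 'loss', which stays uninitialized when training is False
```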
My questions are:
Thanks in advance for your time and help.