PJLab-ADG / neuralsim

neuralsim: 3D surface reconstruction and simulation based on 3D neural rendering.
MIT License

KeyError when training streetsurf on seg100613 #46

Open amoghskanda opened 5 months ago

amoghskanda commented 5 months ago

Firstly, great work and thanks for making it open-source. I set up everything following the readme for both streetsurf and nr3d. I wanted to use the withmask_nolidar.240219.yaml config file, so I changed the path and sequence to use seg100613 (quick-downloaded from the streetsurf repo). The generated scenario.pt file seems incomplete: waymo_dataset.py accesses frame_timestamps (line 406), which is not a valid key in the scenario dictionary. There is another KeyError at line 506 of waymo_dataset.py: there is no global_timestamps key in the scenario['observers']['ego_car']['data'] dictionary. Can you share the complete scenario.pt file, or the zip file for the segment-13476374534576730229_240_000_260_000_with_camera_labels sequence?
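
For anyone hitting the same KeyError, a quick standalone snippet to check which keys the generated scenario.pt actually contains (the key names are the ones referenced above; this assumes the file loads with plain torch.load, which may differ if it was saved another way):

    import torch

    scenario = torch.load("scenario.pt", map_location="cpu")
    # Keys that waymo_dataset.py expects, per the errors above:
    print(scenario["metas"].keys())                          # should include 'frame_timestamps'
    print(scenario["observers"]["ego_car"]["data"].keys())   # should include 'global_timestamps'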

zzzxxxttt commented 5 months ago

I encountered the same issue; the problem was solved after checking out the latest commit (faba099e0feb11ea0089490a5e87565e25bc4a2c) and re-generating the data.

zzzxxxttt commented 5 months ago

By the way, if anyone encounters TypeError: __init__() takes 1 positional argument but 2 were given, just replace @torch.no_grad with with torch.no_grad(): in nr3d_lib/models/fields/nerf/lotd_nerf.py:

    # @torch.no_grad
    def query_density(self, x: torch.Tensor):
        with torch.no_grad():
            # NOTE: x must be in range [-1,1]
            ...
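
For context, this TypeError happens when the bare torch.no_grad class is used as a decorator on PyTorch versions whose no_grad.__init__ takes no arguments, so the decorated function gets passed straight into __init__. A minimal standalone sketch of two equivalent workarounds (toy query_density, not the nr3d_lib one):

    import torch

    class FieldA:
        # Option 1: call the decorator, so a no_grad() instance wraps the function.
        @torch.no_grad()
        def query_density(self, x: torch.Tensor):
            return x.abs().sum(dim=-1)

    class FieldB:
        # Option 2: explicit context manager inside the method (the fix quoted above).
        def query_density(self, x: torch.Tensor):
            with torch.no_grad():
                return x.abs().sum(dim=-1)

    x = torch.rand(8, 3) * 2 - 1  # toy input in [-1, 1]
    assert torch.allclose(FieldA().query_density(x), FieldB().query_density(x))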
amoghskanda commented 5 months ago

@zzzxxxttt thank you for the reply. The KeyError persists. The problem is with the scenario.pt file, as scenario['metas'] has no key named 'frame_timestamps'. Can you upload your scenario.pt file? This is for seg100613.

zzzxxxttt commented 5 months ago

@amoghskanda sure, here it is scenario.zip


amoghskanda commented 5 months ago

Thank you for the scenario.pt file. @zzzxxxttt did you face the error below?

__init__() got an unexpected keyword argument 'fn_type' at line 183 of train.py. MonoDepthLoss is called with parameters that are not accepted by the class's __init__, as defined in app/loss/mono.py (class MonoDepthLoss).
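
To make the mismatch concrete, here is a hypothetical sketch of that kind of call (the real signatures in app/loss/mono.py and train.py may differ; the parameter names below are illustrative only):

    # Hypothetical sketch, not the actual neuralsim code.
    class MonoDepthLoss:
        def __init__(self, w: float = 1.0):  # no 'fn_type' parameter declared here
            self.w = w

    loss_cfg = dict(w=0.5, fn_type="l1")     # the config still passes 'fn_type'
    loss = MonoDepthLoss(**loss_cfg)         # TypeError: __init__() got an unexpected
                                             # keyword argument 'fn_type'

Dropping the extra key from the config, or adding the corresponding parameter (or **kwargs) to MonoDepthLoss.__init__, would silence this particular error.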

amoghskanda commented 5 months ago

I made some changes to mono.py, used MonoSDFDepthLoss instead, and somewhat fixed it. Now I'm getting RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu). This is because the cache is loaded on the CPU while everything else is on the GPU (cuda:0). Is there a fix for this? I preloaded the cache onto the GPU (RTX 3090), but then it runs out of memory. After reducing n_frames in withmask_nolidar.240219.yaml for segment-100613 from 163 to 30, I can load the camera cache onto the GPU, but then I run into RuntimeError: The size of tensor a (65536) must match the size of tensor b (256) at non-singleton dimension 1. What batch size did you train with? @ventusff @zzzxxxttt
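
For reference, a minimal standalone reproduction of that first RuntimeError and the generic way around it (toy tensors on a CUDA machine, not the actual neuralsim cache):

    import torch

    images = torch.rand(163, 3, 32, 32)                 # cache kept on CPU
    frame_ind = torch.tensor([0, 5, 7], device="cuda")  # sampled indices on GPU

    # images[frame_ind]  # RuntimeError: indices should be either on cpu or on the
    #                    # same device as the indexed tensor (cpu)

    batch = images[frame_ind.cpu()]          # index with CPU indices first,
    batch = batch.cuda(non_blocking=True)    # then move only the gathered frames to GPU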

zzzxxxttt commented 5 months ago

> Thank you for the scenario.pt file. @zzzxxxttt did you face the error below?
>
> __init__() got an unexpected keyword argument 'fn_type' at line 183 of train.py. MonoDepthLoss is called with parameters that are not accepted by the class's __init__, as defined in app/loss/mono.py (class MonoDepthLoss).

No, I didn't meet this error.

zzzxxxttt commented 5 months ago

> I made some changes to mono.py, used MonoSDFDepthLoss instead, and somewhat fixed it. Now I'm getting RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu). [...] What batch size did you train with? @ventusff @zzzxxxttt

I also use withmask_nolidar.240219.yaml and only modified the data location; I can train it on my 12 GB RTX 3060 without error.

amoghskanda commented 5 months ago

So your data is loaded into the cache, right? And you did not make any changes to which device the data and the model are loaded onto? I have an RTX 3090 and the data is loaded onto the CPU, and I run into RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu). preload_on_gpu is false in withmask.yaml (by default), and I did not change which device anything is loaded onto.

sonnefred commented 5 months ago

> I made some changes to mono.py, used MonoSDFDepthLoss instead, and somewhat fixed it. Now I'm getting RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu). [...] What batch size did you train with? @ventusff @zzzxxxttt
>
> I also use withmask_nolidar.240219.yaml and only modified the data location; I can train it on my 12 GB RTX 3060 without error.

Hi, I'm also trying to use withmask_nolidar.240219.yaml, but I get an error when loading the images to build the ImagePatchDataset. Have you met this error, and how did you solve it? Thanks! [error screenshot]

amoghskanda commented 5 months ago

Yes. I removed **kwargs from the call to get_frame_weights_uniform() at line 66 of dataloader/sampler.py, because that function, defined later in the file, takes only two arguments:

    frame_weights = get_frame_weights_uniform(scene_loader, scene_weights)

sonnefred commented 5 months ago

> Yes. I removed **kwargs from the call to get_frame_weights_uniform() at line 66 of dataloader/sampler.py, because that function, defined later in the file, takes only two arguments:
>
>     frame_weights = get_frame_weights_uniform(scene_loader, scene_weights)

Thank you for the reply. Now I've hit a new error like this. Have you met this before? [error screenshot]

amoghskanda commented 5 months ago

Yes. I tried caching on the GPU instead of the CPU and changed n_frames in the config file from 163 to 30 for seg100613, and encountered the above error. When I reverted to the default settings (cache on CPU and 163 frames), I ran into #51.

amoghskanda commented 5 months ago

> I made some changes to mono.py, used MonoSDFDepthLoss instead, and somewhat fixed it. Now I'm getting RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu). [...] What batch size did you train with? @ventusff @zzzxxxttt
>
> I also use withmask_nolidar.240219.yaml and only modified the data location; I can train it on my 12 GB RTX 3060 without error.

Right, the cache is on the CPU. The tensors frame_ind, h, and w are on the CPU as well, and so is _ret_image_raw. Not sure why I'm facing #51.

sonnefred commented 5 months ago

> Yes. I tried caching on the GPU instead of the CPU and changed n_frames in the config file from 163 to 30 for seg100613, and encountered the above error. When I reverted to the default settings (cache on CPU and 163 frames), I ran into #51.

Ok, have you solved the problem?

amoghskanda commented 5 months ago

Not yet, working on it. Try training without changing n_frames in the config file, and let me know if you run into the same issue as me.

sonnefred commented 5 months ago

> Not yet, working on it. Try training without changing n_frames in the config file, and let me know if you run into the same issue as me.

Sorry, I'm trying to run code_multi, but I got an error like this. Have you met this before? [error screenshot]
amoghskanda commented 5 months ago

@sonnefred I used another config (with mask, with lidar) and was able to train and render as well.

amoghskanda commented 5 months ago

@zzzxxxttt did you try rendering nvs with different nvs paths like spherical_spiral or small_circle?

sonnefred commented 5 months ago

> @sonnefred I used another config (with mask, with lidar) and was able to train and render as well.

OK, thank you, but I'd like to use monodepth supervision, so I'm still working on it...

sonnefred commented 5 months ago

> I made some changes to mono.py, used MonoSDFDepthLoss instead, and somewhat fixed it. Now I'm getting RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu). [...] What batch size did you train with? @ventusff @zzzxxxttt
>
> I also use withmask_nolidar.240219.yaml and only modified the data location; I can train it on my 12 GB RTX 3060 without error.

@zzzxxxttt Hi, how did you run this experiment successfully? I still get a CUDA error when using this yaml... Could you give any help? Thanks.

lhp121 commented 2 months ago

    2024-06-11 19:16:01,146-rk0-train.py#959:=> Start loading data, for experiment: logs/streetsurf/seg100613.nomask_withlidar_exp1
    2024-06-11 19:16:01,146-rk0-base.py#88:=> Caching data to device=cpu...
    2024-06-11 19:16:01,146-rk0-base.py#95:=> Caching camera data...
    Caching cameras...: 0%| | 0/3 [00:00<?, ?it/s]
    Process finished with exit code 137 (interrupted by signal 9: SIGKILL)

Has anyone encountered this error before, and how can I adjust the parameters to make it run on my GTX 1660 Ti graphics card?