georghess / neurad-studio

[CVPR2024] NeuRAD: Neural Rendering for Autonomous Driving
https://research.zenseact.com/publications/neurad/
Apache License 2.0

Run train.py with pandaset error in docker container #7

Closed TurtleZhong closed 7 months ago

TurtleZhong commented 7 months ago

Hi, following your steps I have built the docker image successfully, and I just want to train the model with pandaset. I start the container using docker-compose; the neurad_docker.yaml file is:

version: '3'
services:
  service1:
    container_name: nerf_neurad_studio
    image: neurad_studio:v0.1
    privileged: true
    runtime: nvidia
    network_mode: host
    devices:
      # - "/dev/shm:/dev/shm"
      - "/dev/nvidia0:/dev/nvidia0"
    environment:
      - DISPLAY=$DISPLAY
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=all
      - QT_X11_NO_MITSHM=1 # Fix a bug with QT
      - SDL_VIDEODRIVER=x11
    volumes:
      - "/tmp/.X11-unix:/tmp/.X11-unix:rw"
      - "/dev/shm:/dev/shm"
      - "/home/hil/work_zxl/Nerf/neurad-studio:/workspace/neurad-studio"
      - "/home/hil/work_zxl/Nerf/neurad-studio/data:/data"
      - "/lib/modules:/lib/modules"
    command: tail -f /dev/null

Then I start the container and run the following commands inside it:

sudo docker exec -it nerf_neurad_studio /bin/bash
python nerfstudio/scripts/train.py neurad pandaset-data

I get the following error:

[12:15:12] Saving config to: outputs/unnamed/neurad/2024-04-24_121512/config.yml                experiment_config.py:139
           Saving checkpoints to: outputs/unnamed/neurad/2024-04-24_121512/nerfstudio_models              trainer.py:194
Traceback (most recent call last):
  File "/workspace/neurad-studio/nerfstudio/scripts/train.py", line 278, in <module>
    entrypoint()
  File "/workspace/neurad-studio/nerfstudio/scripts/train.py", line 269, in entrypoint
    main(
  File "/workspace/neurad-studio/nerfstudio/scripts/train.py", line 254, in main
    launch(
  File "/workspace/neurad-studio/nerfstudio/scripts/train.py", line 196, in launch
    main_func(local_rank=0, world_size=world_size, config=config)
  File "/workspace/neurad-studio/nerfstudio/scripts/train.py", line 106, in train_loop
    trainer.setup()
  File "/nerfstudio/nerfstudio/engine/trainer.py", line 210, in setup
    self.pipeline = self.config.pipeline.setup(
  File "/nerfstudio/nerfstudio/configs/base_config.py", line 54, in setup
    return self._target(self, **kwargs)
  File "/nerfstudio/nerfstudio/pipelines/ad_pipeline.py", line 62, in __init__
    super().__init__(config, **kwargs)
  File "/nerfstudio/nerfstudio/pipelines/base_pipeline.py", line 257, in __init__
    self.datamanager: DataManager = config.datamanager.setup(
  File "/nerfstudio/nerfstudio/configs/base_config.py", line 54, in setup
    return self._target(self, **kwargs)
  File "/nerfstudio/nerfstudio/data/datamanagers/image_lidar_datamanager.py", line 195, in __init__
    super().__init__(config, device, test_mode, world_size, local_rank, **kwargs)
  File "/nerfstudio/nerfstudio/data/datamanagers/parallel_datamanager.py", line 163, in __init__
    self.train_dataparser_outputs: DataparserOutputs = self.dataparser.get_dataparser_outputs(split="train")
  File "/nerfstudio/nerfstudio/data/dataparsers/base_dataparser.py", line 171, in get_dataparser_outputs
    dataparser_outputs = self._generate_dataparser_outputs(split, **kwargs)
  File "/nerfstudio/nerfstudio/data/dataparsers/pandaset_dataparser.py", line 387, in _generate_dataparser_outputs
    self.sequence = pandaset[self.config.sequence]
  File "/usr/local/lib/python3.10/dist-packages/pandaset/dataset.py", line 28, in __getitem__
    return self._sequences[item]
KeyError: '001'

My file structure in the container looks like this:

tree /data -L 2
/data
`-- pandaset
    |-- 001
    |-- 002
    |-- 003
    |-- 004
    |-- 005
    |-- 006
    |-- 008
    |-- 011
    |-- 012
    |-- 013
    |-- 014
    |-- 015
    |-- 016
    |-- 017
    |-- 018
    |-- 019
    |-- 020
    |-- 021
    |-- 023
    |-- 024
    |-- 027
    |-- 028
    |-- 029
    |-- 030
    |-- 032
    |-- 033
    |-- 034
    |-- 035
    |-- 037
    |-- 038
    |-- 039
    |-- 040
    |-- 041
    |-- 042
    |-- 043
    |-- 044
    |-- 045
    |-- 046
    `-- 047

40 directories, 0 files

Additional context: since I do not know the expected pandaset path, I also created a folder named data/pandaset in /workspace/neurad-studio:

root@hil-pc:/workspace/neurad-studio# tree /workspace/neurad-studio/data/ -L 2
/workspace/neurad-studio/data/
`-- pandaset
    |-- 001
    |-- 002
    |-- 003
    |-- 004
    |-- 005
    |-- 006
    |-- 008
    |-- 011
    |-- 012
    |-- 013
    |-- 014
    |-- 015
    |-- 016
    |-- 017
    |-- 018
    |-- 019
    |-- 020
    |-- 021
    |-- 023
    |-- 024
    |-- 027
    |-- 028
    |-- 029
    |-- 030
    |-- 032
    |-- 033
    |-- 034
    |-- 035
    |-- 037
    |-- 038
    |-- 039
    |-- 040
    |-- 041
    |-- 042
    |-- 043
    |-- 044
    |-- 045
    |-- 046
    `-- 047

40 directories, 0 files
georghess commented 7 months ago

Hi,

You can specify the path to the dataset root with python nerfstudio/scripts/train.py neurad pandaset-data --data <your-panda-root>, e.g. python nerfstudio/scripts/train.py neurad pandaset-data --data /data/pandaset

Let me know if that resolves your issue. I'll clarify this in the readme.

TurtleZhong commented 7 months ago

Hi, I tried the command python nerfstudio/scripts/train.py neurad pandaset-data --data /data/pandaset and got another error:

Traceback (most recent call last):
  File "/workspace/neurad-studio/nerfstudio/scripts/train.py", line 278, in <module>
    entrypoint()
  File "/workspace/neurad-studio/nerfstudio/scripts/train.py", line 269, in entrypoint
    main(
  File "/workspace/neurad-studio/nerfstudio/scripts/train.py", line 254, in main
    launch(
  File "/workspace/neurad-studio/nerfstudio/scripts/train.py", line 196, in launch
    main_func(local_rank=0, world_size=world_size, config=config)
  File "/workspace/neurad-studio/nerfstudio/scripts/train.py", line 106, in train_loop
    trainer.setup()
  File "/nerfstudio/nerfstudio/engine/trainer.py", line 210, in setup
    self.pipeline = self.config.pipeline.setup(
  File "/nerfstudio/nerfstudio/configs/base_config.py", line 54, in setup
    return self._target(self, **kwargs)
  File "/nerfstudio/nerfstudio/pipelines/ad_pipeline.py", line 62, in __init__
    super().__init__(config, **kwargs)
  File "/nerfstudio/nerfstudio/pipelines/base_pipeline.py", line 257, in __init__
    self.datamanager: DataManager = config.datamanager.setup(
  File "/nerfstudio/nerfstudio/configs/base_config.py", line 54, in setup
    return self._target(self, **kwargs)
  File "/nerfstudio/nerfstudio/data/datamanagers/image_lidar_datamanager.py", line 195, in __init__
    super().__init__(config, device, test_mode, world_size, local_rank, **kwargs)
  File "/nerfstudio/nerfstudio/data/datamanagers/parallel_datamanager.py", line 163, in __init__
    self.train_dataparser_outputs: DataparserOutputs = self.dataparser.get_dataparser_outputs(split="train")
  File "/nerfstudio/nerfstudio/data/dataparsers/base_dataparser.py", line 171, in get_dataparser_outputs
    dataparser_outputs = self._generate_dataparser_outputs(split, **kwargs)
  File "/nerfstudio/nerfstudio/data/dataparsers/pandaset_dataparser.py", line 390, in _generate_dataparser_outputs
    return super()._generate_dataparser_outputs(split)
  File "/nerfstudio/nerfstudio/data/dataparsers/ad_dataparser.py", line 177, in _generate_dataparser_outputs
    lidars, pc_filenames = self._get_lidars() if self.config.lidars else (_empty_lidars(), [])
  File "/nerfstudio/nerfstudio/data/dataparsers/pandaset_dataparser.py", line 224, in _get_lidars
    filename = self.sequence.lidar._data_structure[i]
IndexError: list index out of range
root@hil-pc:/workspace/neurad-studio# ls data/pandaset/
001  003  005  008  012  014  016  018  020  023  027  029  032  034  037  039  041  043  045  047
002  004  006  011  013  015  017  019  021  024  028  030  033  035  038  040  042  044  046
root@hil-pc:/workspace/neurad-studio# pip3 list | grep pandaset
pandaset                  0.3.dev0

I will check this file /nerfstudio/nerfstudio/data/dataparsers/pandaset_dataparser.py later.

The dataset I downloaded from this link has the following structure:

root@hil-pc:/data/pandaset/001# tree -L 2
.
|-- LICENSE.txt
|-- annotations
|   |-- cuboids
|   `-- semseg
|-- camera
|   |-- back_camera
|   |-- front_camera
|   |-- front_left_camera
|   |-- front_right_camera
|   |-- left_camera
|   `-- right_camera
|-- lidar
|   |-- 00.pkl
---------
|   |-- 79.pkl
|   |-- poses.json
|   `-- timestamps.json
`-- meta
    |-- gps.json
    `-- timestamps.json

12 directories, 85 files
root@hil-pc:/data/pandaset/001# 

while the dataset in the pandaset-devkit GitHub repo has this structure:

.
├── LICENSE.txt
├── annotations
│   ├── cuboids
│   │   ├── 00.pkl.gz
│   │   .
│   │   .
│   │   .
│   │   └── 79.pkl.gz
│   └── semseg  // Semantic Segmentation is available for specific scenes
│       ├── 00.pkl.gz
│       .
│       .
│       .
│       ├── 79.pkl.gz
│       └── classes.json
├── camera
│   ├── back_camera
│   │   ├── 00.jpg
│   │   .
│   │   .
│   │   .
│   │   ├── 79.jpg
│   │   ├── intrinsics.json
│   │   ├── poses.json
│   │   └── timestamps.json
│   ├── front_camera
│   │   └── ...
│   ├── front_left_camera
│   │   └── ...
│   ├── front_right_camera
│   │   └── ...
│   ├── left_camera
│   │   └── ...
│   └── right_camera
│       └── ...
├── lidar
│   ├── 00.pkl.gz
│   .
│   .
│   .
│   ├── 79.pkl.gz
│   ├── poses.json
│   └── timestamps.json
└── meta
    ├── gps.json
    └── timestamps.json

I checked the source code: the *.pkl.gz extension matters when loading the dataset. So I actually used the wrong dataset link from your README, which is why loading the dataset failed.

amoghskanda commented 7 months ago

Hey @TurtleZhong, thank you for the update on the dataset. When I visit the PandaSet website (https://pandaset.org/) and click on Download Dataset, it redirects me to Scale (https://scale.com/resources/download/pandaset), where I get "No matching content found". How did you manage to download the dataset from PandaSet?

@georghess, I would be glad if you could take a look at this. I'm asking because there was an issue while loading the sequence lidar data: filename = self.sequence.lidar._data_structure[i] raises IndexError: list index out of range.

I get this at line 224 of pandaset_dataparser.py. After some debugging, I found that self.sequence.lidar._data_structure is an empty list, indicating that the sequence lidar data did not load properly.
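
For reference, a quick way to confirm whether the devkit can see the lidar files at all is a small check like the following (a sketch that assumes the standard pandaset-devkit API and a dataset root of /data/pandaset):

# Sanity check: does the pandaset-devkit find any lidar frames for a sequence?
# With the Kaggle download (uncompressed *.pkl files) this list comes back
# empty, which is what leads to the IndexError in the dataparser.
from pandaset import DataSet

dataset = DataSet("/data/pandaset")
seq = dataset["001"]
# The dataparser indexes this list directly (pandaset_dataparser.py line 224).
print(len(seq.lidar._data_structure), "lidar files found")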

TurtleZhong commented 7 months ago

Hey @TurtleZhong, thank you for the update on the dataset. When I visit the PandaSet website (https://pandaset.org/) and click on Download Dataset, it redirects me to Scale (https://scale.com/resources/download/pandaset), where I get "No matching content found". How did you manage to download the dataset from PandaSet?

@georghess, I would be glad if you could take a look at this. I'm asking because there was an issue while loading the sequence lidar data: filename = self.sequence.lidar._data_structure[i] raises IndexError: list index out of range.

I get this at line 224 of pandaset_dataparser.py. After some debugging, I found that self.sequence.lidar._data_structure is an empty list, indicating that the sequence lidar data did not load properly.

Download the dataset from here (https://www.kaggle.com/datasets/usharengaraju/pandaset-dataset/discussion); the address is actually in the README. But you need to modify something, since the dataset structure is a little different from the original pandaset. If you unzip the dataset and place it in '/data/pandaset', run:

gzip -k 001/lidar/*.pkl
gzip -k 001/annotations/cuboids/*.pkl
gzip -k 001/annotations/semseg/*.pkl
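
For reference, a small Python sketch (not part of the repo; it assumes the Kaggle layout above, with uncompressed *.pkl files under /data/pandaset) that applies the same conversion to every sequence in one pass:

import gzip
import shutil
from pathlib import Path

root = Path("/data/pandaset")
patterns = ("lidar/*.pkl", "annotations/cuboids/*.pkl", "annotations/semseg/*.pkl")

for sequence in sorted(p for p in root.iterdir() if p.is_dir()):
    for pattern in patterns:
        for pkl in sequence.glob(pattern):
            # Equivalent to `gzip -k`: writes e.g. 00.pkl.gz next to 00.pkl.
            with open(pkl, "rb") as src, gzip.open(pkl.with_name(pkl.name + ".gz"), "wb") as dst:
                shutil.copyfileobj(src, dst)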

After converting the files, run python nerfstudio/scripts/train.py neurad pandaset-data --data /data/pandaset and you will get:

Step (% Done)       Train Iter (time)    ETA (time)           Train Rays / Sec                       
-----------------------------------------------------------------------------------                  
19100 (95.50%)      318.018 ms           4 m, 46 s            128.84 K                               
19200 (96.00%)      314.742 ms           4 m, 12 s            130.17 K                               
19300 (96.50%)      316.493 ms           3 m, 41 s            129.49 K                               
19400 (97.00%)      312.945 ms           3 m, 8 s             130.94 K                               
19500 (97.50%)      317.075 ms           2 m, 38 s            129.21 K                               
19600 (98.00%)      312.256 ms           2 m, 5 s             131.21 K                               
19700 (98.50%)      318.484 ms           1 m, 35 s            128.63 K                               
19800 (99.00%)      314.081 ms           1 m, 3 s             130.44 K                               
19900 (99.50%)      314.238 ms           31 s, 738.033 ms     130.40 K                               
20000 (100.00%)     313.234 ms           313.234 ms           130.84 K                               
---------------------------------------------------------------------------------------------------- 
Viewer running locally at: http://localhost:7007 (listening on 0.0.0.0)                              
╭─────────────────────────────── 🎉 Training Finished 🎉 ───────────────────────────────╮
│                        ╷                                                              │
│   Config File          │ outputs/unnamed/neurad/2024-04-25_031752/config.yml          │
│   Checkpoint Directory │ outputs/unnamed/neurad/2024-04-25_031752/nerfstudio_models   │
│                        ╵                                                              │
╰───────────────────────────────────────────────────────────────────────────────────────╯
                                                   Use ctrl+c to quit              

@georghess I think it is necessary to write more detailed steps in the README.

amoghskanda commented 7 months ago

@TurtleZhong thank you for the above commands. The data preparation worked. However, I think there is an issue with the viewer: AssertionError: Something went wrong! At least one of the client source or build directories should be present.

Did you have to change anything in the config or the websocket port before training? The issue is in viewer/viewer.py line 110 and seems viser-server related. I'm running the project locally in a conda env, but I'm behind a firewall. Thanks

atonderski commented 7 months ago

@amoghskanda we fixed an issue related to that error recently. Can you try reinstalling the latest viser? pip install --upgrade git+https://github.com/atonderski/viser.git

georghess commented 7 months ago

@TurtleZhong thanks for helping out on this! As you've seen, Scale has stopped hosting the dataset (where we downloaded it ~1 year ago). I was not aware that the one hosted on kaggle has a different file format, but I'll update the readme.

amoghskanda commented 7 months ago

Hey @atonderski, thank you for the reply. However, it still throws the same error. I changed the websocket_port to 3028 and the websocket_host to '127.0.0.1', and it doesn't work. I'm behind a corporate proxy and firewall; are there any changes I have to make to the port and the host address? Thanks

atonderski commented 7 months ago

"At least one of the client source or build directories should be present" indicates an error in the installation of the viser package, as you are unable to build the web client. I don't think it can be related to firewalls or proxies (although those things could for sure cause other issues). Do you have a longer traceback? Also, can you run ls $(python -c "import viser; print(viser.__path__[0])")/client in your environment?
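
For anyone debugging the same thing, an equivalent check from inside Python (a sketch that only assumes viser is importable) is:

# List what is packaged under viser/client; a missing or empty directory is
# what triggers the assertion in viser/_client_autobuild.py.
from pathlib import Path
import viser

client_dir = Path(viser.__path__[0]) / "client"
print(client_dir, "exists:", client_dir.exists())
if client_dir.exists():
    print(sorted(p.name for p in client_dir.iterdir()))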

atonderski commented 7 months ago

@TurtleZhong we just realized that the pandaset hosted on kaggle only contains around half of the sequences. Are you aware of somewhere we can find the remaining half? :)

TurtleZhong commented 7 months ago

I have trained with pandaset and found 'steps_per_save: 2000' in 'output/.../config.yaml', but when I look into the output folder, outputs/unnamed/neurad/2024-04-25_062738/nerfstudio_models/ is empty:

nerfstudio_models git:(main) ✗ ll
total 0
➜  nerfstudio_models git:(main) ✗ 

while the train.py log is:

[06:27:38] Saving config to: outputs/unnamed/neurad/2024-04-25_062738/config.yml                experiment_config.py:139
           Saving checkpoints to: outputs/unnamed/neurad/2024-04-25_062738/nerfstudio_models              trainer.py:194
Variable resolution, using variable_res_collate
Loading data batch ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:04
Loading data batch ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
Started processes
Setting up evaluation dataset...
Caching all 240 images.
Caching all 40 images.
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=VGG19_Weights.IMAGENET1K_V1`. You can also use `weights=VGG19_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
(viser) No client build found. Building now...
(viser) nodejs is set up!
(node:7871) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 close listeners added to [TLSSocket]. Use emitter.setMaxListeners() to increase limit
(Use `node --trace-warnings ...` to show where the warning was created)
yarn install v1.22.22
[1/4] Resolving packages...
success Already up-to-date.
Done in 0.26s.
yarn run v1.22.22
$ tsc && vite build
vite v5.2.6 building for production...
✓ 2 modules transformed.
x Build failed in 210ms
error during build:
Error: [vite-plugin-eslint] Failed to load config "react-app" to extend from.
Referenced from: /usr/local/lib/python3.10/dist-packages/viser/client/package.json
file: /usr/local/lib/python3.10/dist-packages/viser/client/src/index.tsx
    at configInvalidError (/usr/local/lib/python3.10/dist-packages/viser/client/node_modules/@eslint/eslintrc/dist/eslintrc.cjs:2648:9)
    at ConfigArrayFactory._loadExtendedShareableConfig (/usr/local/lib/python3.10/dist-packages/viser/client/node_modules/@eslint/eslintrc/dist/eslintrc.cjs:3279:23)
    at ConfigArrayFactory._loadExtends (/usr/local/lib/python3.10/dist-packages/viser/client/node_modules/@eslint/eslintrc/dist/eslintrc.cjs:3156:25)
    at ConfigArrayFactory._normalizeObjectConfigDataBody (/usr/local/lib/python3.10/dist-packages/viser/client/node_modules/@eslint/eslintrc/dist/eslintrc.cjs:3095:25)
    at _normalizeObjectConfigDataBody.next (<anonymous>)
    at ConfigArrayFactory._normalizeObjectConfigData (/usr/local/lib/python3.10/dist-packages/viser/client/node_modules/@eslint/eslintrc/dist/eslintrc.cjs:3040:20)
    at _normalizeObjectConfigData.next (<anonymous>)
    at ConfigArrayFactory.loadInDirectory (/usr/local/lib/python3.10/dist-packages/viser/client/node_modules/@eslint/eslintrc/dist/eslintrc.cjs:2886:28)
    at CascadingConfigArrayFactory._loadConfigInAncestors (/usr/local/lib/python3.10/dist-packages/viser/client/node_modules/@eslint/eslintrc/dist/eslintrc.cjs:3871:46)
    at CascadingConfigArrayFactory._loadConfigInAncestors (/usr/local/lib/python3.10/dist-packages/viser/client/node_modules/@eslint/eslintrc/dist/eslintrc.cjs:3890:20)
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
╭─────────────── viser ───────────────╮
│             ╷                       │
│   HTTP      │ http://0.0.0.0:7007   │
│   Websocket │ ws://0.0.0.0:7007     │
│             ╵                       │
╰─────────────────────────────────────╯
[NOTE] Not running eval iterations since only viewer is enabled.
Use --vis {wandb, tensorboard, viewer+wandb, viewer+tensorboard} to run with eval.
No Nerfstudio checkpoint to load, so training from scratch.
Disabled comet/tensorboard/wandb event writers
[06:29:01] Printing max of 10 lines. Set flag --logging.local-writer.max-log-size=0 to disable line        writer.py:449
           wrapping.                                                                                                    
Step (% Done)       Train Iter (time)    ETA (time)           Train Rays / Sec                       
-----------------------------------------------------------------------------------                  
2500 (12.50%)       308.525 ms           1 h, 29 m, 59 s      132.79 K                               
2600 (13.00%)       309.272 ms           1 h, 29 m, 41 s      132.50 K                               
2700 (13.50%)       311.918 ms           1 h, 29 m, 56 s      131.36 K                               
2800 (14.00%)       316.815 ms           1 h, 30 m, 49 s      129.31 K                               
2900 (14.50%)       322.783 ms           1 h, 31 m, 59 s      126.98 K                               
3000 (15.00%)       317.803 ms           1 h, 30 m, 2 s       128.92 K                               
3100 (15.50%)       319.304 ms           1 h, 29 m, 56 s      128.32 K                               
3200 (16.00%)       310.104 ms           1 h, 26 m, 50 s      132.14 K                               
3300 (16.50%)       310.348 ms           1 h, 26 m, 23 s      132.07 K                               
3400 (17.00%)       319.967 ms           1 h, 28 m, 31 s      128.03 K                               
---------------------------------------------------------------------------------------------------- 
Viewer running locally at: http://localhost:7007 (listening on 0.0.0.0)                              

besides, I could not open http://localhost:7007.

TurtleZhong commented 7 months ago

@TurtleZhong we just realized that the pandaset hosted on kaggle only contains around half of the sequences. Are you aware of somewhere we can find the remaining half? :)

I am sorry, but I do not know. I wonder if you could share a zip file, or maybe we can use some other dataset.

amoghskanda commented 7 months ago

"At least one of the client source or build directories should be present" indicates an error in the installation of the viser package, as you are unable to build the web client. I don't think it can be related to firewalls or proxies (although those things could for sure cause other issues). Do you have a longer traceback? Also, can you run ls $(python -c "import viser; print(viser.__path__[0])")/client in your environment?

When I run the above command I get "No such file exists", i.e. viser/client is not present. Here's the error trail:

Traceback (most recent call last):
  File "nerfstudio/scripts/train.py", line 278, in <module>
    entrypoint()
  File "nerfstudio/scripts/train.py", line 269, in entrypoint
    main(
  File "nerfstudio/scripts/train.py", line 254, in main
    launch(
  File "nerfstudio/scripts/train.py", line 196, in launch
    main_func(local_rank=0, world_size=world_size, config=config)
  File "nerfstudio/scripts/train.py", line 106, in train_loop
    trainer.setup()
  File "/home/user/neurad/nerfstudio/engine/trainer.py", line 239, in setup
    self.viewer_state = ViewerState(
  File "/home/user/neurad/nerfstudio/viewer/viewer.py", line 110, in __init__
    self.viser_server = viser.ViserServer(host=config.websocket_host, port=websocket_port)
  File "/home/user/.local/lib/python3.8/site-packages/viser/_viser.py", line 364, in _actual_init
    _client_autobuild.ensure_client_is_built()
  File "/home/user/.local/lib/python3.8/site-packages/viser/_client_autobuild.py", line 33, in ensure_client_is_built
    assert (build_dir / "index.html").exists(), (
AssertionError: Something went wrong! At least one of the client source or build directories should be present.

georghess commented 7 months ago

@TurtleZhong I put PandaSet at https://huggingface.co/datasets/georghess/pandaset/tree/main; I'll update the download instructions later.

TurtleZhong commented 7 months ago

I have trained with pandaset and found 'steps_per_save: 2000' in 'output/.../config.yaml', but when I look into the output folder, outputs/unnamed/neurad/2024-04-25_062738/nerfstudio_models/ is empty:

nerfstudio_models git:(main) ✗ ll
total 0
➜  nerfstudio_models git:(main) ✗ 

while the train.py log is:

[06:27:38] Saving config to: outputs/unnamed/neurad/2024-04-25_062738/config.yml                experiment_config.py:139
           Saving checkpoints to: outputs/unnamed/neurad/2024-04-25_062738/nerfstudio_models              trainer.py:194
Variable resolution, using variable_res_collate
Loading data batch ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:04
Loading data batch ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
Started processes
Setting up evaluation dataset...
Caching all 240 images.
Caching all 40 images.
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=VGG19_Weights.IMAGENET1K_V1`. You can also use `weights=VGG19_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
(viser) No client build found. Building now...
(viser) nodejs is set up!
(node:7871) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 close listeners added to [TLSSocket]. Use emitter.setMaxListeners() to increase limit
(Use `node --trace-warnings ...` to show where the warning was created)
yarn install v1.22.22
[1/4] Resolving packages...
success Already up-to-date.
Done in 0.26s.
yarn run v1.22.22
$ tsc && vite build
vite v5.2.6 building for production...
✓ 2 modules transformed.
x Build failed in 210ms
error during build:
Error: [vite-plugin-eslint] Failed to load config "react-app" to extend from.
Referenced from: /usr/local/lib/python3.10/dist-packages/viser/client/package.json
file: /usr/local/lib/python3.10/dist-packages/viser/client/src/index.tsx
    at configInvalidError (/usr/local/lib/python3.10/dist-packages/viser/client/node_modules/@eslint/eslintrc/dist/eslintrc.cjs:2648:9)
    at ConfigArrayFactory._loadExtendedShareableConfig (/usr/local/lib/python3.10/dist-packages/viser/client/node_modules/@eslint/eslintrc/dist/eslintrc.cjs:3279:23)
    at ConfigArrayFactory._loadExtends (/usr/local/lib/python3.10/dist-packages/viser/client/node_modules/@eslint/eslintrc/dist/eslintrc.cjs:3156:25)
    at ConfigArrayFactory._normalizeObjectConfigDataBody (/usr/local/lib/python3.10/dist-packages/viser/client/node_modules/@eslint/eslintrc/dist/eslintrc.cjs:3095:25)
    at _normalizeObjectConfigDataBody.next (<anonymous>)
    at ConfigArrayFactory._normalizeObjectConfigData (/usr/local/lib/python3.10/dist-packages/viser/client/node_modules/@eslint/eslintrc/dist/eslintrc.cjs:3040:20)
    at _normalizeObjectConfigData.next (<anonymous>)
    at ConfigArrayFactory.loadInDirectory (/usr/local/lib/python3.10/dist-packages/viser/client/node_modules/@eslint/eslintrc/dist/eslintrc.cjs:2886:28)
    at CascadingConfigArrayFactory._loadConfigInAncestors (/usr/local/lib/python3.10/dist-packages/viser/client/node_modules/@eslint/eslintrc/dist/eslintrc.cjs:3871:46)
    at CascadingConfigArrayFactory._loadConfigInAncestors (/usr/local/lib/python3.10/dist-packages/viser/client/node_modules/@eslint/eslintrc/dist/eslintrc.cjs:3890:20)
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
╭─────────────── viser ───────────────╮
│             ╷                       │
│   HTTP      │ http://0.0.0.0:7007   │
│   Websocket │ ws://0.0.0.0:7007     │
│             ╵                       │
╰─────────────────────────────────────╯
[NOTE] Not running eval iterations since only viewer is enabled.
Use --vis {wandb, tensorboard, viewer+wandb, viewer+tensorboard} to run with eval.
No Nerfstudio checkpoint to load, so training from scratch.
Disabled comet/tensorboard/wandb event writers
[06:29:01] Printing max of 10 lines. Set flag --logging.local-writer.max-log-size=0 to disable line        writer.py:449
           wrapping.                                                                                                    
Step (% Done)       Train Iter (time)    ETA (time)           Train Rays / Sec                       
-----------------------------------------------------------------------------------                  
2500 (12.50%)       308.525 ms           1 h, 29 m, 59 s      132.79 K                               
2600 (13.00%)       309.272 ms           1 h, 29 m, 41 s      132.50 K                               
2700 (13.50%)       311.918 ms           1 h, 29 m, 56 s      131.36 K                               
2800 (14.00%)       316.815 ms           1 h, 30 m, 49 s      129.31 K                               
2900 (14.50%)       322.783 ms           1 h, 31 m, 59 s      126.98 K                               
3000 (15.00%)       317.803 ms           1 h, 30 m, 2 s       128.92 K                               
3100 (15.50%)       319.304 ms           1 h, 29 m, 56 s      128.32 K                               
3200 (16.00%)       310.104 ms           1 h, 26 m, 50 s      132.14 K                               
3300 (16.50%)       310.348 ms           1 h, 26 m, 23 s      132.07 K                               
3400 (17.00%)       319.967 ms           1 h, 28 m, 31 s      128.03 K                               
---------------------------------------------------------------------------------------------------- 
Viewer running locally at: http://localhost:7007 (listening on 0.0.0.0)                              

besides, I could not open http://localhost:7007.

I haven't been able to fix this issue yet, and I'm wondering if you have any ideas on what is going wrong. @georghess, if you have time, would you mind taking a look?

atonderski commented 7 months ago

@TurtleZhong that issue was fixed yesterday in the latest version of our viser fork; is it possible that you haven't reinstalled since then? Please try: pip uninstall viser, then pip install git+https://github.com/atonderski/viser.git

Same goes for @amoghskanda; I just tested in a clean environment and it builds correctly.

atonderski commented 7 months ago

If you continue having issues, you can also try installing the original viser from PyPI with pip install --upgrade viser. Some QoL features will be missing, but nothing major. If this works but the solution above does not, please let me know :)

TurtleZhong commented 7 months ago

@TurtleZhong that issue was fixed yesterday in the latest version of our viser fork; is it possible that you haven't reinstalled since then? Please try: pip uninstall viser, then pip install git+https://github.com/atonderski/viser.git

Same goes for @amoghskanda; I just tested in a clean environment and it builds correctly.

Hi, following your suggestion I can now successfully train and visualize via the web page, but when the training process is done, the checkpoint is not in the output folder. Below are my training log and my test.

[04:45:24] Saving config to: outputs/unnamed/neurad/2024-04-26_044524/config.yml                experiment_config.py:139
           Saving checkpoints to: outputs/unnamed/neurad/2024-04-26_044524/nerfstudio_models              trainer.py:194
Variable resolution, using variable_res_collate
Loading data batch ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:04
Loading data batch ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
Started processes
Setting up evaluation dataset...
Caching all 240 images.
Caching all 40 images.
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=VGG19_Weights.IMAGENET1K_V1`. You can also use `weights=VGG19_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
╭─────────────── viser ───────────────╮
│             ╷                       │
│   HTTP      │ http://0.0.0.0:7007   │
│   Websocket │ ws://0.0.0.0:7007     │
│             ╵                       │
╰─────────────────────────────────────╯
[NOTE] Not running eval iterations since only viewer is enabled.
Use --vis {wandb, tensorboard, viewer+wandb, viewer+tensorboard} to run with eval.
No Nerfstudio checkpoint to load, so training from scratch.
Disabled comet/tensorboard/wandb event writers
[04:46:31] Printing max of 10 lines. Set flag --logging.local-writer.max-log-size=0 to disable line        writer.py:449
           wrapping.                                                                                                    
Step (% Done)       Train Iter (time)    ETA (time)           Train Rays / Sec                       
Step (% Done)       Vis Rays / Sec       Train Iter (time)    ETA (time)           Train Rays / Sec      
-------------------------------------------------------------------------------------------------------- 
12300 (61.50%)      310.254 ms           39 m, 49 s           132.20 K                                   
12400 (62.00%)      319.519 ms           40 m, 28 s           128.29 K                                   
12500 (62.50%)      315.160 ms           39 m, 24 s           130.13 K                                   
12501 (62.50%)      2.51 M               313.963 ms           39 m, 14 s           130.68 K              
12600 (63.00%)      318.951 ms           39 m, 20 s           128.57 K                                   
12601 (63.00%)      2.18 M               317.682 ms           39 m, 10 s           129.12 K              
12700 (63.50%)      314.319 ms           38 m, 14 s           130.44 K                                   
Step (% Done)       Vis Rays / Sec       Train Iter (time)    ETA (time)           Train Rays / Sec      
-------------------------------------------------------------------------------------------------------- 
19100 (95.50%)      316.199 ms           4 m, 44 s            129.57 K                                   
19200 (96.00%)      318.434 ms           4 m, 15 s            128.70 K                                   
19300 (96.50%)      319.684 ms           3 m, 44 s            128.19 K                                   
19400 (97.00%)      320.288 ms           3 m, 12 s            127.91 K                                   
19500 (97.50%)      323.657 ms           2 m, 42 s            126.59 K                                   
19600 (98.00%)      323.304 ms           2 m, 9 s             126.78 K                                   
19700 (98.50%)      318.318 ms           1 m, 35 s            128.75 K                                   
19800 (99.00%)      319.041 ms           1 m, 4 s             128.42 K                                   
19900 (99.50%)      321.464 ms           32 s, 467.911 ms     127.49 K                                   
20000 (100.00%)     318.225 ms           318.225 ms           128.78 K                                   
---------------------------------------------------------------------------------------------------- 
Viewer running locally at: http://localhost:7007 (listening on 0.0.0.0)                                  
╭─────────────────────────────── 🎉 Training Finished 🎉 ───────────────────────────────╮
│                        ╷                                                              │
│   Config File          │ outputs/unnamed/neurad/2024-04-26_044524/config.yml          │
│   Checkpoint Directory │ outputs/unnamed/neurad/2024-04-26_044524/nerfstudio_models   │
│                        ╵                                                              │
╰───────────────────────────────────────────────────────────────────────────────────────╯
                                                   Use ctrl+c to quit                     

The output dir is:

root@hil-pc:/workspace/neurad-studio/outputs/unnamed/neurad/2024-04-26_044524/nerfstudio_models# pwd
/workspace/neurad-studio/outputs/unnamed/neurad/2024-04-26_044524/nerfstudio_models
root@hil-pc:/workspace/neurad-studio/outputs/unnamed/neurad/2024-04-26_044524/nerfstudio_models# ll
total 8
drwxr-xr-x 2 root root 4096 Apr 26 04:57 ./
drwxr-xr-x 4 root root 4096 Apr 26 06:46 ../
root@hil-pc:/workspace/neurad-studio/outputs/unnamed/neurad/2024-04-26_044524/nerfstudio_models# 

If the content of this directory is empty, I guess I cannot run commands like "Resume from checkpoint" / "Visualize existing run" from the README. I guess there is something wrong with my settings, but I haven't found it yet.

After the training finished, I used the render options at localhost:7007 to generate 3 keyframes, clicked Generate Command, and copied the command to render a video, but an error was reported. The log is:

root@hil-pc:/workspace/neurad-studio# ns-render camera-path --load-config outputs/unnamed/neurad/2024-04-26_044524/config.yml --camera-path-filename /workspace/neurad-studio/outputs/unnamed/neurad/2024-04-26_044524/camera_paths/2024-04-26-04-46-24.json --output-path renders/2024-04-26_044524/2024-04-26-04-46-24.mp4
Variable resolution, using variable_res_collate
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=VGG19_Weights.IMAGENET1K_V1`. You can also use `weights=VGG19_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
Loading latest checkpoint from load_dir
Traceback (most recent call last):
  File "/usr/local/bin/ns-render", line 8, in <module>
    sys.exit(entrypoint())
  File "/nerfstudio/nerfstudio/scripts/render.py", line 1148, in entrypoint
    tyro.cli(Commands).main()
  File "/nerfstudio/nerfstudio/scripts/render.py", line 453, in main
    _, pipeline, _, _ = eval_setup(
  File "/nerfstudio/nerfstudio/utils/eval_utils.py", line 118, in eval_setup
    checkpoint_path, step = eval_load_checkpoint(config, pipeline, strict_load, ignore_keys)
  File "/nerfstudio/nerfstudio/utils/eval_utils.py", line 59, in eval_load_checkpoint
    load_step = sorted(int(x[x.find("-") + 1 : x.find(".")]) for x in os.listdir(config.load_dir))[-1]
IndexError: list index out of range
root@hil-pc:/workspace/neurad-studio# 

(screenshot attached: Selection_005)

I also tried the Preview Render in the web page, but the picture is very blurry. Here is the video:

https://github.com/georghess/neurad-studio/assets/19700579/bf05b532-db87-476e-b2f9-033a61d0e580

georghess commented 7 months ago

@TurtleZhong the default config should save every 2000 steps (steps_per_save configures this), so it's surprising that the output folder is empty. I'm looking into it.

As for the rendering failing, it's because it cannot find any checkpoints. The preview render will be blurry because the viewer sets the rendering resolution adaptively, and NeuRAD is not capable of real-time HD rendering.
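
For context, the failing line in eval_utils.py just extracts the step number from checkpoint filenames (e.g. step-000002000.ckpt) and picks the largest one. A standalone sketch of that logic, using the checkpoint directory from the run above, shows why an empty nerfstudio_models folder ends in IndexError:

import os

load_dir = "outputs/unnamed/neurad/2024-04-26_044524/nerfstudio_models"

# Pull the step number out of each "step-<number>.ckpt" filename.
steps = sorted(int(x[x.find("-") + 1 : x.find(".")]) for x in os.listdir(load_dir))
# With an empty directory, steps is [] and steps[-1] raises IndexError,
# which is exactly what ns-render reported above.
print(steps[-1] if steps else "no checkpoints found")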

TurtleZhong commented 7 months ago

@TurtleZhong the default config should save every 2000 steps (steps_per_save configures this), so it's surprising that the output folder is empty. I'm looking into it.

As for the rendering failing, it's because it cannot find any checkpoints. The preview render will be blurry because the viewer sets the rendering resolution adaptively, and NeuRAD is not capable of real-time HD rendering.

Yep, I found steps_per_save: 2000 in the config.yaml file. In addition, my output folder is attached: outputs.zip

georghess commented 7 months ago

@TurtleZhong found the bug! We have a metric tracker to avoid saving checkpoints if performance degrades. However, as you only ran the viewer without any eval, we ended up never saving any checkpoint; e80f7280be5ae597cea6a1e031105864d4e35448 fixes this. Sorry for the inconvenience.

TurtleZhong commented 7 months ago

@TurtleZhong found the bug! We have a metric tracker to avoid saving checkpoints if performance degrades. However, as you only ran the viewer without any eval, we ended up never saving any checkpoint; e80f728 fixes this. Sorry for the inconvenience.

It still does not work after 2000 steps. I checked the commit to trainer.py: line 91 is self.latest = metrics.get(self.config.metric, None) if self.config.metric else None, and since self.config.metric = None (line 63), self.latest will be None; and because self.latest is None, the update function just returns.

atonderski commented 7 months ago

@TurtleZhong on current master, if I start a training with --vis=viewer, I get checkpoints every 2k steps as expected. Are you sure you ran with the e80f728 fix? Like you say, self.latest will always be None, but the did_degrade function now has a new check that returns False if self.config.metric is None. The issue was that this previously always returned True when running with only the viewer, since evaluation is disabled in that case: https://github.com/georghess/neurad-studio/blob/e80f7280be5ae597cea6a1e031105864d4e35448/nerfstudio/engine/trainer.py#L512
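
To make the discussion concrete, here is a rough sketch of the checkpoint-gating logic as described in this thread (not the actual implementation in nerfstudio/engine/trainer.py; names and details are simplified and may differ):

from typing import Optional


class MetricTrackerSketch:
    """Simplified stand-in for the metric tracker that gates checkpoint saving."""

    def __init__(self, metric: Optional[str] = None, higher_is_better: bool = True, margin: float = 0.05):
        self.metric = metric
        self.higher_is_better = higher_is_better
        self.margin = margin
        self.latest: Optional[float] = None
        self.best: Optional[float] = None

    def update(self, metrics: dict) -> None:
        # Viewer-only runs produce no eval metrics, so nothing gets tracked.
        if self.metric is None:
            return
        self.latest = metrics.get(self.metric)
        if self.latest is not None and (self.best is None or (self.latest > self.best) == self.higher_is_better):
            self.best = self.latest

    def did_degrade(self) -> bool:
        # The fix discussed above: with no metric (or no eval results) we never
        # report degradation, so checkpoints are still written every steps_per_save steps.
        if self.metric is None or self.latest is None or self.best is None:
            return False
        gap = self.best - self.latest if self.higher_is_better else self.latest - self.best
        return gap > self.margin

With metric=None (the viewer-only case), did_degrade() now returns False, so the trainer's save path is no longer skipped; previously that case was always treated as a degradation and no checkpoints were written.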

TurtleZhong commented 7 months ago

@TurtleZhong on current master, if I start a training with --vis=viewer, I get checkpoints every 2k steps as expected. Are you sure you ran with the e80f728 fix? Like you say, self.latest will always be None, but the did_degrade function now has a new check that returns False if self.config.metric is None. The issue was that this previously always returned True when running with only the viewer, since evaluation is disabled in that case:

https://github.com/georghess/neurad-studio/blob/e80f7280be5ae597cea6a1e031105864d4e35448/nerfstudio/engine/trainer.py#L512

Yes, I am using the latest version of neurad, and I also checked the save_checkpoints function; I think it is correct, but I still get nothing after 2k steps. Btw, I am testing the latest code in the docker env. I checked the config.yaml and found:

checkpoint_saving_tracker: !!python/object:nerfstudio.engine.trainer.MetricTrackerConfig
  _target: &id001 !!python/name:nerfstudio.engine.trainer.MetricTracker ''
  higher_is_better: true
  margin: 0.05
  metric: psnr
data: null

I do not know if the metric should be None, but the default here is psnr. My config is attached: config.zip

atonderski commented 7 months ago

Ah, but that means that your config for some reason was built from the previous neurad version. Are you resuming the training from an old checkpoint? Also, can you double-check that L63 in nerfstudio/engine/trainer.py is metric: Optional[str] = None?

atonderski commented 7 months ago

Wait, are you running in docker? If so, are you mounting in the new version of the repository? Otherwise you might be running whatever version was there when you built the image. Something like: docker run -v $PWD:/nerfstudio ...

TurtleZhong commented 7 months ago

I start the docker container with these mounts:

      - "/home/hil/work_zxl/Nerf/neurad-studio:/workspace/neurad-studio"
      - "/home/hil/work_zxl/Nerf/neurad-studio/data:/data"

Oh, I got it: since the repo has changed, I need to build the docker image again. I found these lines in the Dockerfile:

# Copy nerfstudio folder.
ADD . /nerfstudio

# Install nerfstudio dependencies.
RUN cd /nerfstudio && python3.10 -m pip install --no-cache-dir -e .

I will try it later. Thanks:)

atonderski commented 7 months ago

You shouldn't have to rebuild the docker image every time :) I think the mount should be "/home/hil/work_zxl/Nerf/neurad-studio:/workspace". Then python nerfstudio/scripts/train.py will work.

Alternatively you could mount over the nerfstudio in the container, like "/home/hil/work_zxl/Nerf/neurad-studio:/nerfstudio"

TurtleZhong commented 7 months ago

You shouldn't have to rebuild the docker image every time :) I think the mount should be "/home/hil/work_zxl/Nerf/neurad-studio:/workspace". Then python nerfstudio/scripts/train.py will work.

Alternatively you could mount over the nerfstudio in the container, like "/home/hil/work_zxl/Nerf/neurad-studio:/nerfstudio"

I got the checkpoints after rebuilding the docker image :) I think your method is also OK; I will try it later.

drwxr-xr-x 2 root root       4096 Apr 26 13:54 ./
drwxr-xr-x 3 root root       4096 Apr 26 13:54 ../
-rw-r--r-- 1 root root 1406347498 Apr 26 13:54 step-000002000.ckpt
amoghskanda commented 7 months ago

@TurtleZhong that issue was fixed yesterday in the latest version of our viser fork; is it possible that you haven't reinstalled since then? Please try: pip uninstall viser, then pip install git+https://github.com/atonderski/viser.git

Same goes for @amoghskanda; I just tested in a clean environment and it builds correctly.

Hey, sorry to reopen this issue. I fixed the viser issue by following the above commands, thanks for that. However, I'm not sure if training has started. This is the terminal output:

from pkg_resources import parse_version

(!) Some chunks are larger than 500 kB after minification. Consider:

The "connection opened and closed" messages appear when I follow the HTTP viser link. The page loads, but I don't think training is happening in the background. I'm trying to train locally on an RTX 3090 and have been stuck on the same terminal output for the last hour.

TurtleZhong commented 7 months ago

The "connection opened and closed" messages appear when I follow the HTTP viser link. The page loads, but I don't think training is happening in the background. I'm trying to train locally on an RTX 3090 and have been stuck on the same terminal output for the last hour.

Judging from your log, something looks a bit strange; I am not sure whether your environment and the command you used are correct. I noticed that your port is not the default 7007, so maybe you can try using docker. If everything is OK, you will get a log like this:

[04:46:31] Printing max of 10 lines. Set flag --logging.local-writer.max-log-size=0 to disable line        writer.py:449
           wrapping.                                                                                                    
Step (% Done)       Train Iter (time)    ETA (time)           Train Rays / Sec                       
Step (% Done)       Vis Rays / Sec       Train Iter (time)    ETA (time)           Train Rays / Sec      
-------------------------------------------------------------------------------------------------------- 
12300 (61.50%)      310.254 ms           39 m, 49 s           132.20 K                                   
12400 (62.00%)      319.519 ms           40 m, 28 s           128.29 K                                   
12500 (62.50%)      315.160 ms           39 m, 24 s           130.13 K                                   
12501 (62.50%)      2.51 M               313.963 ms           39 m, 14 s           130.68 K              
12600 (63.00%)      318.951 ms           39 m, 20 s           128.57 K                                   
12601 (63.00%)      2.18 M               317.682 ms           39 m, 10 s           129.12 K              
12700 (63.50%)      314.319 ms           38 m, 14 s           130.44 K                                   
Step (% Done)       Vis Rays / Sec       Train Iter (time)    ETA (time)           Train Rays / Sec      
-------------------------------------------------------------------------------------------------------- 

btw, I also used RTX3090.

amoghskanda commented 7 months ago

The "connection opened and closed" messages appear when I follow the HTTP viser link. The page loads, but I don't think training is happening in the background. I'm trying to train locally on an RTX 3090 and have been stuck on the same terminal output for the last hour.

Judging from your log, something looks a bit strange; I am not sure whether your environment and the command you used are correct. I noticed that your port is not the default 7007, so maybe you can try using docker. If everything is OK, you will get a log like this:

[04:46:31] Printing max of 10 lines. Set flag --logging.local-writer.max-log-size=0 to disable line        writer.py:449
           wrapping.                                                                                                    
Step (% Done)       Train Iter (time)    ETA (time)           Train Rays / Sec                       
Step (% Done)       Vis Rays / Sec       Train Iter (time)    ETA (time)           Train Rays / Sec      
-------------------------------------------------------------------------------------------------------- 
12300 (61.50%)      310.254 ms           39 m, 49 s           132.20 K                                   
12400 (62.00%)      319.519 ms           40 m, 28 s           128.29 K                                   
12500 (62.50%)      315.160 ms           39 m, 24 s           130.13 K                                   
12501 (62.50%)      2.51 M               313.963 ms           39 m, 14 s           130.68 K              
12600 (63.00%)      318.951 ms           39 m, 20 s           128.57 K                                   
12601 (63.00%)      2.18 M               317.682 ms           39 m, 10 s           129.12 K              
12700 (63.50%)      314.319 ms           38 m, 14 s           130.44 K                                   
Step (% Done)       Vis Rays / Sec       Train Iter (time)    ETA (time)           Train Rays / Sec      
-------------------------------------------------------------------------------------------------------- 

btw, I also used RTX3090.

When I run it locally in a conda env, the training doesn't start despite waiting for 1h+. When I try to run it inside a docker container, I get #8 . My memory is 250gb of which 237gb is available. Not sure why I'm running out of memory

TurtleZhong commented 7 months ago

The "connection opened and closed" messages appear when I follow the HTTP viser link. The page loads, but I don't think training is happening in the background. I'm trying to train locally on an RTX 3090 and have been stuck on the same terminal output for the last hour.

Judging from your log, something looks a bit strange; I am not sure whether your environment and the command you used are correct. I noticed that your port is not the default 7007, so maybe you can try using docker. If everything is OK, you will get a log like this:

[04:46:31] Printing max of 10 lines. Set flag --logging.local-writer.max-log-size=0 to disable line        writer.py:449
           wrapping.                                                                                                    
Step (% Done)       Train Iter (time)    ETA (time)           Train Rays / Sec                       
Step (% Done)       Vis Rays / Sec       Train Iter (time)    ETA (time)           Train Rays / Sec      
-------------------------------------------------------------------------------------------------------- 
12300 (61.50%)      310.254 ms           39 m, 49 s           132.20 K                                   
12400 (62.00%)      319.519 ms           40 m, 28 s           128.29 K                                   
12500 (62.50%)      315.160 ms           39 m, 24 s           130.13 K                                   
12501 (62.50%)      2.51 M               313.963 ms           39 m, 14 s           130.68 K              
12600 (63.00%)      318.951 ms           39 m, 20 s           128.57 K                                   
12601 (63.00%)      2.18 M               317.682 ms           39 m, 10 s           129.12 K              
12700 (63.50%)      314.319 ms           38 m, 14 s           130.44 K                                   
Step (% Done)       Vis Rays / Sec       Train Iter (time)    ETA (time)           Train Rays / Sec      
-------------------------------------------------------------------------------------------------------- 

btw, I also used RTX3090.

When I run it locally in a conda env, the training doesn't start despite waiting for 1h+. When I try to run it inside a docker container, I get #8 . My memory is 250gb of which 237gb is available. Not sure why I'm running out of memory

How about trying docker-compose to start the docker container? Here is my yaml file:

version: '3'
services:
  service1:
    container_name: nerf_neurad_studio
    image: neurad_studio:v0.1
    privileged: true
    runtime: nvidia
    network_mode: host
    devices:
      # - "/dev/shm:/dev/shm"
      - "/dev/nvidia0:/dev/nvidia0"
    environment:
      - DISPLAY=$DISPLAY
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=all
      - QT_X11_NO_MITSHM=1 # Fix a bug with QT
      - SDL_VIDEODRIVER=x11
    volumes:
      - "/tmp/.X11-unix:/tmp/.X11-unix:rw"
      - "/dev/shm:/dev/shm"
      - "/home/hil/work_zxl/Nerf/neurad-studio:/workspace/neurad-studio"
      - "/home/hil/work_zxl/Nerf/neurad-studio/data:/data"
      - "/lib/modules:/lib/modules"
    command: tail -f /dev/null

Remember to change the image name and the paths:

      - image: neurad_studio:v0.1
      - *****
      - "/home/hil/work_zxl/Nerf/neurad-studio:/workspace/neurad-studio"
      - "/home/hil/work_zxl/Nerf/neurad-studio/data:/data"
      - ***

After starting the container:

sudo docker exec -it nerf_neurad_studio /bin/bash
python nerfstudio/scripts/train.py neurad pandaset-data
amoghskanda commented 7 months ago

The "connection opened and closed" messages appear when I follow the HTTP viser link. The page loads, but I don't think training is happening in the background. I'm trying to train locally on an RTX 3090 and have been stuck on the same terminal output for the last hour.

Judging from your log, something looks a bit strange; I am not sure whether your environment and the command you used are correct. I noticed that your port is not the default 7007, so maybe you can try using docker. If everything is OK, you will get a log like this:

[04:46:31] Printing max of 10 lines. Set flag --logging.local-writer.max-log-size=0 to disable line        writer.py:449
           wrapping.                                                                                                    
Step (% Done)       Train Iter (time)    ETA (time)           Train Rays / Sec                       
Step (% Done)       Vis Rays / Sec       Train Iter (time)    ETA (time)           Train Rays / Sec      
-------------------------------------------------------------------------------------------------------- 
12300 (61.50%)      310.254 ms           39 m, 49 s           132.20 K                                   
12400 (62.00%)      319.519 ms           40 m, 28 s           128.29 K                                   
12500 (62.50%)      315.160 ms           39 m, 24 s           130.13 K                                   
12501 (62.50%)      2.51 M               313.963 ms           39 m, 14 s           130.68 K              
12600 (63.00%)      318.951 ms           39 m, 20 s           128.57 K                                   
12601 (63.00%)      2.18 M               317.682 ms           39 m, 10 s           129.12 K              
12700 (63.50%)      314.319 ms           38 m, 14 s           130.44 K                                   
Step (% Done)       Vis Rays / Sec       Train Iter (time)    ETA (time)           Train Rays / Sec      
-------------------------------------------------------------------------------------------------------- 

btw, I also used RTX3090.

When I run it locally in a conda env, the training doesn't start despite waiting for 1h+. When I try to run it inside a docker container, I get #8 . My memory is 250gb of which 237gb is available. Not sure why I'm running out of memory

How about trying docker-compose to start the docker container? Here is my yaml file:

version: '3'
services:
  service1:
    container_name: nerf_neurad_studio
    image: neurad_studio:v0.1
    privileged: true
    runtime: nvidia
    network_mode: host
    devices:
      # - "/dev/shm:/dev/shm"
      - "/dev/nvidia0:/dev/nvidia0"
    environment:
      - DISPLAY=$DISPLAY
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=all
      - QT_X11_NO_MITSHM=1 # Fix a bug with QT
      - SDL_VIDEODRIVER=x11
    volumes:
      - "/tmp/.X11-unix:/tmp/.X11-unix:rw"
      - "/dev/shm:/dev/shm"
      - "/home/hil/work_zxl/Nerf/neurad-studio:/workspace/neurad-studio"
      - "/home/hil/work_zxl/Nerf/neurad-studio/data:/data"
      - "/lib/modules:/lib/modules"
    command: tail -f /dev/null

Remember to change the paths:

      - "/home/hil/work_zxl/Nerf/neurad-studio:/workspace/neurad-studio"
      - "/home/hil/work_zxl/Nerf/neurad-studio/data:/data"

Hey, thank you for the yaml file! Unfortunately, the output is the same as when running in the conda env, i.e. the training doesn't start for some reason. It's just stuck after printing the HTTP link and the viser websocket.

atonderski commented 7 months ago

I have not been able to reproduce this issue, which makes it very difficult to address :/ Let's see if the command in #8 helps