Closed TurtleZhong closed 7 months ago
Hi,
You can specify the path to the dataset root as
python nerfstudio/scripts/train.py neurad pandaset-data --data <your-panda-root>
e.g. python nerfstudio/scripts/train.py neurad pandaset-data --data /data/pandaset
Let me know if that resolves your issue. I'll clarify this in the readme.
Hi, I tried the command python nerfstudio/scripts/train.py neurad pandaset-data --data /data/pandaset, and then I got another error:
Traceback (most recent call last):
File "/workspace/neurad-studio/nerfstudio/scripts/train.py", line 278, in <module>
entrypoint()
File "/workspace/neurad-studio/nerfstudio/scripts/train.py", line 269, in entrypoint
main(
File "/workspace/neurad-studio/nerfstudio/scripts/train.py", line 254, in main
launch(
File "/workspace/neurad-studio/nerfstudio/scripts/train.py", line 196, in launch
main_func(local_rank=0, world_size=world_size, config=config)
File "/workspace/neurad-studio/nerfstudio/scripts/train.py", line 106, in train_loop
trainer.setup()
File "/nerfstudio/nerfstudio/engine/trainer.py", line 210, in setup
self.pipeline = self.config.pipeline.setup(
File "/nerfstudio/nerfstudio/configs/base_config.py", line 54, in setup
return self._target(self, **kwargs)
File "/nerfstudio/nerfstudio/pipelines/ad_pipeline.py", line 62, in __init__
super().__init__(config, **kwargs)
File "/nerfstudio/nerfstudio/pipelines/base_pipeline.py", line 257, in __init__
self.datamanager: DataManager = config.datamanager.setup(
File "/nerfstudio/nerfstudio/configs/base_config.py", line 54, in setup
return self._target(self, **kwargs)
File "/nerfstudio/nerfstudio/data/datamanagers/image_lidar_datamanager.py", line 195, in __init__
super().__init__(config, device, test_mode, world_size, local_rank, **kwargs)
File "/nerfstudio/nerfstudio/data/datamanagers/parallel_datamanager.py", line 163, in __init__
self.train_dataparser_outputs: DataparserOutputs = self.dataparser.get_dataparser_outputs(split="train")
File "/nerfstudio/nerfstudio/data/dataparsers/base_dataparser.py", line 171, in get_dataparser_outputs
dataparser_outputs = self._generate_dataparser_outputs(split, **kwargs)
File "/nerfstudio/nerfstudio/data/dataparsers/pandaset_dataparser.py", line 390, in _generate_dataparser_outputs
return super()._generate_dataparser_outputs(split)
File "/nerfstudio/nerfstudio/data/dataparsers/ad_dataparser.py", line 177, in _generate_dataparser_outputs
lidars, pc_filenames = self._get_lidars() if self.config.lidars else (_empty_lidars(), [])
File "/nerfstudio/nerfstudio/data/dataparsers/pandaset_dataparser.py", line 224, in _get_lidars
filename = self.sequence.lidar._data_structure[i]
IndexError: list index out of range
root@hil-pc:/workspace/neurad-studio# ls data/pandaset/
001 003 005 008 012 014 016 018 020 023 027 029 032 034 037 039 041 043 045 047
002 004 006 011 013 015 017 019 021 024 028 030 033 035 038 040 042 044 046
root@hil-pc:/workspace/neurad-studio# pip3 list | grep pandaset
pandaset 0.3.dev0
I will check the file /nerfstudio/nerfstudio/data/dataparsers/pandaset_dataparser.py later.
I found that for the dataset I downloaded from this link, the data structure looks like this:
root@hil-pc:/data/pandaset/001# tree -L 2
.
|-- LICENSE.txt
|-- annotations
| |-- cuboids
| `-- semseg
|-- camera
| |-- back_camera
| |-- front_camera
| |-- front_left_camera
| |-- front_right_camera
| |-- left_camera
| `-- right_camera
|-- lidar
| |-- 00.pkl
---------
| |-- 79.pkl
| |-- poses.json
| `-- timestamps.json
`-- meta
|-- gps.json
`-- timestamps.json
12 directories, 85 files
root@hil-pc:/data/pandaset/001#
while for the dataset described in the pandaset-devkit GitHub repo, the structure looks like this:
.
├── LICENSE.txt
├── annotations
│ ├── cuboids
│ │ ├── 00.pkl.gz
│ │ .
│ │ .
│ │ .
│ │ └── 79.pkl.gz
│ └── semseg // Semantic Segmentation is available for specific scenes
│ ├── 00.pkl.gz
│ .
│ .
│ .
│ ├── 79.pkl.gz
│ └── classes.json
├── camera
│ ├── back_camera
│ │ ├── 00.jpg
│ │ .
│ │ .
│ │ .
│ │ ├── 79.jpg
│ │ ├── intrinsics.json
│ │ ├── poses.json
│ │ └── timestamps.json
│ ├── front_camera
│ │ └── ...
│ ├── front_left_camera
│ │ └── ...
│ ├── front_right_camera
│ │ └── ...
│ ├── left_camera
│ │ └── ...
│ └── right_camera
│ └── ...
├── lidar
│ ├── 00.pkl.gz
│ .
│ .
│ .
│ ├── 79.pkl.gz
│ ├── poses.json
│ └── timestamps.json
└── meta
├── gps.json
└── timestamps.json
I checked the source code; the *.pkl.gz extension is important when loading the dataset. So I actually used the wrong dataset link from your README, which is why the dataset failed to load.
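A simplified sketch (not the actual pandaset-devkit code) of why the extension matters: lidar frames are discovered by their .pkl.gz names, so a folder containing only plain .pkl files yields an empty list and then the IndexError above.

from pathlib import Path

def discover_lidar_frames(sequence_dir: str) -> list:
    # Collect lidar sweeps the way a devkit-style loader would: by extension.
    lidar_dir = Path(sequence_dir) / "lidar"
    return sorted(str(p) for p in lidar_dir.glob("*.pkl.gz"))

frames = discover_lidar_frames("/data/pandaset/001")
print(len(frames))  # 0 with only .pkl files (Kaggle download), 80 after gzipping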
Hey @TurtleZhong, thank you for the update on the dataset. When I visit the PandaSet website (https://pandaset.org/) and click on Download Dataset, it redirects me to a website called Scale (https://scale.com/resources/download/pandaset) where I get "No matching content found". How did you manage to download the dataset from PandaSet?
@georghess if you could take a look at this, I would be glad! Asking because there was an issue while loading the sequence lidar data: filename = self.sequence.lidar._data_structure[i] raises IndexError: list index out of range at line 224 in pandaset_dataparser.py.
I did some debugging and found that self.sequence.lidar._data_structure is an empty list, indicating that the sequence lidar data did not load properly.
Download the dataset from here (https://www.kaggle.com/datasets/usharengaraju/pandaset-dataset/discussion); the address is actually in the README. But you need to modify something, since the dataset structure is a little different from the original PandaSet. If you unzip the dataset and place it in /data/pandaset, run:
gzip -k 001/lidar/*.pkl
gzip -k 001/annotations/cuboids/*.pkl
gzip -k 001/annotations/semseg/*.pkl
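To apply the same conversion to every downloaded sequence at once, something like the following should work (a sketch, assuming the Kaggle layout shown above with plain .pkl files; originals are kept, like gzip -k):

import gzip
import shutil
from pathlib import Path

root = Path("/data/pandaset")
# Covers lidar/*.pkl, annotations/cuboids/*.pkl and annotations/semseg/*.pkl
# in every sequence folder (001, 002, ...).
for pkl in root.rglob("*.pkl"):
    gz = pkl.with_name(pkl.name + ".gz")  # 00.pkl -> 00.pkl.gz
    if not gz.exists():
        with open(pkl, "rb") as src, gzip.open(gz, "wb") as dst:
            shutil.copyfileobj(src, dst)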
After this modification, run python nerfstudio/scripts/train.py neurad pandaset-data --data /data/pandaset and you will get:
Step (% Done) Train Iter (time) ETA (time) Train Rays / Sec
-----------------------------------------------------------------------------------
19100 (95.50%) 318.018 ms 4 m, 46 s 128.84 K
19200 (96.00%) 314.742 ms 4 m, 12 s 130.17 K
19300 (96.50%) 316.493 ms 3 m, 41 s 129.49 K
19400 (97.00%) 312.945 ms 3 m, 8 s 130.94 K
19500 (97.50%) 317.075 ms 2 m, 38 s 129.21 K
19600 (98.00%) 312.256 ms 2 m, 5 s 131.21 K
19700 (98.50%) 318.484 ms 1 m, 35 s 128.63 K
19800 (99.00%) 314.081 ms 1 m, 3 s 130.44 K
19900 (99.50%) 314.238 ms 31 s, 738.033 ms 130.40 K
20000 (100.00%) 313.234 ms 313.234 ms 130.84 K
----------------------------------------------------------------------------------------------------
Viewer running locally at: http://localhost:7007 (listening on 0.0.0.0)
╭─────────────────────────────── 🎉 Training Finished 🎉 ───────────────────────────────╮
│ ╷ │
│ Config File │ outputs/unnamed/neurad/2024-04-25_031752/config.yml │
│ Checkpoint Directory │ outputs/unnamed/neurad/2024-04-25_031752/nerfstudio_models │
│ ╵ │
╰───────────────────────────────────────────────────────────────────────────────────────╯
Use ctrl+c to quit
@georghess I think it is necessary to write more detailed steps in the README.
@TurtleZhong thank you for the above commands; the data preparation worked. However, I think there is an issue with the viewer: AssertionError: Something went wrong! At least one of the client source or build directories should be present.
Did you have to change anything in the config or the websocket port before training? The issue is in viewer/viewer.py line 110, related to the viser server. I'm running the project locally in a conda env, but I'm behind a firewall. Thanks
@amoghskanda we fixed an issue related to that error recently, can you try reinstalling the latest viser?
pip install --upgrade git+https://github.com/atonderski/viser.git
@TurtleZhong thanks for helping out on this! As you've seen, Scale has stopped hosting the dataset (which is where we downloaded it ~1 year ago). I was not aware that the one hosted on Kaggle has a different file format, but I'll update the readme.
Hey @atonderski, thank you for the reply. However, it still throws the same error. I changed the websocket_port to 3028 and the websocket_host to '127.0.0.1', and it doesn't work. I'm behind a corporate proxy and firewall; are there any changes I have to make to the port and the host address? Thanks
At least one of the client source or build directories should be present.
This indicates an error in the installation of the viser package, as you are unable to build the web client. I don't think it can be related to firewalls or proxies (although those things could for sure cause other issues). Do you have a longer traceback?
Also, can you run
ls $(python -c "import viser; print(viser.__path__[0])")/client
in your environment?
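The same check can also be done from Python (a sketch; the src/build subdirectory names are assumptions based on the assertion message, not taken from viser's code):

from pathlib import Path
import viser

client_dir = Path(viser.__path__[0]) / "client"
print(client_dir, "exists:", client_dir.is_dir())
for sub in ("src", "build"):
    # The assertion expects at least one of these to be present.
    print(client_dir / sub, "exists:", (client_dir / sub).is_dir())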
@TurtleZhong we just realized that the pandaset hosted on kaggle only contains around half of the sequences. Are you aware of somewhere we can find the remaining half? :)
I have trained with PandaSet and found 'steps_per_save: 2000' in 'output/.../config.yml', but when I look into the output folder, outputs/unnamed/neurad/2024-04-25_062738/nerfstudio_models/ is empty, like:
nerfstudio_models git:(main) ✗ ll
total 0
➜ nerfstudio_models git:(main) ✗
while the train.py log is:
[06:27:38] Saving config to: outputs/unnamed/neurad/2024-04-25_062738/config.yml experiment_config.py:139
Saving checkpoints to: outputs/unnamed/neurad/2024-04-25_062738/nerfstudio_models trainer.py:194
Variable resolution, using variable_res_collate
Loading data batch ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:04
Loading data batch ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
Started processes
Setting up evaluation dataset...
Caching all 240 images.
Caching all 40 images.
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=VGG19_Weights.IMAGENET1K_V1`. You can also use `weights=VGG19_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
(viser) No client build found. Building now...
(viser) nodejs is set up!
(node:7871) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 close listeners added to [TLSSocket]. Use emitter.setMaxListeners() to increase limit
(Use `node --trace-warnings ...` to show where the warning was created)
yarn install v1.22.22
[1/4] Resolving packages...
success Already up-to-date.
Done in 0.26s.
yarn run v1.22.22
$ tsc && vite build
vite v5.2.6 building for production...
✓ 2 modules transformed.
x Build failed in 210ms
error during build:
Error: [vite-plugin-eslint] Failed to load config "react-app" to extend from.
Referenced from: /usr/local/lib/python3.10/dist-packages/viser/client/package.json
file: /usr/local/lib/python3.10/dist-packages/viser/client/src/index.tsx
at configInvalidError (/usr/local/lib/python3.10/dist-packages/viser/client/node_modules/@eslint/eslintrc/dist/eslintrc.cjs:2648:9)
at ConfigArrayFactory._loadExtendedShareableConfig (/usr/local/lib/python3.10/dist-packages/viser/client/node_modules/@eslint/eslintrc/dist/eslintrc.cjs:3279:23)
at ConfigArrayFactory._loadExtends (/usr/local/lib/python3.10/dist-packages/viser/client/node_modules/@eslint/eslintrc/dist/eslintrc.cjs:3156:25)
at ConfigArrayFactory._normalizeObjectConfigDataBody (/usr/local/lib/python3.10/dist-packages/viser/client/node_modules/@eslint/eslintrc/dist/eslintrc.cjs:3095:25)
at _normalizeObjectConfigDataBody.next (<anonymous>)
at ConfigArrayFactory._normalizeObjectConfigData (/usr/local/lib/python3.10/dist-packages/viser/client/node_modules/@eslint/eslintrc/dist/eslintrc.cjs:3040:20)
at _normalizeObjectConfigData.next (<anonymous>)
at ConfigArrayFactory.loadInDirectory (/usr/local/lib/python3.10/dist-packages/viser/client/node_modules/@eslint/eslintrc/dist/eslintrc.cjs:2886:28)
at CascadingConfigArrayFactory._loadConfigInAncestors (/usr/local/lib/python3.10/dist-packages/viser/client/node_modules/@eslint/eslintrc/dist/eslintrc.cjs:3871:46)
at CascadingConfigArrayFactory._loadConfigInAncestors (/usr/local/lib/python3.10/dist-packages/viser/client/node_modules/@eslint/eslintrc/dist/eslintrc.cjs:3890:20)
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
╭─────────────── viser ───────────────╮
│ ╷ │
│ HTTP │ http://0.0.0.0:7007 │
│ Websocket │ ws://0.0.0.0:7007 │
│ ╵ │
╰─────────────────────────────────────╯
[NOTE] Not running eval iterations since only viewer is enabled.
Use --vis {wandb, tensorboard, viewer+wandb, viewer+tensorboard} to run with eval.
No Nerfstudio checkpoint to load, so training from scratch.
Disabled comet/tensorboard/wandb event writers
[06:29:01] Printing max of 10 lines. Set flag --logging.local-writer.max-log-size=0 to disable line writer.py:449
wrapping.
Step (% Done) Train Iter (time) ETA (time) Train Rays / Sec
-----------------------------------------------------------------------------------
2500 (12.50%) 308.525 ms 1 h, 29 m, 59 s 132.79 K
2600 (13.00%) 309.272 ms 1 h, 29 m, 41 s 132.50 K
2700 (13.50%) 311.918 ms 1 h, 29 m, 56 s 131.36 K
2800 (14.00%) 316.815 ms 1 h, 30 m, 49 s 129.31 K
2900 (14.50%) 322.783 ms 1 h, 31 m, 59 s 126.98 K
3000 (15.00%) 317.803 ms 1 h, 30 m, 2 s 128.92 K
3100 (15.50%) 319.304 ms 1 h, 29 m, 56 s 128.32 K
3200 (16.00%) 310.104 ms 1 h, 26 m, 50 s 132.14 K
3300 (16.50%) 310.348 ms 1 h, 26 m, 23 s 132.07 K
3400 (17.00%) 319.967 ms 1 h, 28 m, 31 s 128.03 K
----------------------------------------------------------------------------------------------------
Viewer running locally at: http://localhost:7007 (listening on 0.0.0.0)
Besides, I could not open http://localhost:7007.
@TurtleZhong we just realized that the pandaset hosted on kaggle only contains around half of the sequences. Are you aware of somewhere we can find the remaining half? :)
I am sorry, but I do not know. I wonder if you could share a zip file, or maybe we could use some other dataset.
At least one of the client source or build directories should be present.
This indicates an error in the installation of the viser package, as you are unable to build the web client. I don't think it can be related to firewalls or proxies (although those things could for sure cause other issues). Do you have a longer traceback? Also, can you run ls $(python -c "import viser; print(viser.__path__[0])")/client in your environment?
When I run the above command, no such file exists, i.e. viser/client is not present. Here's the error trail:
Traceback (most recent call last):
File "nerfstudio/scripts/train.py", line 278, in
@TurtleZhong I put PandaSet at https://huggingface.co/datasets/georghess/pandaset/tree/main, updating download instructions later.
I haven't been able to fix the issue above (the empty checkpoint folder and the failing viser client build) yet. I'm wondering if you have any ideas about what is going wrong. @georghess, if you have time, would you mind helping me take a look at this?
@TurtleZhong that issue was fixed yesterday on the latest version of our viser fork, is it possible that you haven't reinstalled since yesterday? Please try with:
pip uninstall viser
pip install git+https://github.com/atonderski/viser.git
Same goes for @amoghskanda; I just tested in a clean environment and it builds correctly.
If you continue having issues, you can also try installing the original viser from PyPI:
pip install --upgrade viser
Some QoL features will be missing, but nothing major.
If this works, but not the solution above, please let me know :)
Hi, following your suggestion I can now successfully train and visualize via the web page, but when the training process is done, the checkpoint is not in the output folder. Below are my training log and my test.
[04:45:24] Saving config to: outputs/unnamed/neurad/2024-04-26_044524/config.yml experiment_config.py:139
Saving checkpoints to: outputs/unnamed/neurad/2024-04-26_044524/nerfstudio_models trainer.py:194
Variable resolution, using variable_res_collate
Loading data batch ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:04
Loading data batch ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
Started processes
Setting up evaluation dataset...
Caching all 240 images.
Caching all 40 images.
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=VGG19_Weights.IMAGENET1K_V1`. You can also use `weights=VGG19_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
╭─────────────── viser ───────────────╮
│ ╷ │
│ HTTP │ http://0.0.0.0:7007 │
│ Websocket │ ws://0.0.0.0:7007 │
│ ╵ │
╰─────────────────────────────────────╯
[NOTE] Not running eval iterations since only viewer is enabled.
Use --vis {wandb, tensorboard, viewer+wandb, viewer+tensorboard} to run with eval.
No Nerfstudio checkpoint to load, so training from scratch.
Disabled comet/tensorboard/wandb event writers
[04:46:31] Printing max of 10 lines. Set flag --logging.local-writer.max-log-size=0 to disable line writer.py:449
wrapping.
Step (% Done) Train Iter (time) ETA (time) Train Rays / Sec
Step (% Done) Vis Rays / Sec Train Iter (time) ETA (time) Train Rays / Sec
--------------------------------------------------------------------------------------------------------
12300 (61.50%) 310.254 ms 39 m, 49 s 132.20 K
12400 (62.00%) 319.519 ms 40 m, 28 s 128.29 K
12500 (62.50%) 315.160 ms 39 m, 24 s 130.13 K
12501 (62.50%) 2.51 M 313.963 ms 39 m, 14 s 130.68 K
12600 (63.00%) 318.951 ms 39 m, 20 s 128.57 K
12601 (63.00%) 2.18 M 317.682 ms 39 m, 10 s 129.12 K
12700 (63.50%) 314.319 ms 38 m, 14 s 130.44 K
Step (% Done) Vis Rays / Sec Train Iter (time) ETA (time) Train Rays / Sec
--------------------------------------------------------------------------------------------------------
19100 (95.50%) 316.199 ms 4 m, 44 s 129.57 K
19200 (96.00%) 318.434 ms 4 m, 15 s 128.70 K
19300 (96.50%) 319.684 ms 3 m, 44 s 128.19 K
19400 (97.00%) 320.288 ms 3 m, 12 s 127.91 K
19500 (97.50%) 323.657 ms 2 m, 42 s 126.59 K
19600 (98.00%) 323.304 ms 2 m, 9 s 126.78 K
19700 (98.50%) 318.318 ms 1 m, 35 s 128.75 K
19800 (99.00%) 319.041 ms 1 m, 4 s 128.42 K
19900 (99.50%) 321.464 ms 32 s, 467.911 ms 127.49 K
20000 (100.00%) 318.225 ms 318.225 ms 128.78 K
----------------------------------------------------------------------------------------------------
Viewer running locally at: http://localhost:7007 (listening on 0.0.0.0)
╭─────────────────────────────── 🎉 Training Finished 🎉 ───────────────────────────────╮
│ ╷ │
│ Config File │ outputs/unnamed/neurad/2024-04-26_044524/config.yml │
│ Checkpoint Directory │ outputs/unnamed/neurad/2024-04-26_044524/nerfstudio_models │
│ ╵ │
╰───────────────────────────────────────────────────────────────────────────────────────╯
Use ctrl+c to quit
The output dir is:
root@hil-pc:/workspace/neurad-studio/outputs/unnamed/neurad/2024-04-26_044524/nerfstudio_models# pwd
/workspace/neurad-studio/outputs/unnamed/neurad/2024-04-26_044524/nerfstudio_models
root@hil-pc:/workspace/neurad-studio/outputs/unnamed/neurad/2024-04-26_044524/nerfstudio_models# ll
total 8
drwxr-xr-x 2 root root 4096 Apr 26 04:57 ./
drwxr-xr-x 4 root root 4096 Apr 26 06:46 ../
root@hil-pc:/workspace/neurad-studio/outputs/unnamed/neurad/2024-04-26_044524/nerfstudio_models#
If this directory is empty, I guess I cannot run commands like "Resume from checkpoint / visualize existing run" from the README.
I suspect there is something wrong with my settings, but I haven't found it yet.
After training finished, I used Render Options via localhost:7007 to generate 3 keyframes, clicked Generate Command, and copied the command to render a video, but an error was reported. The log is:
root@hil-pc:/workspace/neurad-studio# ns-render camera-path --load-config outputs/unnamed/neurad/2024-04-26_044524/config.yml --camera-path-filename /workspace/neurad-studio/outputs/unnamed/neurad/2024-04-26_044524/camera_paths/2024-04-26-04-46-24.json --output-path renders/2024-04-26_044524/2024-04-26-04-46-24.mp4
Variable resolution, using variable_res_collate
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=VGG19_Weights.IMAGENET1K_V1`. You can also use `weights=VGG19_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
Loading latest checkpoint from load_dir
Traceback (most recent call last):
File "/usr/local/bin/ns-render", line 8, in <module>
sys.exit(entrypoint())
File "/nerfstudio/nerfstudio/scripts/render.py", line 1148, in entrypoint
tyro.cli(Commands).main()
File "/nerfstudio/nerfstudio/scripts/render.py", line 453, in main
_, pipeline, _, _ = eval_setup(
File "/nerfstudio/nerfstudio/utils/eval_utils.py", line 118, in eval_setup
checkpoint_path, step = eval_load_checkpoint(config, pipeline, strict_load, ignore_keys)
File "/nerfstudio/nerfstudio/utils/eval_utils.py", line 59, in eval_load_checkpoint
load_step = sorted(int(x[x.find("-") + 1 : x.find(".")]) for x in os.listdir(config.load_dir))[-1]
IndexError: list index out of range
root@hil-pc:/workspace/neurad-studio#
I also tried the Preview Render in the web page, but the picture is very blurry. Here is the video:
https://github.com/georghess/neurad-studio/assets/19700579/bf05b532-db87-476e-b2f9-033a61d0e580
@TurtleZhong the default config should save every 2000 steps (steps_per_save configures this), so it's surprising that the output folder is empty. I'm looking into it.
As for the rendering failing, it's because it cannot find any checkpoints. And the preview render will be blurry because the viewer sets the rendering resolution adaptively, and NeuRAD is not capable of real-time HD rendering.
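For reference, the IndexError from ns-render above boils down to this: with an empty nerfstudio_models/ folder, the sorted list of checkpoint steps is empty and taking its last element fails (a minimal sketch mirroring the line shown in the traceback):

import os

def latest_checkpoint_step(load_dir: str) -> int:
    # Checkpoint files are expected to be named like step-000002000.ckpt.
    steps = sorted(int(f[f.find("-") + 1 : f.find(".")]) for f in os.listdir(load_dir))
    return steps[-1]  # raises IndexError when the directory is empty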
Yep, I found steps_per_save: 2000 in the config.yml file. In addition, my output folder is attached:
outputs.zip
@TurtleZhong found the bug! We have a metric tracker to avoid saving checkpoints if performance degrades. However, as you only ran the viewer without any eval, we ended up never saving any checkpoint; e80f7280be5ae597cea6a1e031105864d4e35448 fixes this. Sorry for the inconvenience.
It still does not work after 2000 steps. I checked the commit to trainer.py: line 91 is self.latest = metrics.get(self.config.metric, None) if self.config.metric else None. Since self.config.metric = None (line 63), I think self.latest = None, and since self.latest = None, the update function will return early.
@TurtleZhong on current master, if I start a training with --vis=viewer, I get checkpoints every 2k steps as expected. Are you sure you ran with the e80f728 fix?
Like you say, self.latest will always be None, but the did_degrade function now has a new check that returns False if self.config.metric is None.
The issue was that this previously always returned True when running with only the viewer, since evaluation is disabled in that case: https://github.com/georghess/neurad-studio/blob/e80f7280be5ae597cea6a1e031105864d4e35448/nerfstudio/engine/trainer.py#L512
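For anyone following along, here is a minimal sketch (not the actual trainer.py code) of the guard described above; with viewer-only training there is no eval, so metric stays None and did_degrade must not block checkpoint saving:

class MetricTracker:
    def __init__(self, metric=None, higher_is_better=True, margin=0.05):
        self.metric = metric          # None when running with only the viewer
        self.higher_is_better = higher_is_better
        self.margin = margin
        self.latest = None
        self.best = None

    def update(self, metrics: dict) -> None:
        self.latest = metrics.get(self.metric, None) if self.metric else None
        if self.latest is not None and (self.best is None or self._better(self.latest, self.best)):
            self.best = self.latest

    def _better(self, a, b) -> bool:
        return a > b if self.higher_is_better else a < b

    def did_degrade(self) -> bool:
        # The fix: without an eval metric there is nothing to compare against,
        # so never report a degradation (previously this path returned True).
        if self.metric is None or self.latest is None or self.best is None:
            return False
        gap = self.best - self.latest if self.higher_is_better else self.latest - self.best
        return gap > self.margin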
Yes, I use the latest version of neurad, and I also checked the save_checkpoints function; I think it is correct, but I still got nothing after 2k steps. Btw, I tested the latest code in the docker env. I have checked the config.yml and found:
checkpoint_saving_tracker: !!python/object:nerfstudio.engine.trainer.MetricTrackerConfig
  _target: &id001 !!python/name:nerfstudio.engine.trainer.MetricTracker ''
  higher_is_better: true
  margin: 0.05
  metric: psnr
data: null
I do not know whether the metric should be None, but the default here is psnr. config.zip
Ah, but that means that your config is, for some reason, built from the previous neurad version. Are you resuming the training from an old checkpoint? Also, can you double-check that L63 in nerfstudio/engine/trainer.py is metric: Optional[str] = None?
Wait, are you running in docker? If so, are you mounting in the new version of the repository? Otherwise you might be running whatever version was there when you built the image. Something like:
docker run -v $PWD:/nerfstudio ...
I start the docker container like:
- "/home/hil/work_zxl/Nerf/neurad-studio:/workspace/neurad-studio"
- "/home/hil/work_zxl/Nerf/neurad-studio/data:/data"
Oh, I got it. Since the repo has changed, I need to build the docker image again. I found these lines in the Dockerfile:
# Copy nerfstudio folder.
ADD . /nerfstudio
# Install nerfstudio dependencies.
RUN cd /nerfstudio && python3.10 -m pip install --no-cache-dir -e .
I will try it later. Thanks:)
You shouldn't have to rebuild the dockerfile every time :) I think it should be "/home/hil/work_zxl/Nerf/neurad-studio:/workspace". Then python nerfstudio/scripts/train.py will work.
Alternatively you could mount over the nerfstudio in the container, like "/home/hil/work_zxl/Nerf/neurad-studio:/nerfstudio"
I got the checkpoints after rebuilding the docker image :) I think your method is also OK; I will try it later.
drwxr-xr-x 2 root root 4096 Apr 26 13:54 ./
drwxr-xr-x 3 root root 4096 Apr 26 13:54 ../
-rw-r--r-- 1 root root 1406347498 Apr 26 13:54 step-000002000.ckpt
Hey, sorry to reopen this issue. I fixed the viser issue by following the reinstall commands above, thanks for that. However, I'm not sure whether training has started. This is the terminal output:
from pkg_resources import parse_version
(!) Some chunks are larger than 500 kB after minification. Consider:
The connection opened and closed is from when I followed the HTTP viser link. That page loads, but I don't think training is happening in the background. I'm trying to train locally on an RTX 3090 and have been stuck on the same terminal output for the last hour.
Judging from your log, it looks a bit strange to me. I am not sure whether your environment and the command you used are correct. I noticed that your port is not the default 7007; maybe you can try using docker. If everything is OK, you will get a log like this:
[04:46:31] Printing max of 10 lines. Set flag --logging.local-writer.max-log-size=0 to disable line writer.py:449
wrapping.
Step (% Done) Train Iter (time) ETA (time) Train Rays / Sec
Step (% Done) Vis Rays / Sec Train Iter (time) ETA (time) Train Rays / Sec
--------------------------------------------------------------------------------------------------------
12300 (61.50%) 310.254 ms 39 m, 49 s 132.20 K
12400 (62.00%) 319.519 ms 40 m, 28 s 128.29 K
12500 (62.50%) 315.160 ms 39 m, 24 s 130.13 K
12501 (62.50%) 2.51 M 313.963 ms 39 m, 14 s 130.68 K
12600 (63.00%) 318.951 ms 39 m, 20 s 128.57 K
12601 (63.00%) 2.18 M 317.682 ms 39 m, 10 s 129.12 K
12700 (63.50%) 314.319 ms 38 m, 14 s 130.44 K
Step (% Done) Vis Rays / Sec Train Iter (time) ETA (time) Train Rays / Sec
--------------------------------------------------------------------------------------------------------
btw, I also used RTX3090.
When I run it locally in a conda env, the training doesn't start despite waiting for 1h+. When I try to run it inside a docker container, I get #8. My memory is 250 GB, of which 237 GB is available, so I'm not sure why I'm running out of memory.
How about trying docker-compose to start the docker container? Here is my yaml file:
version: '3'
services:
  service1:
    container_name: nerf_neurad_studio
    image: neurad_studio:v0.1
    privileged: true
    runtime: nvidia
    network_mode: host
    devices:
      # - "/dev/shm:/dev/shm"
      - "/dev/nvidia0:/dev/nvidia0"
    environment:
      - DISPLAY=$DISPLAY
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=all
      - QT_X11_NO_MITSHM=1 # Fix a bug with QT
      - SDL_VIDEODRIVER=x11
    volumes:
      - "/tmp/.X11-unix:/tmp/.X11-unix:rw"
      - "/dev/shm:/dev/shm"
      - "/home/hil/work_zxl/Nerf/neurad-studio:/workspace/neurad-studio"
      - "/home/hil/work_zxl/Nerf/neurad-studio/data:/data"
      - "/lib/modules:/lib/modules"
    command: tail -f /dev/null
Remember to change the paths:
- image: neurad_studio:v0.1
- *****
- "/home/hil/work_zxl/Nerf/neurad-studio:/workspace/neurad-studio"
- "/home/hil/work_zxl/Nerf/neurad-studio/data:/data"
- ***
After starting the container, run:
sudo docker exec -it nerf_neurad_studio /bin/bash
python nerfstudio/scripts/train.py neurad pandaset-data
Hey, thank you for the yaml file! Unfortunately, the output is the same as when running in the conda env, i.e. the training doesn't start for some reason. It's just stuck after printing the HTTP link and the viser websocket.
I have not been able to reproduce this issue, which makes it very difficult to address :/ Let's see if the command in #8 helps
Hi, following your steps I have built the docker image successfully, and I just want to train the model with PandaSet. I start the container using docker-compose; the neurad_docker.yaml file is:
Then I start the container and exec the command in the docker container:
I got an error like this:
My file structure in the container looks like this:
Additional context: since I do not know the right path for PandaSet, I also created a folder named data/pandaset in /workspace/neurad-studio, like: