NVlabs / neuralangelo

Official implementation of "Neuralangelo: High-Fidelity Neural Surface Reconstruction" (CVPR 2023)
https://research.nvidia.com/labs/dir/neuralangelo/

BrokenPipeError occurs with wandb option #84

Open ZirongChan opened 1 year ago

ZirongChan commented 1 year ago

Thx for the great work.

I was running the toy_experiment with the lego data, except that I ran COLMAP and then the data generation scripts on a dataset I downloaded a long time ago for NeRF. So the data I used might differ from the one provided in this repo.

The toy experiment ran well, although the background remains. The issue occurs when I use the --wandb option. The data loaded, and the communication with the wandb website was fine too. Everything went fine until the 4th epoch of training, when an error was raised: "BrokenPipeError: [Errno 32] Broken Pipe". The traceback led to torch.distributed.elastic.multiprocessing.errors.ChildFailedError.

Does anyone have the same issue?

chenhsuanlin commented 1 year ago

Hi @ZirongChan could you post the full error log? Thanks!

ZirongChan commented 1 year ago

thx for your reply, @chenhsuanlin

Of course, the following is the log output in terminal:
torchrun --nproc_per_node=1 train.py --logdir=logs/nerf_synthesis/lego_wandb --config=projects/neuralangelo/configs/custom/lego.yaml --show_pbar --wandb
Training with 1 GPUs.
Using random seed 0
Make folder logs/nerf_synthesis/lego_wandb


I will also paste the content of the "debug-internal.log" file: 2023-08-29 02:25:55,696 INFO StreamThr :64464 [internal.py:wandb_internal():86] W&B internal server running at pid: 64464, started at: 2023-08-29 02:25:55.694795 2023-08-29 02:25:55,697 DEBUG HandlerThread:64464 [handler.py:handle_request():144] handle_request: status 2023-08-29 02:25:55,700 INFO WriterThread:64464 [datastore.py:open_for_write():85] open: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/run-ox3ubipj.wandb 2023-08-29 02:25:55,702 DEBUG SenderThread:64464 [sender.py:send():379] send: header 2023-08-29 02:25:55,775 DEBUG SenderThread:64464 [sender.py:send():379] send: run 2023-08-29 02:26:00,776 DEBUG HandlerThread:64464 [handler.py:handle_request():144] handle_request: keepalive 2023-08-29 02:26:05,778 DEBUG HandlerThread:64464 [handler.py:handle_request():144] handle_request: keepalive 2023-08-29 02:26:08,240 INFO SenderThread:64464 [dir_watcher.py:init():211] watching files in: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files 2023-08-29 02:26:08,240 INFO SenderThread:64464 [sender.py:_start_run_threads():1121] run started: ox3ubipj with start time 1693275955.695242 2023-08-29 02:26:08,240 DEBUG SenderThread:64464 [sender.py:send_request():406] send_request: summary_record 2023-08-29 02:26:08,240 DEBUG HandlerThread:64464 [handler.py:handle_request():144] handle_request: status_report 2023-08-29 02:26:08,242 INFO SenderThread:64464 [sender.py:_save_file():1375] saving file wandb-summary.json with policy end 2023-08-29 02:26:08,248 DEBUG HandlerThread:64464 [handler.py:handle_request():144] handle_request: check_version 2023-08-29 02:26:08,249 DEBUG SenderThread:64464 [sender.py:send_request():406] send_request: check_version 2023-08-29 02:26:09,243 INFO Thread-12 :64464 [dir_watcher.py:_on_file_created():271] file/dir created: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/wandb-summary.json 2023-08-29 02:26:12,803 DEBUG 
HandlerThread:64464 [handler.py:handle_request():144] handle_request: run_start 2023-08-29 02:26:12,808 DEBUG HandlerThread:64464 [system_info.py:init():31] System info init 2023-08-29 02:26:12,808 DEBUG HandlerThread:64464 [system_info.py:init():46] System info init done 2023-08-29 02:26:12,808 INFO HandlerThread:64464 [system_monitor.py:start():181] Starting system monitor 2023-08-29 02:26:12,808 INFO SystemMonitor:64464 [system_monitor.py:_start():145] Starting system asset monitoring threads 2023-08-29 02:26:12,808 INFO HandlerThread:64464 [system_monitor.py:probe():201] Collecting system info 2023-08-29 02:26:12,809 INFO SystemMonitor:64464 [interfaces.py:start():190] Started cpu monitoring 2023-08-29 02:26:12,810 INFO SystemMonitor:64464 [interfaces.py:start():190] Started disk monitoring 2023-08-29 02:26:12,810 INFO SystemMonitor:64464 [interfaces.py:start():190] Started gpu monitoring 2023-08-29 02:26:12,811 INFO SystemMonitor:64464 [interfaces.py:start():190] Started memory monitoring 2023-08-29 02:26:12,812 INFO SystemMonitor:64464 [interfaces.py:start():190] Started network monitoring 2023-08-29 02:26:12,839 DEBUG HandlerThread:64464 [system_info.py:probe():195] Probing system 2023-08-29 02:26:12,845 DEBUG HandlerThread:64464 [system_info.py:_probe_git():180] Probing git 2023-08-29 02:26:12,861 DEBUG HandlerThread:64464 [system_info.py:_probe_git():188] Probing git done 2023-08-29 02:26:12,861 DEBUG HandlerThread:64464 [system_info.py:probe():240] Probing system done 2023-08-29 02:26:12,861 DEBUG HandlerThread:64464 [system_monitor.py:probe():210] {'os': 'Linux-3.10.0-1127.el7.x86_64-x86_64-with-glibc2.27', 'python': '3.9.16', 'heartbeatAt': '2023-08-29T02:26:12.839871', 'startedAt': '2023-08-29T02:25:55.673713', 'docker': None, 'cuda': None, 'args': ('--logdir=logs/nerf_synthesis/lego_wandb', '--config=projects/neuralangelo/configs/custom/lego.yaml', '--show_pbar', '--wandb'), 'state': 'running', 'program': 
'/zhanghuaimin01/Workspace/neuralangelo/train.py', 'codePath': 'train.py', 'git': {'remote': 'https://github.com/NVlabs/neuralangelo.git', 'commit': 'f740c689808537074d46a9d56f8bec2c0be93c7e'}, 'email': 'hansen@orbbec.com', 'root': '/zhanghuaimin01/Workspace/neuralangelo', 'host': 'a0q74jbdps9k3-0', 'username': 'root', 'executable': '/root/anaconda3/envs/neuralangelo/bin/python', 'cpu_count': 64, 'cpu_count_logical': 128, 'cpu_freq': {'current': 3.399999999999993, 'min': 800.0, 'max': 3400.0}, 'cpu_freq_per_core': [{'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 
'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 
3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 
'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}, {'current': 3.4, 'min': 800.0, 'max': 3400.0}], 'disk': {'total': 3539.7356147766113, 'used': 744.4414558410645}, 'gpu': 'NVIDIA A100-SXM4-40GB', 'gpu_count': 1, 'gpu_devices': [{'name': 'NVIDIA A100-SXM4-40GB', 'memory_total': 42505273344}], 'memory': {'total': 1007.3468627929688}} 2023-08-29 02:26:12,862 INFO HandlerThread:64464 [system_monitor.py:probe():211] Finished collecting system info 2023-08-29 02:26:12,862 INFO HandlerThread:64464 [system_monitor.py:probe():214] Publishing system info 2023-08-29 02:26:12,862 DEBUG HandlerThread:64464 [system_info.py:_save_pip():51] Saving list of pip packages installed into the current environment 2023-08-29 02:26:12,864 DEBUG HandlerThread:64464 [system_info.py:_save_pip():67] Saving pip packages done 2023-08-29 02:26:12,865 DEBUG HandlerThread:64464 [system_info.py:_save_conda():74] Saving list of conda packages installed into the current environment 2023-08-29 02:26:13,245 INFO Thread-12 :64464 [dir_watcher.py:_on_file_created():271] file/dir created: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/conda-environment.yaml 2023-08-29 02:26:13,245 INFO Thread-12 :64464 [dir_watcher.py:_on_file_created():271] file/dir created: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/requirements.txt 2023-08-29 02:26:19,113 DEBUG HandlerThread:64464 [system_info.py:_save_conda():86] Saving conda packages done 2023-08-29 02:26:19,117 INFO HandlerThread:64464 [system_monitor.py:probe():216] 
Finished publishing system info 2023-08-29 02:26:19,121 DEBUG HandlerThread:64464 [handler.py:handle_request():144] handle_request: status_report 2023-08-29 02:26:19,121 DEBUG HandlerThread:64464 [handler.py:handle_request():144] handle_request: keepalive 2023-08-29 02:26:19,121 DEBUG HandlerThread:64464 [handler.py:handle_request():144] handle_request: status_report 2023-08-29 02:26:19,122 DEBUG SenderThread:64464 [sender.py:send():379] send: files 2023-08-29 02:26:19,123 INFO SenderThread:64464 [sender.py:_save_file():1375] saving file wandb-metadata.json with policy now 2023-08-29 02:26:19,127 DEBUG HandlerThread:64464 [handler.py:handle_request():144] handle_request: stop_status 2023-08-29 02:26:19,127 DEBUG SenderThread:64464 [sender.py:send_request():406] send_request: stop_status 2023-08-29 02:26:19,249 INFO Thread-12 :64464 [dir_watcher.py:_on_file_modified():288] file/dir modified: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/conda-environment.yaml 2023-08-29 02:26:19,249 INFO Thread-12 :64464 [dir_watcher.py:_on_file_created():271] file/dir created: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/wandb-metadata.json 2023-08-29 02:26:19,825 DEBUG SenderThread:64464 [sender.py:send():379] send: telemetry 2023-08-29 02:26:19,825 DEBUG SenderThread:64464 [sender.py:send():379] send: config 2023-08-29 02:26:19,825 DEBUG SenderThread:64464 [sender.py:send():379] send: telemetry 2023-08-29 02:26:20,194 INFO wandb-upload_0:64464 [upload_job.py:push():131] Uploaded file /tmp/tmpostfmuwswandb/t9dpsmie-wandb-metadata.json 2023-08-29 02:26:20,250 INFO Thread-12 :64464 [dir_watcher.py:_on_file_created():271] file/dir created: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/output.log 2023-08-29 02:26:22,252 INFO Thread-12 :64464 [dir_watcher.py:_on_file_modified():288] file/dir modified: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/output.log 2023-08-29 02:26:23,828 
DEBUG HandlerThread:64464 [handler.py:handle_request():144] handle_request: status_report 2023-08-29 02:26:28,261 INFO Thread-12 :64464 [dir_watcher.py:_on_file_modified():288] file/dir modified: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/output.log 2023-08-29 02:26:29,300 DEBUG HandlerThread:64464 [handler.py:handle_request():144] handle_request: status_report 2023-08-29 02:26:30,265 INFO Thread-12 :64464 [dir_watcher.py:_on_file_modified():288] file/dir modified: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/config.yaml 2023-08-29 02:26:32,598 DEBUG HandlerThread:64464 [handler.py:handle_request():144] handle_request: partial_history 2023-08-29 02:26:32,599 DEBUG HandlerThread:64464 [handler.py:handle_request():144] handle_request: partial_history 2023-08-29 02:26:34,127 DEBUG HandlerThread:64464 [handler.py:handle_request():144] handle_request: stop_status 2023-08-29 02:26:34,127 DEBUG SenderThread:64464 [sender.py:send_request():406] send_request: stop_status 2023-08-29 02:26:34,352 INFO Thread-12 :64464 [dir_watcher.py:_on_file_modified():288] file/dir modified: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/output.log 2023-08-29 02:26:35,353 DEBUG SenderThread:64464 [sender.py:send():379] send: files 2023-08-29 02:26:35,353 INFO SenderThread:64464 [sender.py:_save_file():1375] saving file media/images/val/vis/rgb_target_0_39365b313d2292dd4eba.png with policy now 2023-08-29 02:26:35,353 DEBUG HandlerThread:64464 [handler.py:handle_request():144] handle_request: status_report 2023-08-29 02:26:35,429 DEBUG SenderThread:64464 [sender.py:send():379] send: files 2023-08-29 02:26:35,429 INFO SenderThread:64464 [sender.py:_save_file():1375] saving file media/images/val/vis/rgb_render_0_0221db07b3e58dba4e0c.png with policy now 2023-08-29 02:26:35,456 INFO Thread-12 :64464 [dir_watcher.py:_on_file_modified():288] file/dir modified: 
logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/output.log 2023-08-29 02:26:35,456 INFO Thread-12 :64464 [dir_watcher.py:_on_file_created():271] file/dir created: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/media/images/val/vis/rgb_target_0_39365b313d2292dd4eba.png 2023-08-29 02:26:35,466 INFO Thread-12 :64464 [dir_watcher.py:_on_file_created():271] file/dir created: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/media/images/val/vis/rgb_render_0_0221db07b3e58dba4e0c.png 2023-08-29 02:26:35,480 INFO Thread-12 :64464 [dir_watcher.py:_on_file_created():271] file/dir created: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/media 2023-08-29 02:26:35,493 INFO Thread-12 :64464 [dir_watcher.py:_on_file_created():271] file/dir created: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/media/images/val 2023-08-29 02:26:35,493 INFO Thread-12 :64464 [dir_watcher.py:_on_file_created():271] file/dir created: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/media/images/val/vis 2023-08-29 02:26:35,493 INFO Thread-12 :64464 [dir_watcher.py:_on_file_created():271] file/dir created: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/media/images 2023-08-29 02:26:35,514 DEBUG SenderThread:64464 [sender.py:send():379] send: files 2023-08-29 02:26:35,529 INFO SenderThread:64464 [sender.py:_save_file():1375] saving file media/images/val/vis/rgb_error_0_26876becb829857eefc2.png with policy now 2023-08-29 02:26:35,577 DEBUG SenderThread:64464 [sender.py:send():379] send: files 2023-08-29 02:26:35,577 INFO SenderThread:64464 [sender.py:_save_file():1375] saving file media/images/val/vis/normal_0_7e1272e24357100780aa.png with policy now 2023-08-29 02:26:35,650 DEBUG SenderThread:64464 [sender.py:send():379] send: files 2023-08-29 02:26:35,650 INFO SenderThread:64464 [sender.py:_save_file():1375] saving file 
media/images/val/vis/inv_depth_0_fbd0daa20de9191a79ae.png with policy now 2023-08-29 02:26:35,719 DEBUG SenderThread:64464 [sender.py:send():379] send: files 2023-08-29 02:26:35,719 INFO SenderThread:64464 [sender.py:_save_file():1375] saving file media/images/val/vis/opacity_0_5608dd29b12d5fdfbf5d.png with policy now 2023-08-29 02:26:35,719 DEBUG HandlerThread:64464 [handler.py:handle_request():144] handle_request: partial_history 2023-08-29 02:26:36,147 INFO wandb-upload_1:64464 [upload_job.py:push():131] Uploaded file /tmp/tmpostfmuwswandb/gedptt0m-media/images/val/vis/rgb_render_0_0221db07b3e58dba4e0c.png 2023-08-29 02:26:36,497 INFO Thread-12 :64464 [dir_watcher.py:_on_file_created():271] file/dir created: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/media/images/val/vis/rgb_error_0_26876becb829857eefc2.png 2023-08-29 02:26:36,497 INFO Thread-12 :64464 [dir_watcher.py:_on_file_created():271] file/dir created: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/media/images/val/vis/inv_depth_0_fbd0daa20de9191a79ae.png 2023-08-29 02:26:36,502 INFO Thread-12 :64464 [dir_watcher.py:_on_file_created():271] file/dir created: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/media/images/val/vis/normal_0_7e1272e24357100780aa.png 2023-08-29 02:26:36,515 INFO Thread-12 :64464 [dir_watcher.py:_on_file_created():271] file/dir created: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/media/images/val/vis/opacity_0_5608dd29b12d5fdfbf5d.png 2023-08-29 02:26:36,523 INFO Thread-12 :64464 [dir_watcher.py:_on_file_modified():288] file/dir modified: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/media/images/val/vis 2023-08-29 02:26:36,543 INFO wandb-upload_0:64464 [upload_job.py:push():131] Uploaded file /tmp/tmpostfmuwswandb/fsbkbi5x-media/images/val/vis/rgb_target_0_39365b313d2292dd4eba.png 2023-08-29 02:26:36,647 INFO wandb-upload_5:64464 
[upload_job.py:push():131] Uploaded file /tmp/tmpostfmuwswandb/6sdhpbpw-media/images/val/vis/opacity_0_5608dd29b12d5fdfbf5d.png 2023-08-29 02:26:36,935 INFO wandb-upload_4:64464 [upload_job.py:push():131] Uploaded file /tmp/tmpostfmuwswandb/bdsumm31-media/images/val/vis/inv_depth_0_fbd0daa20de9191a79ae.png 2023-08-29 02:26:37,015 INFO wandb-upload_3:64464 [upload_job.py:push():131] Uploaded file /tmp/tmpostfmuwswandb/l7ppndyh-media/images/val/vis/normal_0_7e1272e24357100780aa.png 2023-08-29 02:26:37,315 INFO wandb-upload_1:64464 [upload_job.py:push():131] Uploaded file /tmp/tmpostfmuwswandb/z8ms38am-media/images/val/vis/rgb_render_0_0221db07b3e58dba4e0c.png 2023-08-29 02:26:37,442 INFO wandb-upload_2:64464 [upload_job.py:push():131] Uploaded file /tmp/tmpostfmuwswandb/abedw3ot-media/images/val/vis/rgb_error_0_26876becb829857eefc2.png 2023-08-29 02:26:37,877 INFO wandb-upload_5:64464 [upload_job.py:push():131] Uploaded file /tmp/tmpostfmuwswandb/zj43bbvk-media/images/val/vis/inv_depth_0_fbd0daa20de9191a79ae.png 2023-08-29 02:26:37,912 INFO wandb-upload_0:64464 [upload_job.py:push():131] Uploaded file /tmp/tmpostfmuwswandb/83lwwq0x-media/images/val/vis/rgb_target_0_39365b313d2292dd4eba.png 2023-08-29 02:26:38,418 INFO wandb-upload_1:64464 [upload_job.py:push():131] Uploaded file /tmp/tmpostfmuwswandb/w8geu1ah-media/images/val/vis/opacity_0_5608dd29b12d5fdfbf5d.png 2023-08-29 02:26:38,436 INFO wandb-upload_3:64464 [upload_job.py:push():131] Uploaded file /tmp/tmpostfmuwswandb/fhn1dp2f-media/images/val/vis/normal_0_7e1272e24357100780aa.png 2023-08-29 02:26:38,560 INFO Thread-12 :64464 [dir_watcher.py:_on_file_modified():288] file/dir modified: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/output.log 2023-08-29 02:26:39,096 INFO wandb-upload_2:64464 [upload_job.py:push():131] Uploaded file /tmp/tmpostfmuwswandb/tpg4ibp2-media/images/val/vis/rgb_error_0_26876becb829857eefc2.png 2023-08-29 02:26:40,374 DEBUG HandlerThread:64464 
[handler.py:handle_request():144] handle_request: status_report 2023-08-29 02:26:40,700 INFO Thread-12 :64464 [dir_watcher.py:_on_file_modified():288] file/dir modified: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/output.log 2023-08-29 02:26:42,775 INFO Thread-12 :64464 [dir_watcher.py:_on_file_modified():288] file/dir modified: logs/nerf_synthesis/lego_wandb/wandb/run-20230829_022555-ox3ubipj/files/output.log


Could this be a problem with my internet connection? Or, as an alternative, can I use TensorBoard to visualize the training? Thx.

chenhsuanlin commented 1 year ago

This seems to be an issue on the W&B side. We don't support TensorBoard right now, but PRs are welcome if you'd like to help add this support.
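
If the goal is only to keep training alive while the logging side is investigated, one possible stopgap is to guard the logging call. The snippet below is a minimal sketch and not code from this repo: `safe_log` is a hypothetical helper, and it assumes the logger is called as `log_fn(metrics, step=step)`, which is how `wandb.log` can be called. It hides the symptom and does not fix whatever closed the pipe.

```python
import logging

logger = logging.getLogger("train")


def safe_log(log_fn, metrics, step):
    """Hypothetical helper: call log_fn(metrics, step=step) and survive a broken pipe.

    Returns True if the call succeeded and False if a BrokenPipeError was caught.
    """
    try:
        log_fn(metrics, step=step)
        return True
    except BrokenPipeError as exc:
        # The logging backend went away; warn and let training continue.
        logger.warning("Metric logging failed at step %d: %s", step, exc)
        return False
```

In a training loop this could be used as `safe_log(wandb.log, {"loss": loss_value}, step)`, assuming wandb is already initialized. Running W&B in offline mode (for example with the `WANDB_MODE=offline` environment variable) might also be worth trying to rule out the network connection, though whether it avoids this particular error is untested here.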

ZirongChan commented 1 year ago

It seems to be an issue related to distributed training. I also tried setting the --single_gpu flag, but it did not work. The error log was still about distributed training, as in the line "raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError:".

Is there a switch somewhere else in the code that I can use to make sure distributed training is disabled?

chenhsuanlin commented 1 year ago

To disable distributed training, you can run python train.py --single_gpu ... instead of torchrun --nproc_per_node=1 train.py ... and it should work.
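
For context, when a script is launched through torchrun, the launcher reports any crash in a worker process as ChildFailedError, even with a single process, so that name in the traceback does not by itself mean several GPUs were in use. The snippet below is a rough sketch of how a training script could decide whether to set up distributed training. It is not the actual train.py logic: `parse_args` and `should_init_distributed` are hypothetical names, and the only facts relied on are that torchrun exports `LOCAL_RANK` and `WORLD_SIZE` to each worker.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse a hypothetical --single_gpu switch, mirroring the flag named above."""
    parser = argparse.ArgumentParser(description="Launcher sketch (illustrative only).")
    parser.add_argument("--single_gpu", action="store_true",
                        help="Skip distributed setup and run in one process.")
    return parser.parse_args(argv)


def should_init_distributed(single_gpu, environ=None):
    """Return True only if a launcher started this process and single-GPU mode is off."""
    environ = os.environ if environ is None else environ
    if single_gpu:
        return False
    # torchrun sets these variables for every worker it spawns.
    return "LOCAL_RANK" in environ and "WORLD_SIZE" in environ
```

With a check like this, `python train.py --single_gpu ...` never touches the distributed code path, while `torchrun --nproc_per_node=1 train.py ...` still runs under the elastic launcher and therefore still surfaces worker failures as ChildFailedError.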