Loping151 / ForPlane

Neural LerPlane Representations for Fast 4D Reconstruction of Deformable Tissues

Training time for code reproduction #6

Closed Epsilon404 closed 7 months ago

Epsilon404 commented 7 months ago

Hi, thanks for your work! I'm very interested in it.

While reproducing your results, however, I've run into some issues:

  1. There are two datasets in the EndoNeRF sample data, cutting_tissues_twice and pulling_soft_tissues. Which one was used for the results in Table 1 of the paper, and how was the error term (e.g., ±1.387) computed? [screenshot of Table 1]

  2. The "9k" and "32k" in the paper represent the number of iterations, but in the config files example-9k.py and example-32k.py the parameter num_steps are set to 1200 and 3600, and they are also the actual iteration number during training. So I'm wondering why and what is the relation between 9k and 1200.

  3. I set up the environment on Ubuntu 20.04 and trained on a single RTX 3090. Here are my environment dependencies:
dependencies:
  - _libgcc_mutex=0.1=main
  - _openmp_mutex=5.1=1_gnu
  - ca-certificates=2023.08.22=h06a4308_0
  - ld_impl_linux-64=2.38=h1181459_1
  - libffi=3.4.4=h6a678d5_0
  - libgcc-ng=11.2.0=h1234567_1
  - libgomp=11.2.0=h1234567_1
  - libstdcxx-ng=11.2.0=h1234567_1
  - ncurses=6.4=h6a678d5_0
  - openssl=3.0.11=h7f8727e_2
  - pip=23.3=py39h06a4308_0
  - python=3.9.18=h955ad1f_0
  - readline=8.2=h5eee18b_0
  - setuptools=68.0.0=py39h06a4308_0
  - sqlite=3.41.2=h5eee18b_0
  - tk=8.6.12=h1ccaba5_0
  - wheel=0.41.2=py39h06a4308_0
  - xz=5.4.2=h5eee18b_0
  - zlib=1.2.13=h5eee18b_0
  - pip:
      - absl-py==2.0.0
      - addict==2.4.0
      - ansi2html==1.8.0
      - asttokens==2.4.1
      - attrs==23.1.0
      - blinker==1.6.3
      - cachetools==5.3.2
      - certifi==2023.7.22
      - charset-normalizer==3.3.2
      - click==8.1.7
      - comm==0.1.4
      - configargparse==1.7
      - contourpy==1.1.1
      - cycler==0.12.1
      - dash==2.14.1
      - dash-core-components==2.0.0
      - dash-html-components==2.0.0
      - dash-table==5.0.0
      - decorator==5.1.1
      - exceptiongroup==1.1.3
      - executing==2.0.1
      - fastjsonschema==2.18.1
      - flask==3.0.0
      - fonttools==4.43.1
      - google-auth==2.23.4
      - google-auth-oauthlib==1.1.0
      - grpcio==1.59.2
      - idna==3.4
      - imageio==2.31.6
      - imageio-ffmpeg==0.4.9
      - importlib-metadata==6.8.0
      - importlib-resources==6.1.0
      - ipython==8.17.2
      - ipywidgets==8.1.1
      - itsdangerous==2.1.2
      - jax==0.4.19
      - jedi==0.19.1
      - jinja2==3.1.2
      - joblib==1.3.2
      - jsonschema==4.19.2
      - jsonschema-specifications==2023.7.1
      - jupyter-core==5.5.0
      - jupyterlab-widgets==3.0.9
      - kiwisolver==1.4.5
      - lazy-loader==0.3
      - lightning-utilities==0.9.0
      - lpips==0.1.4
      - markdown==3.5.1
      - markdown-it-py==3.0.0
      - markupsafe==2.1.3
      - matplotlib==3.8.1
      - matplotlib-inline==0.1.6
      - mdurl==0.1.2
      - ml-dtypes==0.3.1
      - nbformat==5.7.0
      - nerfacc==0.5.0
      - nest-asyncio==1.5.8
      - networkx==3.2.1
      - ninja==1.11.1.1
      - numpy==1.26.1
      - nvidia-cuda-nvrtc-cu11==11.7.99
      - nvidia-cuda-runtime-cu11==11.7.99
      - nvidia-cudnn-cu11==8.5.0.96
      - nvidia-ml-py==12.535.133
      - nvitop==1.3.1
      - oauthlib==3.2.2
      - open3d==0.17.0
      - opencv-python==4.8.1.78
      - opt-einsum==3.3.0
      - packaging==23.2
      - pandas==2.1.2
      - parso==0.8.3
      - pexpect==4.8.0
      - pillow==10.0.1
      - platformdirs==3.11.0
      - plotly==5.18.0
      - prompt-toolkit==3.0.39
      - protobuf==4.23.4
      - psutil==5.9.6
      - ptyprocess==0.7.0
      - pure-eval==0.2.2
      - pyasn1==0.5.0
      - pyasn1-modules==0.3.0
      - pybind11==2.11.1
      - pygments==2.16.1
      - pyparsing==3.1.1
      - pyquaternion==0.9.9
      - python-dateutil==2.8.2
      - pytorch-msssim==1.0.0
      - pytz==2023.3.post1
      - pyyaml==6.0.1
      - referencing==0.30.2
      - requests==2.31.0
      - requests-oauthlib==1.3.1
      - retrying==1.3.4
      - rich==13.6.0
      - rpds-py==0.10.6
      - rsa==4.9
      - scikit-image==0.22.0
      - scikit-learn==1.3.2
      - scipy==1.11.3
      - six==1.16.0
      - stack-data==0.6.3
      - tenacity==8.2.3
      - tensorboard==2.15.0
      - tensorboard-data-server==0.7.2
      - termcolor==2.3.0
      - threadpoolctl==3.2.0
      - tifffile==2023.9.26
      - tinycudann==1.7
      - torch==1.13.1
      - torchmetrics==1.2.0
      - torchvision==0.14.1
      - tqdm==4.66.1
      - traitlets==5.13.0
      - typing-extensions==4.8.0
      - tzdata==2023.3
      - urllib3==2.0.7
      - wcwidth==0.2.9
      - werkzeug==3.0.1
      - widgetsnbextension==4.0.9
      - zipp==3.17.0

But after training with the unchanged parameters (num_steps of 1200 or 3600) on the two EndoNeRF datasets, the training took much longer than reported in Table 1 of the paper: [training-time screenshot]. Could you please tell me how to reproduce the ~3 min result, or are there any additional environment or config settings needed before training?

Thank you very much!

Loping151 commented 7 months ago
  1. In Table 1, we report metrics as mean ± standard deviation over all 6 EndoNeRF datasets. We can plot them like this (the plot below is just an illustration and has nothing to do with the LerPlane article): [example plot]
  2. In EndoNeRF, one batch is 2048 rays. The actual iteration count is computed as {iterations in config: 1200} * {our batch size: 16384} / 2048 = 9600 ≈ 9k, and likewise 3600 * 16384 / 2048 = 28800 ≈ 32k. The numbers are not precise; they only serve to distinguish the two configs (see the sketch after this reply).
  3. Sorry about the 3 min discrepancy. I noticed that the uploaded version was the one I used to make the video in the README, which means it saves an image every 6 iterations and keeps pausing training. You should see your GPU usage drop frequently, along with an accumulation warning. You can remove lines 98 and 99 in video_trainer.py, or just git pull. It should be:

    def train_step(self, data: Dict[str, Union[int, torch.Tensor]], **kwargs):
        scale_ok = super().train_step(data, **kwargs)
        # The commented-out blocks below rendered a validation frame every few
        # iterations to produce the README video; they are what slowed training
        # down and should stay disabled for timing runs.
        # if self.global_step in [900, 1800, 2700, 3600]:
        #     self.validate(self.global_step)
        # if self.global_step % 6 == 0:
        #     self.validate(video_frame=self.global_step // 6)
        if self.global_step == self.isg_step:
            self.train_dataset.enable_isg()
            raise StopIteration  # whenever we change the dataset
        if self.global_step == self.ist_step:
            self.train_dataset.switch_isg2ist()
            raise StopIteration  # whenever we change the dataset

        return scale_ok

    I've tested it on an RTX 4090 GPU and it takes 2:47 for the 9k config; it should take around 3 min on an RTX 3090.
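
To make the arithmetic in items 1 and 2 concrete, here is a minimal sketch (the PSNR values are made-up placeholders for illustration, not results from the paper; the batch sizes are the ones quoted above, and the helper name is hypothetical):

    import numpy as np

    # Item 1: Table 1 reports mean ± standard deviation over the 6 EndoNeRF
    # datasets. The per-dataset scores below are placeholders, not real results.
    psnr = np.array([34.1, 33.2, 35.0, 32.8, 34.5, 33.9])
    print(f"{psnr.mean():.3f} ± {psnr.std():.3f}")

    # Item 2: one ForPlane batch is 16384 rays vs. 2048 rays per EndoNeRF batch,
    # so config num_steps maps to an EndoNeRF-equivalent iteration count.
    def endonerf_equivalent_iters(num_steps, batch_size=16384, endonerf_batch=2048):
        return num_steps * batch_size // endonerf_batch

    print(endonerf_equivalent_iters(1200))  # 9600  -> labeled "9k"
    print(endonerf_equivalent_iters(3600))  # 28800 -> labeled "32k" (approximate)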

Epsilon404 commented 7 months ago

Thank you for your reply! After the adjustment, the training time is now as expected.

But I still have a question: how can I get all 6 EndoNeRF datasets to run a thorough test? I only have the two sample datasets from EndoNeRF. I'd appreciate it if you could tell me how.

By the way, I have another question: is the code for extracting a 3D mesh or point-cloud model included in the repository? Could you please give me some tips for extracting the 3D model to show results like those in the article? [figure from the paper] Thanks!

Loping151 commented 7 months ago

Well, for the dataset, you may need to contact the authors of EndoNeRF for the full release; surgical data is sensitive, as you know. As for the point cloud, we use the same evaluation protocol as EndoNeRF, so please follow the instructions in the EndoNeRF repository.
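
For reference, EndoNeRF-style point-cloud extraction essentially back-projects a rendered RGB frame and its depth map through the camera intrinsics. A minimal sketch with Open3D (already in your dependency list) could look like the following; the file paths, depth scale, and intrinsic values are illustrative assumptions, not values from this repo or the EndoNeRF scripts:

    import open3d as o3d

    # Back-project one rendered frame into a point cloud.
    # All paths and camera parameters here are placeholders.
    color = o3d.io.read_image("renders/frame_0000_rgb.png")
    depth = o3d.io.read_image("renders/frame_0000_depth.png")  # 16-bit depth map

    rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
        color, depth,
        depth_scale=1000.0,   # assumes depth is stored in millimeters
        depth_trunc=3.0,      # discard points farther than 3 m
        convert_rgb_to_intensity=False,
    )

    # Pinhole intrinsics of the endoscope camera:
    # width, height, fx, fy, cx, cy (placeholder values).
    intrinsic = o3d.camera.PinholeCameraIntrinsic(640, 512, 569.5, 569.5, 320.0, 256.0)

    pcd = o3d.geometry.PointCloud.create_from_rgbd_image(rgbd, intrinsic)
    o3d.io.write_point_cloud("frame_0000.ply", pcd)

For the actual metrics and masking details, follow the evaluation scripts in the EndoNeRF repository.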

Epsilon404 commented 7 months ago

Okay, I see. Thanks again!