drprojects / DeepViewAgg

[CVPR'22 Best Paper Finalist] Official PyTorch implementation of the method presented in "Learning Multi-View Aggregation In the Wild for Large-Scale 3D Semantic Segmentation"

data config for training scannet dataset #5

Closed ruomingzhai closed 2 years ago

ruomingzhai commented 2 years ago

Dear Dr. Robert,

I wrote scannet_training.py following your scripts/train_scannet. I found that the TRAINING config yaml (line 19) is set to "s3dis_benchmark/sparseconv3d_rgb-pretrained-0", so the program crashed because of a nonexistent "fold" parameter in "conf/training/s3dis_benchmark/sparseconv3d_rgb-pretrained-0.yaml".

I then changed TRAINING to "scannet_benchmark/minkowski-pretrained-0" instead.

But it still crashed while initializing the Trainer, in "self._model: BaseModel = instantiate_model(copy.deepcopy(self._cfg), self._dataset)". I tracked it down to the "resolve_model" function, and it seems the dataset class does not get the "feature_dimension" attribute from the config file.

```python
def resolve_model(model_config, dataset, tested_task):
    """ Parses the model config and evaluates any expression that may contain constants """
    # placeholders to substitute
    constants = {
        "FEAT": max(dataset.feature_dimension, 0),  # 4
        "TASK": tested_task,
        "N_CLS": dataset.num_classes if hasattr(dataset, "num_classes") else None,
    }
```
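For context, a minimal sketch of what this kind of substitution boils down to (resolve_placeholders and the sample expressions are hypothetical, not the project's actual code): FEAT is looked up on the dataset and evaluated wherever the model config references it.

```python
# Hypothetical sketch of placeholder substitution, not the project's actual code.
# Constants such as FEAT are resolved once from the dataset, then evaluated
# inside any config expression that references them.
def resolve_placeholders(expr, constants):
    """Evaluate a config expression such as 'FEAT + 3' against known constants."""
    return eval(expr, {"__builtins__": {}}, constants)

constants = {"FEAT": 4, "TASK": "segmentation", "N_CLS": 20}
print(resolve_placeholders("FEAT + 3", constants))  # 7
print(resolve_placeholders("N_CLS", constants))     # 20
```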

The program entered an infinite loop while looking up this attribute and then crashed.

The Debug Variables viewer shows the following error message:

```
File "/root/share/code/DeepViewAgg/torch_points3d/datasets/segmentation/scannet.py", line 1111, in indices
  print("indices " + str(len(self)))
File "/root/.local/lib/python3.7/site-packages/torch_geometric/data/dataset.py", line 176, in __len__
  return len(self.indices())
File "/root/share/code/DeepViewAgg/torch_points3d/datasets/segmentation/scannet.py", line 1111, in indices
  print("indices " + str(len(self)))
File "/root/.local/lib/python3.7/site-packages/torch_geometric/data/dataset.py", line 176, in __len__
  return len(self.indices())
File "/root/share/code/DeepViewAgg/torch_points3d/datasets/segmentation/scannet.py", line 1108, in indices
  version = pyg.__version__.split('.')
RecursionError: maximum recursion depth exceeded while calling a Python object
```
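The traceback alternates between indices() and __len__, each calling the other. A minimal standalone sketch of this mutual-recursion pattern (BrokenDataset is hypothetical, just to illustrate the failure mode):

```python
# Hypothetical minimal reproduction of the indices()/__len__ cycle seen in
# the traceback: when _indices is None, indices() falls back to len(self),
# but __len__ delegates right back to indices(), so neither ever returns.
class BrokenDataset:
    def __init__(self):
        self._indices = None

    def indices(self):
        return range(len(self)) if self._indices is None else self._indices

    def __len__(self):
        return len(self.indices())

try:
    len(BrokenDataset())
except RecursionError:
    print("RecursionError: maximum recursion depth exceeded")
```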

I think "FEAT" should get its value from the model config file, or feature_dimension should be added to the data config file.

So I hope I can get some help from you.

drprojects commented 2 years ago

Hi @ruomingzhai, thanks for the feedback. There was indeed an error on my part there: the training config of scripts/train_scannet should be set to TRAINING="scannet_benchmark/minkowski-pretrained-pyramid-0".

You almost had it right, except the model used by default in scripts/train_scannet involves a pyramid pooling scheme on the 2D feature maps, which calls for dedicated learning rates. That is the difference between scannet_benchmark/minkowski-pretrained-pyramid-0 and scannet_benchmark/minkowski-pretrained-0.

I fixed those in the last commit and things work fine on my end. Can you please try again with these changes and let me know if it works for you?

ruomingzhai commented 2 years ago

> Hi @ruomingzhai, thanks for the feedback. There was indeed an error on my part there: the training config of scripts/train_scannet should be set to TRAINING="scannet_benchmark/minkowski-pretrained-pyramid-0".
>
> You almost had it right, except the model used by default in scripts/train_scannet involves a pyramid pooling scheme on the 2D feature maps, which calls for dedicated learning rates. That is the difference between scannet_benchmark/minkowski-pretrained-pyramid-0 and scannet_benchmark/minkowski-pretrained-0.
>
> I fixed those in the last commit and things work fine on my end. Can you please try again with these changes and let me know if it works for you?

I have revised the code following your instructions. However, it still crashed on the "feature_dimension" attribute, as I mentioned above, because it entered an infinite loop trying to find this attribute in the data config. I am not sure what this attribute is or where its value should come from in the config file.

drprojects commented 2 years ago

Yes, it seems the model or the dataset could not properly parse the config files to recover the input feature dimension.

Assuming you have pulled the latest commit to your repo, can you first make sure you are able to run scripts/train_scannet.sh with the default configuration, at least until a few training steps have passed? I need to make sure this works first. If not, please share the full error traceback here.

ruomingzhai commented 2 years ago

I ran your scripts/train_scannet.sh after re-downloading the project code.

Note that the whole ScanNet dataset is too large for my device, so I only processed three scenes for train/val/test to go through the whole procedure.

The error message and data folder are shown below:

```
wandb: WARNING Saving files without folders. If you want to preserve sub directories pass base_path to wandb.save, i.e. wandb.save("/mnt/folder/file.h5", base_path="/mnt")
(warning repeated three times)
Error executing job with overrides: ['data=segmentation/multimodal/scannet-sparse', 'models=segmentation/multimodal/sparseconv3d', 'model_name=Res16UNet34-PointPyramid-early-ade20k-interpolate', 'task=segmentation', 'training=scannet_benchmark/minkowski-pretrained-pyramid-0', 'lr_scheduler=exponential', 'eval_frequency=10', 'data.dataroot=/root/share/code/DeepViewAgg/data/scannet', 'training.cuda=0', 'training.batch_size=3', 'training.epochs=300', 'training.num_workers=4', 'training.optim.base_lr=0.1', 'training.wandb.log=True', 'training.wandb.name=My_awesome_ScanNet_experiment', 'tracker_options.make_submission=False', 'training.checkpoint_dir=']
Traceback (most recent call last):
  File "train.py", line 13, in main
    trainer = Trainer(cfg)
  File "/root/share/code/DeepViewAgg/torch_points3d/trainer.py", line 46, in __init__
    self._initialize_trainer()
  File "/root/share/code/DeepViewAgg/torch_points3d/trainer.py", line 94, in _initialize_trainer
    copy.deepcopy(self._cfg), self._dataset)
  File "/root/share/code/DeepViewAgg/torch_points3d/models/model_factory.py", line 25, in instantiate_model
    resolve_model(model_config, dataset, task)
  File "/root/share/code/DeepViewAgg/torch_points3d/utils/model_building_utils/model_definition_resolver.py", line 10, in resolve_model
    "FEAT": max(dataset.feature_dimension, 0),
  File "/root/share/code/DeepViewAgg/torch_points3d/datasets/base_dataset.py", line 53, in wrapper
    result = func(self, *args, **kwargs)
  File "/root/share/code/DeepViewAgg/torch_points3d/datasets/base_dataset.py", line 456, in feature_dimension
    if self.train_dataset:
  File "/root/.local/lib/python3.7/site-packages/torch_geometric/data/dataset.py", line 176, in __len__
    return len(self.indices())
  File "/root/share/code/DeepViewAgg/torch_points3d/datasets/segmentation/scannet.py", line 1077, in indices
    return range(len(self)) if self._indices is None else self._indices

  (thousands of identical frames...)

  File "/root/.local/lib/python3.7/site-packages/torch_geometric/data/dataset.py", line 176, in __len__
    return len(self.indices())
RecursionError: maximum recursion depth exceeded
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
```

(screenshot of the data folder)

I only revised the dataroot (screenshot of the config attached).

drprojects commented 2 years ago

I have seen this type of error before, when the installed torch-geometric version is too high. Can you please check that you are using torch-geometric==1.7.2?
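A quick way to verify the pin, sketched below (version_ok is a hypothetical helper; in a live environment you would pass it torch_geometric.__version__):

```python
# Sketch: compare an installed torch-geometric version string against the
# 1.7.2 pin. Pure string logic, so it runs without torch-geometric installed.
def version_ok(installed, pinned="1.7.2"):
    """Return True only when the installed version matches the pin exactly."""
    return tuple(installed.split(".")[:3]) == tuple(pinned.split("."))

print(version_ok("1.7.2"))  # True
print(version_ok("2.0.4"))  # False
```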

Have you been able to install the project using install.sh without any issues? It is important that you do. Several dependencies are quite fragile. In particular, torch-points3d (on top of which this project is built) does not support the most recent versions of torch-geometric due to some backward incompatibilities in torch-geometric, so we need to be extra careful with the torch and torch-geometric versions we use here.

ruomingzhai commented 2 years ago

Thank you for the tips! My device already has CUDA installed, so I just looked at install.sh for the required packages instead of running it.

Maybe the problem comes from some of the packages in install.sh, as you said.

By the way, I found it quite odd that ${cu110} on line 121 seems not to be a variable defined in advance but a literal string, so the install fails with the following error (screenshot attached).

I will follow install.sh to reinstall these packages later. Thank you so much!

drprojects commented 2 years ago

Hi @ruomingzhai, thanks for the feedback!

Yes, unfortunately, the installation of some dependencies is quite fragile, which is why the install script is so specific about the pytorch and pytorch geometric versions. torch-points3d does not support pytorch geometric 2+ yet, but when it does, I hope to move everything to more recent versions of pytorch geometric and torch.

Anyway, good catch about lines 121-122: they should no longer be there, I forgot to remove them!

I just updated the install.sh script; please see if things work for you now. I am closing this issue, since we know the reason for the error you had.