IMSY-DKFZ / htc

Semantic organ segmentation for hyperspectral images.

[Question] HTC training configs #17

Closed alfieroddan closed 9 months ago

alfieroddan commented 9 months ago

:question: Question

Hi there, I'm having some trouble using the htc training CLI entry point. Could you point me in the right direction?

Description

Following the networktraining.ipynb tutorial, I create a config using the following script:

"""
Small script to generate a config for training.
"""

import htc
from pathlib import Path

def main():
    "Simple main function"
    # load default config
    config = htc.Config.from_model_name("default", "image")
    # data and image
    config["input/data_spec"] = "/home/tay/Code/ms-seg/hs_tools/pigs_thoracic_2folds.json"
    # inherits
    config["inherits"] = "models/image/configs/default"
    # ensure all three annotations
    config["input/annotation_name"] = [
        "polygon#annotator1",
        "polygon#annotator2",
        "polygon#annotator3"
    ]
    # merge annotations from all annotators into one mask
    config["input/merge_annotations"] = "union"
    # model
    config["model/pretrained_model"] = {
        "model": "image",
        # "2022-02-03_22-58-44_generated_default_model_comparison"
        "run_folder": "2023-02-08_14-48-02_organ_transplantation_0.8",
    }
    # devices
    config["trainer_kwargs/devices"] = 1
    # save config
    save_path = Path("hs_tools/Test-config.json")
    config.save_config(save_path)

if __name__ == "__main__":
    main()

This results in the following config:

{
    "config_name": "default",
    "dataloader_kwargs": {
        "batch_size": 5,
        "num_workers": 1
    },
    "inherits": "models/image/configs/default",
    "input": {
        "annotation_name": [
            "polygon#annotator1",
            "polygon#annotator2",
            "polygon#annotator3"
        ],
        "data_spec": "/home/tay/Code/ms-seg/hs_tools/pigs_thoracic_2folds.json",
        "epoch_size": 500,
        "merge_annotations": "union",
        "n_channels": 100,
        "preprocessing": "L1",
        "transforms_gpu": [
            {
                "class": "KorniaTransform",
                "degrees": 45,
                "p": 0.5,
                "padding_mode": "reflection",
                "scale": [
                    0.9,
                    1.1
                ],
                "transformation_name": "RandomAffine",
                "translate": [
                    0.0625,
                    0.0625
                ]
            },
            {
                "class": "KorniaTransform",
                "p": 0.25,
                "transformation_name": "RandomHorizontalFlip"
            },
            {
                "class": "KorniaTransform",
                "p": 0.25,
                "transformation_name": "RandomVerticalFlip"
            }
        ]
    },
    "label_mapping": "htc.settings_seg>label_mapping",
    "lightning_class": "htc.models.image.LightningImage>LightningImage",
    "model": {
        "architecture_kwargs": {
            "encoder_name": "efficientnet-b5",
            "encoder_weights": "imagenet"
        },
        "architecture_name": "Unet",
        "model_name": "ModelImage",
        "pretrained_model": {
            "model": "image",
            "run_folder": "2023-02-08_14-48-02_organ_transplantation_0.8"
        }
    },
    "optimization": {
        "lr_scheduler": {
            "gamma": 0.99,
            "name": "ExponentialLR"
        },
        "optimizer": {
            "lr": 0.001,
            "name": "Adam",
            "weight_decay": 0
        }
    },
    "swa_kwargs": {
        "annealing_epochs": 0
    },
    "trainer_kwargs": {
        "accelerator": "gpu",
        "devices": 1,
        "max_epochs": 100,
        "precision": "16-mixed"
    },
    "validation": {
        "checkpoint_metric": "dice_metric",
        "dataset_index": 0
    }
}

The pigs_thoracic_2folds.json data spec referenced above was generated by the tutorial.

When I run the following:

PATH_Tivita_HeiPorSPECTRAL=/media/tay/4TB/Datasets/HeiPorSPECTRAL PATH_HTC_RESULTS=results/ htc training --model image --config /home/tay/Code/ms-seg/hs_tools/Test-config.json

I get the following error:

(env) tay@tay:~/Code/ms-seg$ PATH_Tivita_HeiPorSPECTRAL=/media/tay/4TB/Datasets/HeiPorSPECTRAL PATH_HTC_RESULTS=results/ htc training --model image --config /home/tay/Code/ms-seg/hs_tools/Test-config.json
[INFO][htc] Starting training of the fold fold_P093 [1/2]                                                        run_training.py:296
[CRITICAL][htc] Uncaught exception:                                                                              run_training.py:379
Traceback (most recent call last):
  File "/home/tay/Code/ms-seg/env/lib/python3.10/site-packages/htc/models/run_training.py", line 395, in <module>
    fold_trainer = FoldTrainer(args.model, args.config, config_extends)
  File "/home/tay/Code/ms-seg/env/lib/python3.10/site-packages/htc/models/run_training.py", line 36, in __init__
    self.config = Config.from_model_name(config_name, model_name, use_shared_dict=True)
  File "/home/tay/Code/ms-seg/env/lib/python3.10/site-packages/htc/utils/Config.py", line 436, in from_model_name
    raise ValueError(
ValueError: Cannot find the configuration file Test-config. Tried the following locations: [PosixPath('Test-config.json'), PosixPath('/home/tay/Code/ms-seg/env/lib/python3.10/site-packages/htc/models/Test-config.json'), PosixPath('/home/tay/Code/ms-seg/env/lib/python3.10/site-packages/htc/Test-config.json'), PosixPath('/home/tay/Code/ms-seg/env/lib/python3.10/site-packages/htc/Test-config.json'), PosixPath('/home/tay/Code/ms-seg/env/lib/python3.10/site-packages/htc/models/image/configs/Test-config.json')]
[ERROR][htc] Training of the fold fold_P093 was not successful (returncode=1)                                    run_training.py:299
[INFO][htc] Starting training of the fold fold_P094 [2/2]                                                        run_training.py:296
[CRITICAL][htc] Uncaught exception:                                                                              run_training.py:379
Traceback (most recent call last):
  File "/home/tay/Code/ms-seg/env/lib/python3.10/site-packages/htc/models/run_training.py", line 395, in <module>
    fold_trainer = FoldTrainer(args.model, args.config, config_extends)
  File "/home/tay/Code/ms-seg/env/lib/python3.10/site-packages/htc/models/run_training.py", line 36, in __init__
    self.config = Config.from_model_name(config_name, model_name, use_shared_dict=True)
  File "/home/tay/Code/ms-seg/env/lib/python3.10/site-packages/htc/utils/Config.py", line 436, in from_model_name
    raise ValueError(
ValueError: Cannot find the configuration file Test-config. Tried the following locations: [PosixPath('Test-config.json'), PosixPath('/home/tay/Code/ms-seg/env/lib/python3.10/site-packages/htc/models/Test-config.json'), PosixPath('/home/tay/Code/ms-seg/env/lib/python3.10/site-packages/htc/Test-config.json'), PosixPath('/home/tay/Code/ms-seg/env/lib/python3.10/site-packages/htc/Test-config.json'), PosixPath('/home/tay/Code/ms-seg/env/lib/python3.10/site-packages/htc/models/image/configs/Test-config.json')]
[ERROR][htc] Training of the fold fold_P094 was not successful (returncode=1)                                    run_training.py:299
[ERROR][htc] Some folds were not successful (see error messages above)                                           run_training.py:303
[INFO][htc] Training time for the all folds: 0 minutes and 4.73 seconds                                          run_training.py:305

So apparently self.config = Config.from_model_name(config_name, model_name, use_shared_dict=True) resolves the config by name against a fixed set of locations, instead of using the absolute path I supplied.
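The suspected failure mode can be illustrated with a minimal sketch (my own reconstruction for illustration, not the actual htc code): if the lookup searches candidate directories by file *name*, the directory part of an absolute path is silently discarded, so an existing file is reported as missing.

```python
from pathlib import Path
import tempfile

# Illustrative sketch of the suspected failure mode (NOT the actual htc
# code): searching candidate directories by file name discards the
# directory part of an absolute path.
def find_config(config: str, search_dirs: list[Path]) -> Path:
    name = Path(config).name  # the absolute directory is dropped here
    for candidate in (Path(name), *(d / name for d in search_dirs)):
        if candidate.exists():
            return candidate
    raise ValueError(f"Cannot find the configuration file {Path(config).stem}")

with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "Test-config.json"
    cfg.write_text("{}")  # the file really exists at an absolute path
    try:
        find_config(str(cfg), search_dirs=[])
    except ValueError as e:
        print(e)  # Cannot find the configuration file Test-config
```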

Am I using the above wrong? I followed your tutorial pretty closely.

HTC info

htc framework
- version: 0.0.13
- url: https://github.com/imsy-dkfz/htc
- git commit: 074c0f97ea6e420032c68766f4324b9e0dcb73c2

User settings:
No user settings found. If you want to use your user settings to specify environment variables, please create the file 
/home/tay/.config/htc/variables.env and add your environment variables, for example:
export PATH_HTC_NETWORK="/path/to/your/network/dir"
export PATH_Tivita_my_dataset="~/htc/Tivita_my_dataset:shortcut=my_shortcut"

.env settings:
No .env file found. If you cloned the repository and installed the htc framework in editable mode, you can create a .env file in the
repository root (more precisely, at /home/tay/Code/ms-seg/env/lib/python3.10/site-packages/htc/.env) and fill it with variables, for
example:
export PATH_HTC_NETWORK="/path/to/your/network/dir"
export PATH_Tivita_my_dataset="~/htc/Tivita_my_dataset:shortcut=my_shortcut"

Environment variables:

Datasets:
<htc.utils.Datasets.DatasetAccessor object at 0x7f8e0a7d9600>

Other directories:
[WARNING][htc] Could not find the environment variable PATH_HTC_RESULTS so that a results directory will not be available (scripts which use settings.results_dir will crash)                                                              settings.py:503
None
[WARNING][htc] Could not find an intermediates directory, probably because no data directory was found               settings.py:460
None
src_dir=/home/tay/Code/ms-seg/env/lib/python3.10/site-packages/htc
htc_package_dir=/home/tay/Code/ms-seg/env/lib/python3.10/site-packages/htc

System:
Collecting environment information...
PyTorch version: 2.1.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.27.2
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.2.0-36-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 11.5.119
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090
Nvidia driver version: 535.129.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             24
On-line CPU(s) list:                0-23
Vendor ID:                          GenuineIntel
Model name:                         12th Gen Intel(R) Core(TM) i9-12900K
CPU family:                         6
Model:                              151
Thread(s) per core:                 2
Core(s) per socket:                 16
Socket(s):                          1
Stepping:                           2
CPU max MHz:                        5200.0000
CPU min MHz:                        800.0000
BogoMIPS:                           6374.40
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l2 cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdt_a rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi umip pku ospke waitpkg gfni vaes vpclmulqdq tme rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_lbr ibt flush_l1d arch_capabilities
Virtualisation:                     VT-x
L1d cache:                          640 KiB (16 instances)
L1i cache:                          768 KiB (16 instances)
L2 cache:                           14 MiB (10 instances)
L3 cache:                           30 MiB (1 instance)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-23
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] efficientnet-pytorch==0.7.1
[pip3] numpy==1.26.1
[pip3] pytorch-ignite==0.4.11
[pip3] pytorch-lightning==2.1.0
[pip3] segmentation-models-pytorch==0.3.3
[pip3] torch==2.1.0
[pip3] torchmetrics==1.2.0
[pip3] torchvision==0.16.0
[pip3] triton==2.1.0
[conda] blas                      1.0                         mkl  
[conda] mkl                       2023.1.0         h6d00ec8_46342  
[conda] mkl-service               2.4.0           py311h5eee18b_1  
[conda] mkl_fft                   1.3.6           py311ha02d727_1  
[conda] mkl_random                1.2.2           py311ha02d727_1  
[conda] numpy                     1.24.3          py311h08b1b3b_1  
[conda] numpy-base                1.24.3          py311hf175353_1  
[conda] numpydoc                  1.5.0           py311h06a4308_0  
[conda] pytorch                   2.0.1           cpu_py311h6d93b4c_0
JanSellner commented 9 months ago

There was a bug in the run_training.py script with absolute paths to the config. This is fixed on the latest master by commit 0f60dfd6987723c91ce153cf0cb991a85f5f4381.
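For readers hitting the same error, the fix pattern can be sketched as follows (an illustrative reconstruction, not the contents of the actual commit): honor a supplied path that already exists before falling back to the name-based search over known config directories.

```python
from pathlib import Path
import tempfile

# Illustrative sketch of the fix pattern (NOT the actual htc commit):
# use the supplied path directly if it exists, then fall back to
# searching the known config directories by file name.
def resolve_config(config: str, search_dirs: list[Path]) -> Path:
    path = Path(config)
    if path.exists():  # absolute or cwd-relative path given directly
        return path
    for d in search_dirs:
        candidate = d / path.name
        if candidate.exists():
            return candidate
    raise ValueError(f"Cannot find the configuration file {path.stem}")

# An absolute path now resolves even though no search directory contains it:
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "Test-config.json"
    cfg.write_text("{}")
    assert resolve_config(str(cfg), search_dirs=[]) == cfg
```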

Thank you very much for reporting!