clemsgrs / hipt

Re-implementation of HIPT

Some questions #2

Closed bryanwong17 closed 1 year ago

bryanwong17 commented 1 year ago

Hi @clemsgrs, I'm trying to use your HIPT implementation with the CAMELYON16 dataset and have a few questions:

  1. Since the TCGA and CAMELYON16 datasets are different, should I initialize from the pretrained checkpoints and retrain the DINO models at both the local and global levels? How many epochs do you think would be sufficient?
  2. What should I put inside train.csv, tune.csv, and test.csv?
  3. Why is img_size_4096 equal to 3584? Is it the teacher size at the region level?
  4. What is label_name in default.yaml?
  5. Where do I put the actual labels for each slide?
  6. Could you help me solve this issue?

wandb: Appending key for api.wandb.ai to your netrc file: /home/bryan/.netrc
Error executing job with overrides: []
Traceback (most recent call last):
  File "extract_features.py", line 159, in <module>
    main()
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/main.py", line 90, in decorated_main
    _run_hydra(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/utils.py", line 389, in _run_hydra
    _run_app(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/utils.py", line 452, in _run_app
    run_and_report(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/utils.py", line 216, in run_and_report
    raise ex
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/utils.py", line 213, in run_and_report
    return func()
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/utils.py", line 453, in <lambda>
    lambda: hydra.run(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "extract_features.py", line 39, in main
    wandb_run = initialize_wandb(cfg, key=key)
  File "/mnt/d/hipt/source/utils.py", line 67, in initialize_wandb
    group=cfg.wandb.group,
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 359, in __getattr__
    self._format_and_raise(key=key, value=None, cause=e)
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/omegaconf/base.py", line 231, in _format_and_raise
    format_and_raise(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/omegaconf/_utils.py", line 819, in format_and_raise
    _raise(ex, cause)
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/omegaconf/_utils.py", line 797, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 351, in __getattr__
    return self._get_impl(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 442, in _get_impl
    node = self._get_child(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/omegaconf/basecontainer.py", line 73, in _get_child
    child = self._get_node(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 475, in _get_node
    self._validate_get(key)
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 164, in _validate_get
    self._format_and_raise(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/omegaconf/base.py", line 231, in _format_and_raise
    format_and_raise(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/omegaconf/_utils.py", line 899, in format_and_raise
    _raise(ex, cause)
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/omegaconf/_utils.py", line 797, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace
omegaconf.errors.ConfigAttributeError: Key 'group' is not in struct
    full_key: wandb.group
    object_type=dict

Thank you!

clemsgrs commented 1 year ago

Hi @bryanwong17, happy to help:

1. Since the TCGA and CAMELYON16 datasets are different, should I initialize from the pretrained checkpoints and retrain the DINO models at both the local and global levels? How many epochs do you think would be sufficient?

As long as you keep the img_size_* and patch_size_* arguments equal to what's in the default.yaml config, you can use the same pre-trained checkpoints. Otherwise, you could still use these checkpoints but it could lead to poor results (for different reasons).

2. What should I put inside train.csv, tune.csv, and test.csv?

Sorry, I forgot to give details about this in the README.md. These files should contain the slide ids and the corresponding classification labels:

slide_id,label
TRAIN_1,1
TRAIN_2,1
...
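If it helps, a quick way to generate these files could look like the following (a rough sketch, assuming you already have a slide_id -> label mapping; the 80/10/10 split is purely illustrative and should be replaced by your own fold definition):

import pandas as pd

# hypothetical slide_id -> label mapping (1 = tumor, 0 = normal)
labels = {"tumor_001": 1, "tumor_002": 1, "normal_001": 0, "normal_002": 0}
df = pd.DataFrame(sorted(labels.items()), columns=["slide_id", "label"])

# illustrative 80/10/10 split
train = df.sample(frac=0.8, random_state=0)
rest = df.drop(train.index)
tune = rest.sample(frac=0.5, random_state=0)
test = rest.drop(tune.index)

train.to_csv("train.csv", index=False)
tune.to_csv("tune.csv", index=False)
test.to_csv("test.csv", index=False)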

3. Why is img_size_4096 equal to 3584? Is it the teacher size at the region level?

This has to do with how the intermediate Transformer block was pre-trained (using crops). See the HIPT author's explanation:

"To be more exact, technically the image size for VisionTransformer4K should be 3584 while the patch size is 256, as during pretraining, as the maximum global crop size is [14 x 14] in a [16 x 16 x 384] 2D grid of pre-extracted feature embeddings of 256-sized patches."

4. What is label_name in default.yaml?

label_name should correspond to the name of the column holding the labels in your train.csv & tune.csv files. If the column name is "label", then label_name = "label".

5. Where do I put the actual labels for each slide?

See the previous answers.

6. Could you help me solve this issue?

Just add a blank wandb.group parameter to your config file:

wandb:
  project: 'hipt'
  username: 'clemsg'
  exp_name: 'toy_training'
  dir: '/home/user'
  to_log: ['loss', 'auc', 'kappa', 'roc_auc_curve']
  group:

Let me know if you have any further questions.

bryanwong17 commented 1 year ago

Hi @clemsgrs, thanks for your answers.

Are you suggesting I use the checkpoint models provided by the HIPT authors? Is it unnecessary to retrain DINO starting from the initialized checkpoints, since that could give poorer results?

I still get the error below:

Traceback (most recent call last):
  File "extract_features.py", line 158, in <module>
    main()
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/main.py", line 90, in decorated_main
    _run_hydra(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/utils.py", line 389, in _run_hydra
    _run_app(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/utils.py", line 452, in _run_app
    run_and_report(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/utils.py", line 216, in run_and_report
    raise ex
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/utils.py", line 213, in run_and_report
    return func()
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/utils.py", line 453, in <lambda>
    lambda: hydra.run(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 105, in run
    cfg = self.compose_config(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 594, in compose_config
    cfg = self.config_loader.load_configuration(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/config_loader_impl.py", line 141, in load_configuration
    return self._load_configuration_impl(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/config_loader_impl.py", line 235, in _load_configuration_impl
    self._process_config_searchpath(config_name, parsed_overrides, caching_repo)
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/config_loader_impl.py", line 158, in _process_config_searchpath
    loaded = repo.load_config(config_path=config_name)
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/config_repository.py", line 349, in load_config
    ret = self.delegate.load_config(config_path=config_path)
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/config_repository.py", line 92, in load_config
    ret = source.load_config(config_path=config_path)
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/core_plugins/file_config_source.py", line 31, in load_config
    cfg = OmegaConf.load(f)
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/omegaconf/omegaconf.py", line 192, in load
    obj = yaml.load(file, Loader=get_yaml_loader())
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/yaml/__init__.py", line 81, in load
    return loader.get_single_data()
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/yaml/constructor.py", line 49, in get_single_data
    node = self.get_single_node()
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/yaml/composer.py", line 36, in get_single_node
    document = self.compose_document()
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/yaml/composer.py", line 55, in compose_document
    node = self.compose_node(None, None)
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/yaml/composer.py", line 84, in compose_node
    node = self.compose_mapping_node(anchor)
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/yaml/composer.py", line 133, in compose_mapping_node
    item_value = self.compose_node(node, item_key)
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/yaml/composer.py", line 84, in compose_node
    node = self.compose_mapping_node(anchor)
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/yaml/composer.py", line 127, in compose_mapping_node
    while not self.check_event(MappingEndEvent):
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/yaml/parser.py", line 98, in check_event
    self.current_event = self.state()
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/yaml/parser.py", line 428, in parse_block_mapping_key
    if self.check_token(KeyToken):
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/yaml/scanner.py", line 116, in check_token
    self.fetch_more_tokens()
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/yaml/scanner.py", line 173, in fetch_more_tokens
    return self.fetch_stream_end()
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/yaml/scanner.py", line 377, in fetch_stream_end
    self.remove_possible_simple_key()
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/yaml/scanner.py", line 318, in remove_possible_simple_key
    raise ScannerError("while scanning a simple key", key.mark,
yaml.scanner.ScannerError: while scanning a simple key
  in "/mnt/d/hipt/config/feature_extraction/default.yaml", line 28, column 3
could not find expected ':'
  in "/mnt/d/hipt/config/feature_extraction/default.yaml", line 28, column 10

clemsgrs commented 1 year ago

There's something wrong in your config file, could you paste it here?

bryanwong17 commented 1 year ago

output/camelyon16/features/hipt/local already exists! deleting it... done
wandb: Appending key for api.wandb.ai to your netrc file: /home/bryan/.netrc
wandb: Currently logged in as: bryanwong9095. Use wandb login --relogin to force relogin
wandb: WARNING Path /home/user/wandb/wandb/ wasn't writable, using system temp directory.
wandb: WARNING Path /home/user/wandb/wandb/ wasn't writable, using system temp directory
wandb: Tracking run with wandb version 0.13.7
wandb: Run data is saved locally in /tmp/wandb/run-20230109_221054-2t7893ns
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run feature_extraction
wandb: ⭐️ View project at https://wandb.ai/bryanwong9095/hipt
wandb: 🚀 View run at https://wandb.ai/bryanwong9095/hipt/runs/2t7893ns
wandb: WARNING Saving files without folders. If you want to preserve sub directories pass base_path to wandb.save, i.e. wandb.save("/mnt/folder/file.h5", base_path="/mnt")
Loading pretrained weights for ViT_256 model...
Error executing job with overrides: []
Traceback (most recent call last):
  File "extract_features.py", line 158, in <module>
    main()
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/main.py", line 90, in decorated_main
    _run_hydra(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/utils.py", line 389, in _run_hydra
    _run_app(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/utils.py", line 452, in _run_app
    run_and_report(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/utils.py", line 216, in run_and_report
    raise ex
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/utils.py", line 213, in run_and_report
    return func()
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/utils.py", line 453, in <lambda>
    lambda: hydra.run(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "extract_features.py", line 48, in main
    model = LocalFeatureExtractor(
  File "/mnt/d/hipt/source/models.py", line 565, in __init__
    state_dict = torch.load(pretrain_256, map_location="cpu")
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/torch/serialization.py", line 795, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/torch/serialization.py", line 1002, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, 'v'.
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: Synced feature_extraction: https://wandb.ai/bryanwong9095/hipt/runs/2t7893ns
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 1 other file(s)
wandb: Find logs at: /tmp/wandb/run-20230109_221054-2t7893ns/logs

bryanwong17 commented 1 year ago

data_dir: 'data'
output_dir: 'output'

dataset_name: 'camelyon16'
experiment_name: 'hipt'
slide_list: '${data_dir}/${dataset_name}/slide_list.txt'

region_dir: '/mnt/c/CAMELYON16/seegene/patch_256_resolution_factor_4/train'

resume: False

region_size: 256
format: 'jpg'
level: 'local'

pretrain_256: 'checkpoints/vit_256_small_dino.pth'
freeze_256: True

pretrain_4096: 'checkpoints/vit_4096_xs_dino.pth'
freeze_4096: True

wandb:
project: 'hipt'
username: 'bryanwong9095'
exp_name: 'feature_extraction'
dir: '/home/user/wandb'
to_log: ['loss', 'auc', 'kappa', 'roc_auc_curve']
group:

clemsgrs commented 1 year ago

In your config file, there needs to be an indent after the wandb key:

wandb:
  project: 'hipt'
  username: 'bryanwong9095'
  exp_name: 'feature_extraction'
  dir: '/home/user/wandb'
  to_log: ['loss', 'auc', 'kappa', 'roc_auc_curve']
  group:

Regarding the official checkpoints: you don't need to retrain the Transformers using DINO if you use the same img_size and patch_size arguments the authors used during self-supervised pre-training (you can find these values in the default.yaml file).

If you want to change one of these arguments, let's say patch_size_4096, then you could still use the pre-trained checkpoints, but some weights won't load because of mismatching shapes. Given these weights are often frozen right after loading the checkpoint, this could hamper model performance.

bryanwong17 commented 1 year ago

Hi @clemsgrs, in default.yaml, img_size_256 equals 224. So should the actual input images be 256 or 224 (when extracting patches), given that I want to use the official checkpoints?

Also, I tried disabling wandb and this is what I got:

output/camelyon16/features/hipt/local already exists! deleting it... done
Loading pretrained weights for ViT_256 model...
Take key teacher in provided checkpoint dict
Pretrained weights found at checkpoints/vit256_small_dino.pth
150 weight(s) loaded succesfully ; 0 weight(s) not loaded because of mismatching shapes
Freezing pretrained ViT_256 model
Done
367 slides with extracted patches found
restricting to 399 slides from slide list .txt file

Slide Encoding: 0%| | 0/399 [00:00<?, ? slide/s]
Error executing job with overrides: []
Traceback (most recent call last):
  File "extract_features.py", line 159, in <module>
    main()
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/main.py", line 90, in decorated_main
    _run_hydra(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/utils.py", line 389, in _run_hydra
    _run_app(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/utils.py", line 452, in _run_app
    run_and_report(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/utils.py", line 216, in run_and_report
    raise ex
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/utils.py", line 213, in run_and_report
    return func()
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/utils.py", line 453, in <lambda>
    lambda: hydra.run(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "extract_features.py", line 138, in main
    stacked_features = torch.stack(features, dim=0).squeeze(1)
RuntimeError: stack expects a non-empty TensorList

clemsgrs commented 1 year ago

Hi, I agree the logic behind img_size_256 and the actual input image size is a bit sketchy (it's inherited from the original HIPT implementation). If you extracted [4096, 4096] regions using HS2P, you shouldn't worry about the img_size_* and patch_size_* parameters (use the values in the default.yaml file). If you extracted regions of a different shape, write it down here and I'll explain which img_size_* and patch_size_* values to use.

Regarding your error, you can get a hint at what is happening if you read the output:

367 slides with extracted patches found
restricting to 399 slides from slide list .txt file

This means that 399 - 367 = 32 slides are missing patches (probably HS2P found 0 valid patches in these slides). Hence feature extraction is failing, because HIPT requires each slide to have at least one patch.

You can either remove these slides from your slide list .txt file, or re-run HS2P on these slides with different parameters in order to have at least 1 valid patch per slide.
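If you go the first route, something along these lines should work (a rough sketch, assuming HS2P saved patches as <region_dir>/<slide_id>/*.jpg — double-check that layout on your side):

from pathlib import Path

# hypothetical paths, adjust to your own setup
slide_list_path = Path("data/camelyon16/slide_list.txt")
region_dir = Path("/mnt/c/CAMELYON16/seegene/patch_256_resolution_factor_4/train")

slide_ids = [s.strip() for s in slide_list_path.read_text().splitlines() if s.strip()]

# keep only slides for which at least one extracted patch exists on disk
kept = [sid for sid in slide_ids if any((region_dir / sid).glob("*.jpg"))]

print(f"keeping {len(kept)} / {len(slide_ids)} slides")
slide_list_path.write_text("\n".join(kept) + "\n")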

bryanwong17 commented 1 year ago

Hi @clemsgrs, OK, that makes sense. I will re-run HS2P and make sure the number of slides with extracted patches matches the .txt file.

Just a small question about the README.md: it mentions that I should look at config/feature_extraction/global.yaml or local.yaml. However, I was unable to locate them. Could it be a typo?

Extract region-level features : take a look at config/feature_extraction/global.yaml. Make sure level is set to 'global'.

Extract patch-level features : take a look at config/feature_extraction/local.yaml. Make sure level is set to 'local'.

clemsgrs commented 1 year ago

I must have removed them in a previous commit, sorry for the confusion. Here are the only things that differ when extracting region-level vs. patch-level features:

bryanwong17 commented 1 year ago

Hi @clemsgrs, is the region_size still 4096 for extracting patch-level features?

clemsgrs commented 1 year ago

Yes, if that's the patch_size you used when running HS2P.

bryanwong17 commented 1 year ago

Hi @clemsgrs, just to make sure I understand correctly: the only thing I need to do when running HS2P is to extract 4096 regions (not both 256 and 4096), since the pretrained models are already provided. Is that correct?

clemsgrs commented 1 year ago

Yes, you only need to extract 4096 regions & save them to disk as images. Then, HIPT will read these [4096, 4096] regions & automatically divide them into smaller [256, 256] patches.
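For intuition, that dividing step boils down to something like this (a rough sketch using torch.Tensor.unfold, not the exact code from this repo):

import torch

region = torch.rand(3, 4096, 4096)  # one [4096, 4096] RGB region

# cut into non-overlapping [256, 256] patches -> [3, 16, 16, 256, 256]
patches = region.unfold(1, 256, 256).unfold(2, 256, 256)
# flatten the 16 x 16 grid into a batch of 256 patches
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3, 256, 256)

print(patches.shape)  # torch.Size([256, 3, 256, 256])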

bryanwong17 commented 1 year ago

Hi @clemsgrs, I am able to run it now, but I have a problem with WSL on my computer. As soon as I fix it, I'll let you know whether the code runs properly.


bryanwong17 commented 1 year ago

Hi @clemsgrs, I managed to extract features now on my computer. Is "local" extracting [M, 256, 384] features while "global" is extracting [M, 192]? Which one should I use to train HIPT once I've finished extracting both 'local' and 'global'?

From what I understand of the original HIPT code, the inputs for training HIPT_LGP_FC are [M, 256, 384]; then, if we set 'pretrain_4k != None', it would load 'vit4k_xs_dino.pth' and change the dimension to [M, 192]? Then, we set 'freeze_4k=True'?

Is it possible to modify your code to use "cuda:0" for both device_256 and device_4096 (to utilize only 1 GPU)? Will it impact final performance?

self.device_256 = torch.device("cuda:0")
self.device_4096 = torch.device("cuda:0")

Also, are your checkpoint models the same as those provided by HIPT's authors?

clemsgrs commented 1 year ago

Hi @bryanwong17, glad to hear you managed to extract features. Yes, "local" is extracting [M, 256, 384] while "global" is extracting [M, 192] features.

When it comes to training HIPT using these features, there are 2 ways to go:

We're using 2 GPUs for global feature extraction for memory reasons. You might be able to do it with only one GPU: it won't impact performance in any way, but you're more likely to run into out-of-memory issues.

For training, you only need one GPU (regardless of the level at which you work, local or global).

My checkpoint models are the exact same as those provided by HIPT authors.

bryanwong17 commented 1 year ago

Dear @clemsgrs, thank you for your detailed explanation. I really appreciate it.

FYI, I was able to extract 'global' features using a single RTX 3080 Ti GPU, but I couldn't use my computer for anything else in the meantime, otherwise it would turn off automatically :)

Quick question: can I just follow all the hyperparameters in config/training/global.yaml for "global" training, and do the same for "local"?

If I'm not mistaken, is 'global' the final implementation of HIPT?

clemsgrs commented 1 year ago

No problem! Nice one, your GPU probably has more memory than the one I used 👍 You should indeed be able to just follow the config/training/global.yaml and config/training/local.yaml files. Here are the parameters you should tailor to your use-case:

num_classes
label_name
level
features_dir
dataset_name
fold_num

The other ones should be good to use as is.
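For instance, for a binary CAMELYON16 setup, the tailored part of config/training/global.yaml could look something like this (illustrative values only — adjust features_dir to wherever your extracted features ended up):

num_classes: 2
label_name: 'label'
level: 'global'
features_dir: 'output/camelyon16/features/hipt/global'
dataset_name: 'camelyon16'
fold_num: 0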

Yes, 'global' matches the actual implementation of HIPT!

bryanwong17 commented 1 year ago

Hi @clemsgrs,

  1. I noticed you set early_stopping.enable = False in both config/training/global.yaml and config/training/local.yaml. Shouldn't it be set to True?

  2. I still don't understand how finetuning/transfer learning with DINO could make HIPT worse. Could you explain it? So far, I have only managed to get 0.729 AUC on the CAMELYON16 dataset. Given that the pretrained model was not trained on that dataset and that only a small number of slides were used in the experiment (train: 270 slides, test: 129 slides), the outcome is probably reasonable.

  3. Why are there no probabilities for some slides when I look at train_0.csv? Is that normal?

slide_id,label,prob_0,prob_1
tumor_051,1,0.6241212487220764,0.37587878108024597
tumor_052,1,0.5511172413825989,0.4488827884197235
normal_056,0,0.45074132084846497,0.5492586493492126
tumor_028,1,0.6926500797271729,0.30734992027282715
normal_068,0,0.7595410346984863,0.24045898020267487
tumor_069,1,,
tumor_079,1,,
tumor_074,1,0.6658058762550354,0.3341941237449646
normal_065,0,0.32616323232650757,0.6738367676734924
normal_135,0,0.5196705460548401,0.4803294241428375
tumor_066,1,0.42442524433135986,0.5755747556686401
normal_049,0,0.7886528968811035,0.21134710311889648
tumor_018,1,,
normal_043,0,,
tumor_063,1,0.34177008271217346,0.6582298874855042
tumor_086,1,0.4489952027797699,0.5510047674179077
tumor_054,1,0.35362136363983154,0.6463786959648132
normal_005,0,0.687300980091095,0.3126990497112274
tumor_025,1,,
tumor_053,1,0.6124722361564636,0.38752782344818115
normal_131,0,,
tumor_085,1,0.43350374698638916,0.5664963126182556
tumor_044,1,,
normal_046,0,,
  4. I can train 'global' perfectly and get the result, but not 'local', as shown below:
Number of [256,256] patches in [4096,4096] image: 256
Loading pretrained weights for ViT_4096 model...
Take key teacher in provided checkpoint dict
Pretrained weights found at checkpoints/vit4k_xs_dino.pth
77 weight(s) loaded succesfully ; 1 weight(s) not loaded because of mismatching shapes
Total number of parameters: 3388035
Total number of trainable parameters: 3388035
Loading data for fold 0
Training & Tuning on 100% of the data
Error executing job with overrides: []
Traceback (most recent call last):
  File "train.py", line 244, in <module>
    main()
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/main.py", line 90, in decorated_main
    _run_hydra(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/utils.py", line 389, in _run_hydra
    _run_app(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/utils.py", line 452, in _run_app
    run_and_report(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/utils.py", line 216, in run_and_report
    raise ex
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/utils.py", line 213, in run_and_report
    return func()
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/utils.py", line 453, in <lambda>
    lambda: hydra.run(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "train.py", line 140, in main
    verbose=cfg.early_stopping.verbose
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 359, in __getattr__
    self._format_and_raise(key=key, value=None, cause=e)
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/omegaconf/base.py", line 231, in _format_and_raise
    format_and_raise(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/omegaconf/_utils.py", line 819, in format_and_raise
    _raise(ex, cause)
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/omegaconf/_utils.py", line 797, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 351, in __getattr__
    return self._get_impl(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 442, in _get_impl
    node = self._get_child(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/omegaconf/basecontainer.py", line 73, in _get_child
    child = self._get_node(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 475, in _get_node
    self._validate_get(key)
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/omegaconf/dictconfig.py", line 164, in _validate_get
    self._format_and_raise(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/omegaconf/base.py", line 231, in _format_and_raise
    format_and_raise(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/omegaconf/_utils.py", line 899, in format_and_raise
    _raise(ex, cause)
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/omegaconf/_utils.py", line 797, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace
omegaconf.errors.ConfigAttributeError: Key 'verbose' is not in struct
    full_key: early_stopping.verbose
    object_type=dict
clemsgrs commented 1 year ago

Hi,

  1. you can leave it set to False unless you want your model to stop training when early_stopping.tracking has not improved over the last early_stopping.patience epochs.

  2. you can use DINO to self-supervise HIPT pre-training on your new dataset, but then you'll have to implement it (I haven't had time to dive into this yet). In my opinion, you should be good to go by loading the official HIPT pre-trained weights, without pre-training the model yourself. Maybe training on 'local' features will help you get better results as there are more trainable parameters in that case. But given the relatively small size of your dataset, this could also lead to overfitting.

  3. it means not all of your training slides were used during epoch 0. This is because of the weighted sampler (see the sampler sketch at the end of this comment). If you really want to have probabilities for all your training slides, you can turn off weighted sampling by setting weighted_sampling: False in your config. If you look at your tuning .csv files, there shouldn't be any missing slide probabilities, given the weighted sampler is only used for training.

  4. did you add verbose=cfg.early_stopping.verbose when initialising EarlyStopping? If so, you'll need to add a verbose parameter to your config file:

early_stopping:
  enable: False
  tracking: 'loss'
  min_max: 'min'
  patience: 10
  min_epoch: 50
  verbose: False

BTW, based on your output ("77 weight(s) loaded succesfully ; 1 weight(s) not loaded because of mismatching shapes"), I realised that the parameter img_size_4096 is set to 4096 in the local.yaml config file, which is causing this warning. You should change it to img_size_4096: 3584. Sorry for that!
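As for the weighted sampler mentioned in point 3, it boils down to class-balanced sampling along these lines (a rough sketch, not the exact code in this repo):

import torch
from torch.utils.data import WeightedRandomSampler

labels = torch.tensor([1, 1, 0, 1, 0, 0, 0])   # one label per training slide
class_counts = torch.bincount(labels)           # e.g. tensor([4, 3])
weights = 1.0 / class_counts[labels].float()    # rarer class -> larger weight

# sampling is done with replacement, so some slides may not be drawn in a given
# epoch, which is why their probabilities are missing from train_0.csv
sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
# the sampler is then passed to the training DataLoader via sampler=sampler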

bryanwong17 commented 1 year ago

Yes, I did. Indeed, I think the main problem when training 'local' is that there is 1 weight which is not loaded properly:

Number of [256,256] patches in [4096,4096] image: 256
Loading pretrained weights for ViT_4096 model...
Take key teacher in provided checkpoint dict
Pretrained weights found at checkpoints/vit4k_xs_dino.pth
77 weight(s) loaded succesfully ; 1 weight(s) not loaded because of mismatching shapes
Total number of parameters: 3388035
Total number of trainable parameters: 3388035
Loading data for fold 0
Training & Tuning on 100% of the data
Error executing job with overrides: []
clemsgrs commented 1 year ago

Exactly, I edited my answer above in the meantime!

bryanwong17 commented 1 year ago

Hi @clemsgrs, thank you for your detailed response. I do have a follow-up question about DINO, especially regarding this part: "you should be good to go by loading the official HIPT pre-trained weights, without pre-training the model yourself". Do you mean I can resume training DINO from the official HIPT pre-trained weights? Also, I am not sure how many epochs I should train DINO for if I implement and train it myself. Any ideas?

clemsgrs commented 1 year ago

I meant you can train HIPT on CAMELYON16 data by loading pre-trained weights provided by the authors (without needing to pre-train using DINO).

If you want to pre-train using DINO, then you can start from scratch (i.e. you don't need the pre-trained weights provided by the authors). Regarding the number of epochs required for pre-training, I'd suggest looking at HIPT paper again (they might explain how many epochs they pre-trained the model for) & the official repo's Issues (maybe someone asked the same question). Otherwise, I assume DINO has a way to track when pre-training has reached an endpoint.

bryanwong17 commented 1 year ago

Hi @clemsgrs, OK, thanks for the confirmation. Anyway, I tried to run 'local' again, but now it stopped in the middle of training, as shown below:

Number of [256,256] patches in [3584,3584] image: 196
Loading pretrained weights for ViT_4096 model...
Take key teacher in provided checkpoint dict
Pretrained weights found at checkpoints/vit4k_xs_dino.pth
78 weight(s) loaded succesfully ; 0 weight(s) not loaded because of mismatching shapes
Total number of parameters: 3376515
Total number of trainable parameters: 3376515
Loading data for fold 0
Training & Tuning on 100% of the data
Train - Epoch 1:  30%|██████              | 75.0/248 [00:14<00:27, 6.28 slide/s]
Error executing job with overrides: []
Traceback (most recent call last):
  File "train.py", line 244, in <module>
    main()
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/main.py", line 90, in decorated_main
    _run_hydra(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/utils.py", line 389, in _run_hydra
    _run_app(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/utils.py", line 452, in _run_app
    run_and_report(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/utils.py", line 216, in run_and_report
    raise ex
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/utils.py", line 213, in run_and_report
    return func()
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/utils.py", line 453, in <lambda>
    lambda: hydra.run(
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "train.py", line 151, in main
    train_results = train(
  File "/mnt/d/hipt/source/utils.py", line 407, in train
    for i, batch in enumerate(t):
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/tqdm/_tqdm.py", line 1032, in __iter__
    for obj in iterable:
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/mnt/d/hipt/source/dataset.py", line 61, in __getitem__
    features = torch.load(fp)
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/torch/serialization.py", line 777, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/home/bryan/anaconda3/envs/hipt_try/lib/python3.8/site-packages/torch/serialization.py", line 282, in __init__
    super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
clemsgrs commented 1 year ago

Hi, I've never run into such an error. It could be a problem with the PyTorch version and the saving mechanism. Maybe try updating PyTorch & re-generating the local features with extract_features.py.
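Before re-extracting everything, you could first check which feature files are corrupted (a quick sketch, assuming features were saved as one .pt file per slide — adjust features_dir to your setup):

from pathlib import Path
import torch

features_dir = Path("output/camelyon16/features/hipt/local")  # hypothetical path

bad = []
for fp in sorted(features_dir.glob("*.pt")):
    try:
        torch.load(fp, map_location="cpu")
    except Exception as e:
        bad.append((fp.name, e))

print(f"{len(bad)} corrupted feature file(s)")
for name, err in bad:
    print(name, "->", err)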

bryanwong17 commented 1 year ago

Hi @clemsgrs, I have some questions regarding the implementation of HIPT. What if the image size were [224, 224] instead of [256, 256]? What should I modify in the 2nd stage? Would the dimension of the extracted patches be [3584, 3584] rather than [4096, 4096]? What about the input to DINO's 2nd stage? Is it still [256, 384]?

clemsgrs commented 1 year ago

If you pull the latest version of master, you should be able to use the true image size, that is region_size = 4096 if you're using [4096, 4096] regions (you no longer have to use 3584).

bryanwong17 commented 1 year ago

Hi @clemsgrs, thanks for the explanation. Could you please explain why train_batch_size is only 1? Is it because we could have a different number of regions per slide? Is it possible to use a batch size greater than 1?

clemsgrs commented 1 year ago

Hi, as far as I understand, yes: we're using train_batch_size = 1 mainly because different slides have different numbers of regions. Doing so, we don't need to worry about padding batched sequences to a common length.

In theory, if you take care of padding sequences, you should be able to use bigger train batch sizes when training on global features. However, when working with local features, I think it's quite challenging to train with bigger batch sizes, given the number of regions per slide acts as the effective batch size in that case. Having a batch size > 1 would add an extra dimension to the objects that are being manipulated under the hood: I'm not sure how one could make it work smoothly.

bryanwong17 commented 1 year ago

Hi, actually, I was also thinking about padding the sequences to make them the same shape. However, I suspect it might not be ideal, especially when padding the slide with the smallest number of regions up to the length of the slide with the greatest number of regions. Do you have any ideas on how to solve this problem, or a special padding technique? My concern is that training with train_batch_size = 1 could lead to overfitting/underfitting and unstable training.

clemsgrs commented 1 year ago

You can pad mini-batch-wise instead of dataset-wise, that is, write a custom collate_fn which pads the current mini-batch to the biggest sequence length within that mini-batch.

I might try to implement it, will give an update here if I ever do so.
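In the meantime, here is a minimal sketch of such a collate_fn (assuming each dataset item returns a [M, d] feature tensor and an integer label; how you use the padding mask downstream is up to you):

import torch

def pad_collate(batch):
    # batch: list of (features [M_i, d], label) tuples with varying M_i
    features, labels = zip(*batch)
    max_len = max(f.shape[0] for f in features)

    padded, masks = [], []
    for f in features:
        pad = max_len - f.shape[0]
        padded.append(torch.cat([f, f.new_zeros(pad, f.shape[1])], dim=0))
        masks.append(torch.cat([torch.ones(f.shape[0]), torch.zeros(pad)]))

    # shapes: [B, max_len, d], [B, max_len], [B]
    return torch.stack(padded), torch.stack(masks), torch.tensor(labels)

# usage (hypothetical): DataLoader(dataset, batch_size=8, collate_fn=pad_collate)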