aehrc / cvt2distilgpt2

Improving Chest X-Ray Report Generation by Leveraging Warm-Starting
GNU General Public License v3.0

ValueError: Unexpected keyword arguments: `compute_on_step` #14

Open LYJ0327 opened 10 months ago

LYJ0327 commented 10 months ago

Hi, I'm very sorry to bother you, but whether I run test or train, my program reports the following error:

```
Traceback (most recent call last):
  File "/home/cvt2/bin/dlhpcstarter", line 8, in <module>
    sys.exit(main())
  File "/home/cvt2/lib/python3.11/site-packages/dlhpcstarter/main.py", line 54, in main
    submit(args=args, stages_fnc=stages_fnc)
  File "/home/cvt2/lib/python3.11/site-packages/dlhpcstarter/main.py", line 69, in submit
    stages_fnc(args)
  File "/home/cvt2distilgpt2/stages.py", line 68, in stages
    model = TaskModel(**vars(args))
  File "/home/cvt2distilgpt2/cvt2distilgpt2_iu_x_ray_chen.py", line 66, in __init__
    self.val_coco_metrics = COCOCaptionMetrics(metrics=["bleu", "cider", "rouge"])
  File "/home/cvt2distilgpt2/tools/metrics/coco.py", line 33, in __init__
    super().__init__(dist_sync_on_step=dist_sync_on_step, compute_on_step=False)
  File "/home/cvt2/lib/python3.11/site-packages/torchmetrics/metric.py", line 146, in __init__
    raise ValueError(f"Unexpected keyword arguments: {', '.join(kwargs)}")
ValueError: Unexpected keyword arguments: `compute_on_step`
```

I would be very grateful if you could reply with a solution to my problem!

anicolson commented 10 months ago

Hi, it's due to an update to the torchmetrics package. I'll amend it now.
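For anyone hitting this before pulling the fix: `compute_on_step` was removed from `torchmetrics.Metric` in newer releases, so the metric subclass simply has to stop forwarding it. A minimal sketch of the change in `tools/metrics/coco.py` (the surrounding signature is assumed):

```python
from torchmetrics import Metric

class COCOCaptionMetrics(Metric):
    def __init__(self, metrics=("bleu", "cider", "rouge"), dist_sync_on_step=False):
        # Before: super().__init__(dist_sync_on_step=dist_sync_on_step, compute_on_step=False)
        # `compute_on_step` no longer exists in recent torchmetrics, so it is dropped:
        super().__init__(dist_sync_on_step=dist_sync_on_step)
        self.metrics = list(metrics)
```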

LYJ0327 commented 10 months ago

I'm very sorry, I solved that problem, but I'm running into a new one when testing:

```
Traceback (most recent call last):
  File "/home/cvt2/bin/dlhpcstarter", line 8, in <module>
    sys.exit(main())
  File "/home/cvt2/lib/python3.11/site-packages/dlhpcstarter/main.py", line 54, in main
    submit(args=args, stages_fnc=stages_fnc)
  File "/home/cvt2/lib/python3.11/site-packages/dlhpcstarter/main.py", line 69, in submit
    stages_fnc(args)
  File "/home/cvt2distilgpt2/stages.py", line 104, in stages
    trainer.test(model)
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 754, in test
    return call._call_and_handle_interrupt(
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 102, in launch
    return function(*args, **kwargs)
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 794, in _test_impl
    results = self._run(model, ckpt_path=ckpt_path)
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1028, in _run_stage
    return self._evaluation_loop.run()
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 134, in run
    self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 391, in _evaluation_step
    output = call._call_strategy_hook(trainer, hook_name, *step_args)
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 309, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/strategies/strategy.py", line 416, in test_step
    return self.lightning_module.test_step(*args, **kwargs)
  File "/home/cvt2distilgpt2/cvt2distilgpt2_mimic_cxr_chen.py", line 472, in test_step
    output_ids = self.generate(self.num_test_beams, batch['encoder_images'])
  File "/home/cvt2distilgpt2/cvt2distilgpt2_mimic_cxr_chen.py", line 390, in generate
    outputs = self.decoder.encoder_decoder.generate(
  File "/home/cvt2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/cvt2/lib/python3.11/site-packages/transformers/generation/utils.py", line 1231, in generate
    self._validate_model_kwargs(model_kwargs.copy())
  File "/home/cvt2/lib/python3.11/site-packages/transformers/generation/utils.py", line 1109, in _validate_model_kwargs
    raise ValueError(
ValueError: The following `model_kwargs` are not used by the model: ['special_token_ids', 'mask_token_id'] (note: typos in the generate arguments will also show up in this list)
Testing DataLoader 0:   0%| | 0/148 [00:01<?, ?it/s]
```

anicolson commented 10 months ago

Okay,

Are you attempting:

dlhpcstarter -t mimic_cxr_chen -c config/test_mimic_cxr_chen_cvt2distilgpt2.yaml --stages_module stages --test

?

And what version of transformers are you running?

I'll look into it now.

LYJ0327 commented 10 months ago

No, I'm testing on IU X-Ray with the command `dlhpcstarter -t iu_x_ray_chen -c config/test_iu_x_ray_chen_cvt2distilgpt2.yaml --stages_module stages --test`, and my transformers version is 4.28.1.

anicolson commented 10 months ago

Okay. I updated a few things and it now works with the latest transformers (4.35.2). Please pull the repo.
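For context: recent transformers releases validate every extra keyword passed to `generate()` via `_validate_model_kwargs()`, so custom keys such as `special_token_ids` and `mask_token_id` either have to be accepted by the model or handled outside the call. A hypothetical before/after of the call in the traceback (argument names other than the two rejected ones are assumed):

```python
# Hypothetical sketch only -- not the repo's actual generate() call.
outputs = self.decoder.encoder_decoder.generate(
    encoder_outputs=encoder_outputs,
    num_beams=num_beams,
    # special_token_ids=special_token_ids,  # rejected by _validate_model_kwargs()
    # mask_token_id=mask_token_id,          # handled outside generate() instead
)
```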

Also, you might be interested in https://huggingface.co/aehrc/cxrmate.

LYJ0327 commented 10 months ago

I'm sorry, but I still can't test it:

```
Traceback (most recent call last):
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 102, in launch
    return function(*args, **kwargs)
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 794, in _test_impl
    results = self._run(model, ckpt_path=ckpt_path)
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1028, in _run_stage
    return self._evaluation_loop.run()
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 141, in run
    return self.on_run_end()
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 253, in on_run_end
    self._on_evaluation_epoch_end()
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 329, in _on_evaluation_epoch_end
    call._call_lightning_module_hook(trainer, hook_name)
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 157, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/cvt2gpt/cvt2distilgpt2/cvt2distilgpt2_mimic_cxr_chen.py", line 498, in on_test_epoch_end
    output = self.test_chexbert_metrics.compute()
  File "/home/cvt2/lib/python3.11/site-packages/torchmetrics/metric.py", line 607, in wrapped_func
    value = _squeeze_if_scalar(compute(*args, **kwargs))
  File "/home/cvt2gpt/cvt2distilgpt2/tools/metrics/chexbert.py", line 65, in compute
    chexbert = CheXbert(
  File "/home/cvt2gpt/cvt2distilgpt2/tools/chexbert.py", line 13, in __init__
    self.tokenizer = BertTokenizer.from_pretrained(
  File "/home/cvt2/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2008, in from_pretrained
    raise EnvironmentError(
OSError: Can't load tokenizer for 'checkpoints/bert-base-uncased'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'checkpoints/bert-base-uncased' is the correct path to a directory containing all relevant files for a BertTokenizer tokenizer.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/cvt2/bin/dlhpcstarter", line 8, in <module>
    sys.exit(main())
  File "/home/cvt2/lib/python3.11/site-packages/dlhpcstarter/main.py", line 54, in main
    submit(args=args, stages_fnc=stages_fnc)
  File "/home/cvt2/lib/python3.11/site-packages/dlhpcstarter/main.py", line 69, in submit
    stages_fnc(args)
  File "/home/cvt2gpt/cvt2distilgpt2/stages.py", line 104, in stages
    trainer.test(model)
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 754, in test
    return call._call_and_handle_interrupt(
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 68, in _call_and_handle_interrupt
    trainer._teardown()
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1012, in _teardown
    self.strategy.teardown()
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/strategies/ddp.py", line 405, in teardown
    super().teardown()
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/strategies/parallel.py", line 127, in teardown
    super().teardown()
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/strategies/strategy.py", line 528, in teardown
    self.lightning_module.cpu()
  File "/home/cvt2/lib/python3.11/site-packages/lightning/fabric/utilities/device_dtype_mixin.py", line 78, in cpu
    self.__update_properties(device=torch.device("cpu"))
  File "/home/cvt2/lib/python3.11/site-packages/lightning/fabric/utilities/device_dtype_mixin.py", line 112, in __update_properties
    self.apply(apply_fn)
  File "/home/cvt2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 896, in apply
    for module in self.children():
  File "/home/cvt2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2284, in children
    for name, module in self.named_children():
  File "/home/cvt2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2304, in named_children
    if module is not None and module not in memo:
  File "/home/cvt2/lib/python3.11/site-packages/torchmetrics/metric.py", line 918, in __hash__
    return hash(tuple(hash_vals))
TypeError: unhashable type: 'list'
Testing DataLoader 0: 100%|██████████| 148/148 [11:35<00:00, 0.21it/s]
```

Also, I wonder why, when I test on the IU X-Ray dataset with `dlhpcstarter -t iu_x_ray_chen -c config/test_iu_x_ray_chen_cvt2distilgpt2.yaml --stages_module stages --test`, the MIMIC file appears in the error report: `File "/home/lhx/lyj/cvt2gpt/cvt2distilgpt2/cvt2distilgpt2_mimic_cxr_chen.py", line 498, in on_test_epoch_end`.

anicolson commented 10 months ago

Hi, this has been fixed. Please pull again. Please let me know if there are any more issues.

The error was there because the IU X-Ray LightningModule inherits from the MIMIC-CXR LightningModule.
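In other words, roughly (the IU X-Ray class name here is assumed, not copied from the repo):

```python
from lightning.pytorch import LightningModule

# cvt2distilgpt2_mimic_cxr_chen.py
class CvT2DistilGPT2MIMICXRChen(LightningModule):
    def on_test_epoch_end(self):
        ...  # the frame that showed up in your traceback

# cvt2distilgpt2_iu_x_ray_chen.py (class name assumed)
class CvT2DistilGPT2IUXRayChen(CvT2DistilGPT2MIMICXRChen):
    ...  # inherits the test hooks, so errors can point into the MIMIC-CXR file
```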

LYJ0327 commented 10 months ago

I'm sorry, but this issue has not been resolved. I have read the error message, and it seems that a tokenizer named 'bert-base-uncased' needs to be loaded from `bert_path`, but this file is not included in your project:

```
Traceback (most recent call last):
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 102, in launch
    return function(*args, **kwargs)
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 794, in _test_impl
    results = self._run(model, ckpt_path=ckpt_path)
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1028, in _run_stage
    return self._evaluation_loop.run()
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 141, in run
    return self.on_run_end()
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 253, in on_run_end
    self._on_evaluation_epoch_end()
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 329, in _on_evaluation_epoch_end
    call._call_lightning_module_hook(trainer, hook_name)
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 157, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/cvt2gpt/cvt2distilgpt2/cvt2distilgpt2_mimic_cxr_chen.py", line 498, in on_test_epoch_end
    output = self.test_chexbert_metrics.compute()
  File "/home/cvt2/lib/python3.11/site-packages/torchmetrics/metric.py", line 607, in wrapped_func
    value = _squeeze_if_scalar(compute(*args, **kwargs))
  File "/home/lyj/cvt2gpt/cvt2distilgpt2/tools/metrics/chexbert.py", line 65, in compute
    chexbert = CheXbert(
  File "/home/lyj/cvt2gpt/cvt2distilgpt2/tools/chexbert.py", line 16, in __init__
    self.tokenizer = BertTokenizer.from_pretrained(bert_path)
  File "/home/lyj/cvt2/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2008, in from_pretrained
    raise EnvironmentError(
OSError: Can't load tokenizer for 'bert-base-uncased'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'bert-base-uncased' is the correct path to a directory containing all relevant files for a BertTokenizer tokenizer.
```

And there is another issue here:

```
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/cvt2/bin/dlhpcstarter", line 8, in <module>
    sys.exit(main())
  File "/home/cvt2/lib/python3.11/site-packages/dlhpcstarter/main.py", line 54, in main
    submit(args=args, stages_fnc=stages_fnc)
  File "/home/cvt2/lib/python3.11/site-packages/dlhpcstarter/main.py", line 69, in submit
    stages_fnc(args)
  File "/home/cvt2gpt/cvt2distilgpt2/stages.py", line 104, in stages
    trainer.test(model)
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 754, in test
    return call._call_and_handle_interrupt(
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 68, in _call_and_handle_interrupt
    trainer._teardown()
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1012, in _teardown
    self.strategy.teardown()
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/strategies/ddp.py", line 405, in teardown
    super().teardown()
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/strategies/parallel.py", line 127, in teardown
    super().teardown()
  File "/home/cvt2/lib/python3.11/site-packages/lightning/pytorch/strategies/strategy.py", line 528, in teardown
    self.lightning_module.cpu()
  File "/home/cvt2/lib/python3.11/site-packages/lightning/fabric/utilities/device_dtype_mixin.py", line 78, in cpu
    self.__update_properties(device=torch.device("cpu"))
  File "/home/cvt2/lib/python3.11/site-packages/lightning/fabric/utilities/device_dtype_mixin.py", line 112, in __update_properties
    self.apply(apply_fn)
  File "/home/cvt2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 896, in apply
    for module in self.children():
  File "/home/cvt2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2284, in children
    for name, module in self.named_children():
  File "/home/cvt2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2304, in named_children
    if module is not None and module not in memo:
  File "/home/cvt2/lib/python3.11/site-packages/torchmetrics/metric.py", line 918, in __hash__
    return hash(tuple(hash_vals))
TypeError: unhashable type: 'list'
Testing DataLoader 0: 100%|██████████| 148/148 [02:53<00:00, 0.85it/s]
```

anicolson commented 10 months ago

Oh no.

Can you please check if you can run the following in your environment:

BertTokenizer.from_pretrained('bert-base-uncased')

If you cannot, you likely have issues accessing Hugging Face hub/have a firewall preventing this.
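That is, check whether the following runs in a Python shell (it downloads from the Hub on first use):

```python
# Minimal connectivity check for the tokenizer that the CheXbert metric loads.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer)  # raises the same OSError if the Hub is unreachable and nothing is cached
```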

LYJ0327 commented 10 months ago

Hi, I'm so sorry for disturbing you again... but whether I test or train, it always reports errors. I want to know if there's a problem with dlhpcstarter:

```
Traceback (most recent call last):
  File "/home/cvt2/bin/dlhpcstarter", line 8, in <module>
    sys.exit(main())
  File "/home/cvt2/lib/python3.11/site-packages/dlhpcstarter/main.py", line 41, in main
    stages_fnc = importer(definition=args.stages_definition, module=args.stages_module)
  File "/home/cvt2/lib/python3.11/site-packages/dlhpcstarter/utils.py", line 47, in importer
    assert os.path.isfile(path), f'{path} does not exist. The target definition and modules: {definition} & {module}.'
AssertionError: stages.py does not exist. The target definition and modules: stages & stages.
```

anicolson commented 10 months ago

Hi,

What is your current working directory and where is the absolute path of stages.py?
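For reference, the assertion in the traceback checks `os.path.isfile('stages.py')` against the current working directory, so the command most likely needs to be run from the repository root, i.e. something like:

```
cd /path/to/cvt2distilgpt2   # the directory that contains stages.py
dlhpcstarter -t iu_x_ray_chen -c config/test_iu_x_ray_chen_cvt2distilgpt2.yaml --stages_module stages --test
```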

LYJ0327 commented 10 months ago

Oh, I made a stupid mistake. I understand what you mean and have already solved this problem. Thank you for your patience!

anicolson commented 10 months ago

That's okay 😄

LYJ0327 commented 9 months ago

I'm very sorry to bother you, but when I experiment with the MIMIC dataset, it reports the following error:

```
(cvt2) (base) lhx@test:~/lyj/cvt2gpt/cvt2distilgpt2$ dlhpcstarter -t mimic_cxr_chen -c config/test_mimic_cxr_chen_cvt2distilgpt2.yaml --stages_module stages --test
args: {'task': 'mimic_cxr_chen', 'config': 'config/test_mimic_cxr_chen_cvt2distilgpt2', 'exp_dir': 'experiments', 'work_dir': '/home/lhx/lyj/cvt2gpt/cvt2distilgpt2', 'dataset_dir': '/data/lhx/lyj/data', 'ckpt_zoo_dir': 'checkpoints', 'definition': 'CvT2DistilGPT2', 'module': 'cvt2distilgpt2_mimic_cxr_chen', 'stages_definition': 'stages', 'stages_module': 'stages', 'train': None, 'trial': 0, 'resume_last': True, 'resume_epoch': None, 'resume_ckpt_path': None, 'warm_start_ckpt_path': None, 'monitor': 'val_chen_cider', 'monitor_mode': 'max', 'test': True, 'test_epoch': None, 'test_ckpt_path': 'checkpoints/mimic_cxr_jpg_chen/cvt_21_to_distilgpt2/epoch=8-val_chen_cider=0.425092.ckpt', 'fast_dev_run': None, 'num_workers': 5, 'devices': 1, 'num_nodes': 1, 'memory': None, 'time_limit': None, 'submit': None, 'qos': None, 'begin': None, 'slurm_cmd_path': None, 'email': None, 'cuda_visible_devices': None, 'venv_path': None, 'config_file_name': 'config/test_mimic_cxr_chen_cvt2distilgpt2.yaml', 'config_name': 'test_mimic_cxr_chen_cvt2distilgpt2', 'config_dir': '/home/lhx/lyj/cvt2gpt/cvt2distilgpt2/config', 'config_full_path': '/home/lhx/lyj/cvt2gpt/cvt2distilgpt2/config/test_mimic_cxr_chen_cvt2distilgpt2.yaml', 'strategy': 'ddp_find_unused_parameters_true', 'encoder_lr': 5e-05, 'decoder_lr': 0.0005, 'mbatch_size': 4, 'every_n_epochs': 1, 'precision': 16, 'decoder_max_len': 128, 'num_test_beams': 4, 'enable_progress_bar': True, 'weights_summary': 'full', 'early_stopping': True, 'patience': 10, 'min_delta': 0.0001, 'deterministic': False, 'exp_dir_trial': 'experiments/mimic_cxr_chen/test_mimic_cxr_chen_cvt2distilgpt2/trial_0'}
Seed set to 0
Traceback (most recent call last):
  File "/home/lhx/lyj/cvt2/bin/dlhpcstarter", line 8, in <module>
    sys.exit(main())
  File "/home/lhx/lyj/cvt2/lib/python3.11/site-packages/dlhpcstarter/main.py", line 54, in main
    submit(args=args, stages_fnc=stages_fnc)
  File "/home/lhx/lyj/cvt2/lib/python3.11/site-packages/dlhpcstarter/main.py", line 69, in submit
    stages_fnc(args)
  File "/home/lhx/lyj/cvt2gpt/cvt2distilgpt2/stages.py", line 29, in stages
    TaskModel = importer(definition=args.definition, module=args.module)
  File "/home/lhx/lyj/cvt2/lib/python3.11/site-packages/dlhpcstarter/utils.py", line 51, in importer
    return getattr(module, definition)
AttributeError: module 'cvt2distilgpt2_mimic_cxr_chen' has no attribute 'CvT2DistilGPT2'
```

Do you know the reason for this?

anicolson commented 9 months ago

Hi,

The class name in 'cvt2distilgpt2_mimic_cxr_chen' is 'CvT2DistilGPT2MIMICXRChen' rather than 'CvT2DistilGPT2'.

I think the error is here: https://github.com/aehrc/cvt2distilgpt2/blob/main/config/test_mimic_cxr_chen_cvt2distilgpt2.yaml, I will update this, sorry about that.
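For anyone editing the config by hand before the fix lands, the change should only be the `definition` key (a sketch; the rest of the file is unchanged):

```yaml
# config/test_mimic_cxr_chen_cvt2distilgpt2.yaml (relevant keys only)
definition: CvT2DistilGPT2MIMICXRChen  # was: CvT2DistilGPT2
module: cvt2distilgpt2_mimic_cxr_chen
```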

LYJ0327 commented 9 months ago

Hello, thank you for your previous help, and I am very sorry to bother you again. I used the checkpoint you provided and BLEU-4 reached 0.189, which is great! But when I use your training command `dlhpcstarter -t mimic_cxr -c config/train_mimic_cxr_chen_cvt2distilgpt2.yaml --stages_module stages --train --test` to train the network from scratch, BLEU-4 is only 0.046 at epoch 30, 0.0547 at epoch 50, and 0.0453 at epoch 100. The difference between them is very large, about 10 percentage points. Is it because the card I am using is a 4090, the number of epochs I have set, or some other issue? I would be very happy if you could reply, and thank you for your patience.

anicolson commented 9 months ago

Hi,

That is concerning. I am training with the same config (dlhpcstarter -t mimic_cxr -c config/train_mimic_cxr_chen_cvt2distilgpt2.yaml --stages_module stages --train --test) to see if the issue occurs for me. After two epochs, the validation scores are as follows:

```
epoch step train_loss_step val_ce_precision_example val_chen_rouge val_chen_cider val_ce_f1_macro val_ce_num_examples val_ce_recall_macro val_chen_bleu_2 val_chen_bleu_1 val_ce_recall_example val_chen_bleu_3 val_ce_precision_micro val_ce_f1_micro val_ce_recall_micro val_chen_bleu_4 val_ce_f1_example val_chen_num_examples val_ce_precision_macro train_loss_epoch
0 16924 0.40693270735524256 0.31695191260859684 0.2920809223790337 0.20219941561024884 2130.0 0.20641742878825722 0.22258831560611725 0.3392559885978699 0.3729107981220657 0.15715651214122772 0.454778469425119 0.414069011501917 0.38004895960832313 0.1176675483584404 0.3744735077129443 2130.0 0.2957008008496724
1 33849 0.42745696400625977 0.32013704564588763 0.3445314092814959 0.24549816758435877 2130.0 0.24682623539299464 0.23871149122714996 0.3632817566394806 0.3982863849765258 0.16974632441997528 0.46675805346127486 0.44034917555771097 0.4167686658506732 0.12679457664489746 0.396732617929801 2130.0 0.3056663869771598
```

Which seems normal. I will keep training it and update you with the results.

In train_iu_x_ray_chen_cvt2distilgpt2.yaml, I had set early stopping, so there is no need to set a maximum number of epochs:

```yaml
early_stopping: True
patience: 10
min_delta: 1e-4
```
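These values presumably map onto Lightning's `EarlyStopping` callback, together with the monitored metric from the task config. A sketch, assuming the `monitor`/`monitor_mode` keys shown in the config dump earlier in this thread:

```python
from lightning.pytorch.callbacks import EarlyStopping

early_stopping = EarlyStopping(
    monitor="val_chen_cider",  # 'monitor' in the config
    mode="max",                # 'monitor_mode' in the config
    patience=10,
    min_delta=1e-4,
)
```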

It did not require that many epochs; for example, if you look here: https://data.csiro.au/collection/csiro%3A53728v5, epoch 8 was the best epoch.

I doubt the card you are using is causing this much of a difference.

Could you please comment with the output of `pip list`? I am trying to think of what other issues there could be in the meantime.

anicolson commented 9 months ago

Here are the validation scores after 9 epochs:

```
epoch step train_loss_step val_ce_precision_example val_chen_rouge val_chen_cider val_ce_f1_macro val_ce_num_examples val_ce_recall_macro val_chen_bleu_2 val_chen_bleu_1 val_ce_recall_example val_chen_bleu_3 val_ce_precision_micro val_ce_f1_micro val_ce_recall_micro val_chen_bleu_4 val_ce_f1_example val_chen_num_examples val_ce_precision_macro train_loss_epoch
0 16924 0.40693270735524256 0.31695191260859684 0.2920809223790337 0.20219941561024884 2130.0 0.20641742878825722 0.22258831560611725 0.3392559885978699 0.3729107981220657 0.15715651214122772 0.454778469425119 0.414069011501917 0.38004895960832313 0.1176675483584404 0.3744735077129443 2130.0 0.2957008008496724
1 33849 0.42745696400625977 0.32013704564588763 0.3445314092814959 0.24549816758435877 2130.0 0.24682623539299464 0.23871149122714996 0.3632817566394806 0.3982863849765258 0.16974632441997528 0.46675805346127486 0.44034917555771097 0.4167686658506732 0.12679457664489746 0.396732617929801 2130.0 0.3056663869771598
2 50774 0.4493818466353678 0.32215001117525616 0.3491774795545655 0.24494992327325998 2130.0 0.2451831928520513 0.23824720084667206 0.360444575548172 0.40607198748043816 0.1692574918270111 0.4796195652173913 0.45460399227301995 0.4320685434516524 0.12665951251983643 0.4104650122959982 2130.0 0.33146890807093127
3 67699 0.419679186228482 0.31870183849585365 0.3268155641993095 0.24201441861553305 2130.0 0.23874382163185287 0.23246432840824127 0.3520020842552185 0.38953834115805946 0.16601967811584473 0.5027644673792849 0.4561110182243772 0.4173806609547124 0.12430074065923691 0.3892860868917207 2130.0 0.30884301315553314
4 84624 0.421471048513302 0.320114059908859 0.3533501690062039 0.26324989177251074 2130.0 0.25269686966174104 0.23336923122406006 0.3527657985687256 0.3964397496087637 0.16647367179393768 0.48460176991150444 0.44936812735926474 0.41891064871481026 0.12536273896694183 0.3942397482538328 2130.0 0.3413789053425755
5 101549 0.44420523138832996 0.32870873072950774 0.3749406450696774 0.28741378950783564 2130.0 0.28123585476076024 0.25293877720832825 0.3823215067386627 0.42827856025039124 0.18067187070846558 0.5026737967914439 0.4805111821086262 0.4602203182374541 0.13547103106975555 0.4188216504413687 2130.0 0.34597945551478443
6 118474 0.4521987480438185 0.3252632081509904 0.3583214345964929 0.27354281517517703 2130.0 0.2692128588238088 0.24576494097709656 0.3704437017440796 0.4243348982785602 0.17543305456638336 0.4955902306648575 0.4700772200772201 0.447062423500612 0.1316041350364685 0.4206494522691706 2130.0 0.3183531138289372
7 135399 0.43258215962441315 0.3088185459463964 0.3243224346348287 0.27673346937544413 2130.0 0.2682169390229497 0.22527024149894714 0.34522631764411926 0.41045383411580594 0.15802665054798126 0.4795613160518445 0.45977377728214114 0.4415544675642595 0.1160745918750763 0.40617706237424545 2130.0 0.35257348143859435
8 152324 0.43335680751173705 0.3199401416458053 0.3724532987339018 0.2943421836128386 2130.0 0.2802859326143582 0.2422061413526535 0.37223392724990845 0.4196009389671361 0.17064839601516724 0.4915364583333333 0.47634069400630913 0.4620563035495716 0.12724152207374573 0.40869103509948584 2130.0 0.36754727023553097
```

@LYJ0327, how do these validation scores compare to yours?

LYJ0327 commented 8 months ago

I'm very sorry for my late reply. I did both training and testing of your code on a 4090 GPU, and here are my results, which still differ from yours. [screenshot] P.S. I used the checkpoint file you provided for my test experiments on both datasets.

anicolson commented 8 months ago

Hi LYJ0327,

Could you please share your validation metrics? E.g., experiments/mimic_cxr/train_mimic_cxr_chen_cvt2distilgpt2/trial_0/lightning_logs/version_6/metrics.csv.

LYJ0327 commented 8 months ago

[screenshot] I'd love to share the relevant metrics with you, but I don't seem to have generated the relevant file in my folder.

anicolson commented 8 months ago

Try version_0? Usually one version directory is created for the .csv files and the following one is created for TensorBoard.

LYJ0327 commented 8 months ago

Ok, I downloaded the two csv files as shown below: mimic_metrics.csv and iu_xray_metrics.csv. Secondly, I just tested the checkpoint file you gave me on an A100, and the result is shown in the picture. [screenshot] However, when I tried to train your code on the A100, it reported the following error: [screenshot] Do you know the cause of this? I suspect it is because my pycocotools is not installed at the version specified in your requirements; when I try to install pycocotools==2.0.4, the system reports the following error: [screenshot] So I have installed pycocotools version 2.0.7. Do you know if this is the reason why I can't run the training code, or is it something else?

anicolson commented 8 months ago

Hi LYJ0327,

Your validation and test scores for MIMIC-CXR seem as expected; the differences are caused by the seed, etc. There is high variability in the scores over many training runs, e.g., see the variance in the BLEU-4 scores in Figure 7 for the training runs of each model: https://doi.org/10.1016/j.artmed.2023.102633

Here are the results from the new model I trained over the past week (from a few comments ago, with epoch 8 as the best epoch), which are similar to your test results from mimic_metrics.csv:

```
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃        Test metric        ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│    test_ce_f1_example     │    0.3509851720193866     │
│     test_ce_f1_macro      │    0.25575093165688123    │
│     test_ce_f1_micro      │    0.42572590826202833    │
│   test_ce_num_examples    │          3858.0           │
│ test_ce_precision_example │    0.41046742699153277    │
│  test_ce_precision_macro  │    0.33392767167746434    │
│  test_ce_precision_micro  │     0.484669434685404     │
│  test_ce_recall_example   │    0.3443025006788615     │
│   test_ce_recall_macro    │    0.24629913305384998    │
│   test_ce_recall_micro    │    0.3795647823911956     │
│     test_chen_bleu_1      │    0.3907630145549774     │
│     test_chen_bleu_2      │    0.24406647682189941    │
│     test_chen_bleu_3      │    0.1676843911409378     │
│     test_chen_bleu_4      │    0.12317755818367004    │
│     test_chen_cider       │    0.34302342164748734    │
│     test_chen_meteor      │    0.15075385570526123    │
│  test_chen_num_examples   │          3858.0           │
│     test_chen_rouge       │    0.28515253454563666    │
└───────────────────────────┴───────────────────────────┘
```

pycocotools requires gcc for installation.

LYJ0327 commented 8 months ago

Ok, thanks, I got the results of your experiment. But my other question is: I already have gcc, so why can I still not install pycocotools version 2.0.7? [screenshot]

anicolson commented 8 months ago

Hi LYJ0327,

Unfortunately, pycocotools is not my package, maybe the authors of this package will be better able to help you with this.

Sorry I couldn't help you better with this.

LYJ0327 commented 8 months ago

Hi, thank you very, very much for your patience and help. You're so kind! I've been able to run through the whole code, but I'm having some problems understanding it. Could you please give me contact information for WeChat or any other software? I would like to ask you some details about your code implementation. I would be very grateful if you could patiently help me!

anicolson commented 8 months ago

Hi LYJ0327,

Which part is hard to understand?

LYJ0327 commented 8 months ago

Hi, thanks for the reply. I am having trouble understanding how the network encodes tokens as embeddings in the text decoder. And what part of the code should I look at if I want to feed embeddings directly to the decoder?

anicolson commented 8 months ago

Hi,

A Hugging Face tokenizer is used for tokenisation: https://huggingface.co/learn/nlp-course/en/chapter2/4

The Hugging Face decoder converts the tokens into embeddings using the torch.nn.Embedding class: https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html

Embeddings can be given as input to a Hugging Face decoder via: https://huggingface.co/docs/transformers/en/model_doc/bert#transformers.BertLMHeadModel.forward.inputs_embeds
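Putting those three links together, a minimal standalone sketch of the token-to-embedding path (using distilgpt2 directly, not the repo's wrapper classes):

```python
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = GPT2LMHeadModel.from_pretrained("distilgpt2")

input_ids = tokenizer("the lungs are clear", return_tensors="pt").input_ids

# The decoder's embedding table is a torch.nn.Embedding that maps token ids to vectors.
inputs_embeds = model.get_input_embeddings()(input_ids)

# Embeddings can be passed to the decoder directly, bypassing the token-id lookup.
with torch.no_grad():
    outputs = model(inputs_embeds=inputs_embeds)
print(outputs.logits.shape)  # torch.Size([1, seq_len, vocab_size])
```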

Hope that helps

LYJ0327 commented 8 months ago

Thank you for the help you provided. I know you used these functions to do the corresponding work, but I would rather like to know which parts of the source code you provided use these Hugging Face tools. Secondly, may I ask how the CE metrics in your paper are calculated? Could you provide the corresponding code for calculating the CE metrics?

thanks!