NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech).
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0
11.45k stars · 2.39k forks

RuntimeError: Error(s) in loading state_dict for MegaMolBARTModel after ANY fine tuning #9685

Closed · nzsimonc closed this issue 2 weeks ago

nzsimonc commented 1 month ago

**Describe the bug**

I can successfully run my inference code on the default megamolbart.nemo, but as soon as I run any kind of fine-tuning on it, I get `RuntimeError: Error(s) in loading state_dict for MegaMolBARTModel`. I've attached code, config, and output for both the fine-tune and infer stages.

My desired outcome is to use the fine-tuning process to add some of my own data, then call the inference code to get the SMILES string and the prediction.

**Steps/Code to reproduce bug**

1. Fine-tune (code, config, and output). I get the same issue even if I take `trainer.fit(model)` out of the code. My fine-tune script (company1.6_finetune_donothing.py.txt) just loads the default megamolbart.nemo using finetune_config.yaml.txt (it makes no difference whether `restore_from_path` is filled in or not). The result (company1.6_finetune_donothing_RESULT.txt) looks fine.
   Attachments: company1.6_finetune_donothing.py.txt, finetune_config.yaml.txt, company1.6_finetune_donothing_RESULT.txt

2. Infer (code, config, and output). I then run my infer script (company1.6_infer.py.txt) with the default infer.yaml (infer.yaml.txt), changing only `restore_from_path` to point to the file created in Step 1. NOTE: as shown in the 'Expected behavior' section, everything works as expected if I use the default megamolbart.nemo file.
   Attachments: company1.6_infer.py.txt, infer.yaml.txt, company1.6_infer_RESULT.txt

TL;DR: this is my error:

```
[NeMo I 2024-07-11 01:57:35 regex_tokenizer:254] Loading regex from file = /workspace/bionemo/tokenizers/molecule/megamolbart/vocab/megamolbart.model
[NeMo I 2024-07-11 01:57:35 megatron_base_model:315] Padded vocab_size: 640, original vocab_size: 523, dummy tokens: 117.
[NeMo W 2024-07-11 01:57:35 megatron_lm_encoder_decoder_model:240] Could not find encoder or decoder in config. This is probably because of restoring an old checkpoint. Copying shared model configs to encoder and decoder configs.
[NeMo W 2024-07-11 01:57:35 megatron_lm_encoder_decoder_model:206] bias_gelu_fusion is deprecated. Please use bias_activation_fusion instead.
[NeMo W 2024-07-11 01:57:35 megatron_lm_encoder_decoder_model:206] bias_gelu_fusion is deprecated. Please use bias_activation_fusion instead.
Traceback (most recent call last):
  File "/workspace/bionemo/examples/molecule/megamolbart/company1.6_infer.py", line 59, in <module>
    inferer = load_model_for_inference(cfg, interactive=True)
  File "/workspace/bionemo/bionemo/triton/utils.py", line 238, in load_model_for_inference
    model = infer_class(cfg, interactive=interactive, **kwargs)
  File "/workspace/bionemo/bionemo/model/molecule/infer.py", line 40, in __init__
    super().__init__(
  File "/workspace/bionemo/bionemo/model/core/infer.py", line 468, in __init__
    super().__init__(
  File "/workspace/bionemo/bionemo/model/core/infer.py", line 146, in __init__
    self.model = self.load_model(cfg, model=model, restore_path=restore_path, strict=strict_restore_from_path)
  File "/workspace/bionemo/bionemo/model/core/infer.py", line 206, in load_model
    model = restore_model(
  File "/workspace/bionemo/bionemo/model/utils.py", line 363, in restore_model
    model = model_cls.restore_from(
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/nlp/models/nlp_model.py", line 465, in restore_from
    return super().restore_from(
  File "/usr/local/lib/python3.10/dist-packages/nemo/core/classes/modelPT.py", line 442, in restore_from
    instance = cls._save_restore_connector.restore_from(
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/nlp/parts/nlp_overrides.py", line 751, in restore_from
    super().load_instance_with_state_dict(instance, state_dict, strict)
  File "/usr/local/lib/python3.10/dist-packages/nemo/core/connectors/save_restore_connector.py", line 203, in load_instance_with_state_dict
    instance.load_state_dict(state_dict, strict=strict)
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/nlp/models/nlp_model.py", line 447, in load_state_dict
    results = super(NLPModel, self).load_state_dict(state_dict, strict=strict)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for MegaMolBARTModel:
	Missing key(s) in state_dict: "enc_dec_model.encoder_embedding.word_embeddings.weight", "enc_dec_model.encoder_embedding.position_embeddings.weight", "enc_dec_model.decoder_embedding.word_embeddings.weight", "enc_dec_model.decoder_embedding.position_embeddings.weight", "enc_dec_model.enc_dec_model.encoder.model.layers.0.input_layernorm.weight", "enc_dec_model.enc_dec_model.encoder.model.layers.0.input_layernorm.bias", "enc_dec_model.enc_dec_model.encoder.model.layers.0.self_attention.query_key_value.weight", "enc_dec_model.enc_dec_model.encoder.model.layers.0.self_attention.query_key_value.bias", "enc_dec_model.enc_dec_model.encoder.model.layers.0.self_attention.dense.weight",
```
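For context, the `Missing key(s)` failure at the bottom of the trace comes from PyTorch's strict state-dict matching: `load_state_dict(strict=True)` raises whenever the checkpoint's keys don't line up with the module's parameter names. Here is a minimal, NeMo-free sketch of the same failure mode; the two toy module layouts are made up purely to force a key mismatch:

```python
import torch
import torch.nn as nn

# Two toy models whose parameter names don't line up, mimicking a
# checkpoint saved under one module layout and loaded under another.
class SavedModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(4, 4)  # keys: encoder.weight, encoder.bias

class LoadedModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc_dec_model = nn.Linear(4, 4)  # keys: enc_dec_model.weight, enc_dec_model.bias

state_dict = SavedModel().state_dict()
model = LoadedModel()

# strict=True (the default) raises RuntimeError, listing missing/unexpected keys,
# exactly like the MegaMolBARTModel error above.
err = ""
try:
    model.load_state_dict(state_dict)
except RuntimeError as e:
    err = str(e)
print("strict load failed:", "Missing key(s)" in err)  # strict load failed: True

# strict=False returns the mismatches instead of raising, which is handy
# for diagnosing which keys the checkpoint actually contains.
result = model.load_state_dict(state_dict, strict=False)
print(sorted(result.missing_keys))     # ['enc_dec_model.bias', 'enc_dec_model.weight']
print(sorted(result.unexpected_keys))  # ['encoder.bias', 'encoder.weight']
```

A `strict=False` load only masks the symptom here (the weights would stay randomly initialized), but the returned `missing_keys`/`unexpected_keys` pair usually reveals whether the fine-tuned checkpoint's keys gained or lost a prefix relative to what the model expects.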

**Expected behavior**

Running the inference code against the default downloadable megamolbart.nemo works fine, as can be seen here: company1.6_infer_RESULT_DEFAULT_NEMO.txt. In other words, we can get simple things like `Reconstructed SMILES:` out of the system.
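Since the default .nemo loads and the fine-tuned one does not, one way to narrow the mismatch down is to diff the tensor keys stored in the two checkpoints. A .nemo file is a tar archive whose weights typically live in a member such as `model_weights.ckpt`; the member name and layout below are assumptions about this format, so confirm the real name with `tar -tf your.nemo` first. The sketch builds two tiny stand-in archives (with an illustrative, made-up `model.` prefix difference) only so it runs end to end:

```python
import os
import tarfile
import tempfile

import torch

def checkpoint_keys(nemo_path, member="model_weights.ckpt"):
    """Extract the state-dict key set from a .nemo-style tar archive.

    Assumes the archive stores weights under `member`; run `tar -tf`
    on the file first to confirm the actual member name.
    """
    with tarfile.open(nemo_path, "r:") as tar, tempfile.TemporaryDirectory() as tmp:
        tar.extract(member, path=tmp)
        state = torch.load(os.path.join(tmp, member), map_location="cpu")
        return set(state.keys())

def make_fake_nemo(path, state_dict):
    """Build a minimal stand-in for a .nemo archive (demonstration only)."""
    with tempfile.TemporaryDirectory() as tmp:
        ckpt = os.path.join(tmp, "model_weights.ckpt")
        torch.save(state_dict, ckpt)
        with tarfile.open(path, "w:") as tar:
            tar.add(ckpt, arcname="model_weights.ckpt")

tmpdir = tempfile.mkdtemp()
base, tuned = os.path.join(tmpdir, "base.nemo"), os.path.join(tmpdir, "tuned.nemo")
make_fake_nemo(base, {"enc_dec_model.encoder_embedding.word_embeddings.weight": torch.zeros(2)})
# The extra "model." prefix here is purely illustrative, not a claim about the real bug.
make_fake_nemo(tuned, {"model.enc_dec_model.encoder_embedding.word_embeddings.weight": torch.zeros(2)})

base_keys, tuned_keys = checkpoint_keys(base), checkpoint_keys(tuned)
print("only in base: ", sorted(base_keys - tuned_keys))
print("only in tuned:", sorted(tuned_keys - base_keys))
```

Pointing `checkpoint_keys` at the working default megamolbart.nemo and at the fine-tuned output should show whether the fine-tuning run renamed, prefixed, or dropped the `enc_dec_model.*` keys the loader expects.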

**Environment overview (please complete the following information)**

**Environment details**

NVIDIA Docker image is used.

**Additional context**

Azure T4 GPU

nzsimonc commented 1 month ago

I have also run `pytest test_megamolbart_triton.py`, pointing it at my fine-tuned .nemo file, and it gives me the same error:

```
======================================================================= short test summary info =======================================================================
ERROR test_megamolbart_triton.py::test_seq_to_embedding_triton - RuntimeError: Error(s) in loading state_dict for MegaMolBARTModel:
ERROR test_megamolbart_triton.py::test_seq_to_hidden_triton - RuntimeError: Error(s) in loading state_dict for MegaMolBARTModel:
ERROR test_megamolbart_triton.py::test_hidden_to_seqs_triton - RuntimeError: Error(s) in loading state_dict for MegaMolBARTModel:
ERROR test_megamolbart_triton.py::test_samplings_triton - RuntimeError: Error(s) in loading state_dict for MegaMolBARTModel:
ERROR test_megamolbart_triton.py::test_seq_to_embedding_direct - RuntimeError: Error(s) in loading state_dict for MegaMolBARTModel:
ERROR test_megamolbart_triton.py::test_seq_to_hidden_direct - RuntimeError: Error(s) in loading state_dict for MegaMolBARTModel:
ERROR test_megamolbart_triton.py::test_hidden_to_seqs_direct - RuntimeError: Error(s) in loading state_dict for MegaMolBARTModel:
ERROR test_megamolbart_triton.py::test_samplings_direct - RuntimeError: Error(s) in loading state_dict for MegaMolBARTModel:
========================================================================= 8 errors in 27.01s
```

github-actions[bot] commented 3 weeks ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 2 weeks ago

This issue was closed because it has been inactive for 7 days since being marked as stale.