Closed Lzcstan closed 3 months ago
Hi @Lzcstan,
I guess this might be due to the CUDA and PyTorch versions. My CUDA version is 11 (with PyTorch 1.9). Can you try downgrading them?
But I guess the NVIDIA H800 cannot use CUDA < 12. I tried `cuda==11.3` and `pytorch==1.10.1` but got the following error:
[2024-01-10 03:20:35,380] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/oem/anaconda3/envs/mol_stm/lib/python3.7/site-packages/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
arguments Namespace(CL_neg_samples=1, JK='last', SSL_emb_dim=256, SSL_loss='EBM_NCE', T=0.1, batch_size=32, dataset='PubChemSTM', dataspace_path='../data', decay=0, device=0, dropout_ratio=0.5, epochs=2, gnn_emb_dim=300,
gnn_type='gin', graph_pooling='mean', max_seq_len=512, megamolbart_input_dir='../data/pretrained_MegaMolBART/checkpoints', mol_lr=1e-05, mol_lr_scale=1, molecule_type='SMILES', normalize=True, num_layer=5, num_workers=8,
output_model_dir=None, pretrain_gnn_mode='GraphMVP_G', representation_frozen=False, seed=42, text_lr=0.0001, text_lr_scale=1, text_type='SciBERT', verbose=True, vocab_path='../MoleculeSTM/bart_vocab.txt')
Some weights of the model checkpoint at allenai/scibert_scivocab_uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
/home/oem/anaconda3/envs/mol_stm/lib/python3.7/site-packages/torch/cuda/__init__.py:143: UserWarning:
NVIDIA H800 with CUDA capability sm_90 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 sm_80 sm_86 compute_37.
If you want to use the NVIDIA H800 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
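The warning above comes from PyTorch comparing the GPU's compute capability against the architectures the installed wheel was compiled for; on a real machine these come from `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()`. The pure-Python sketch below only mimics that check, using the arch list printed in the warning:

```python
# Sketch of the compatibility check behind the warning above.
# On a real system the inputs would come from:
#   torch.cuda.get_device_capability()  -> e.g. (9, 0) for an H800
#   torch.cuda.get_arch_list()          -> archs the wheel ships kernels for
def gpu_supported(capability, arch_list):
    """Return True if the installed build has kernels for this capability."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list

# Arch list copied from the warning message above (an sm_86-era build):
arch_list = ["sm_37", "sm_50", "sm_60", "sm_61",
             "sm_70", "sm_75", "sm_80", "sm_86"]

print(gpu_supported((9, 0), arch_list))  # H800 is sm_90 -> False
print(gpu_supported((8, 0), arch_list))  # A100 is sm_80 -> True
```

Since sm_90 is absent from that list, any PyTorch build older than the cu12 wheels will refuse to run kernels on the H800, which supports the guess above.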
Here is the support for my guess.
Hi @Lzcstan,
I am afraid that I don't have an H800 to check the code, but according to the exception messages you listed above, this is an incompatibility between the H800 (CUDA and PyTorch) and Megatron.
I checked the Megatron GitHub repo, and according to this link, running Megatron with an H800 should work (which is good to know).
What I am not sure about is how compatible Megatron-LM-v1.1.5 is with the H800. Can you try the following commands?
cd MolBART/megatron_molbart/Megatron-LM-v1.1.5-3D_parallelism
pip install .
- Also, you are currently using `pip install megatron-lm`, and another piece of information that might be helpful is to print out `state_dict['cuda_rng_state']` before the line `torch.cuda.set_rng_state(state_dict['cuda_rng_state'])` in the source code.
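The suggested diagnostic can be sketched as below. In the real checkpoint-loading code, the saved blob is `state_dict['cuda_rng_state']` and the current one comes from `torch.cuda.get_rng_state()`; the byte lengths used here are made-up stand-ins, not real values:

```python
# Sketch of the RNG-state check behind the suggestion above.
# torch.cuda.set_rng_state() expects a byte blob whose size and layout
# match what the current device produces, so a checkpoint saved on a
# different GPU generation can fail at that line.  Real blobs come from:
#   state_dict['cuda_rng_state']   (saved in the checkpoint)
#   torch.cuda.get_rng_state()     (current device)
def rng_states_compatible(saved_state: bytes, current_state: bytes) -> bool:
    """Print both sizes, then report whether the shapes agree."""
    print(f"saved RNG state:   {len(saved_state)} bytes")
    print(f"current RNG state: {len(current_state)} bytes")
    return len(saved_state) == len(current_state)

# Made-up lengths, purely for illustration:
print(rng_states_compatible(bytes(16), bytes(16)))  # True
print(rng_states_compatible(bytes(16), bytes(64)))  # False
```

Printing the two states side by side is exactly what reveals the shape mismatch reported in the closing comment below.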
Hi, I checked the shape of the CUDA RNG state and found that the H800 cannot fit the checkpoint. Switching the GPU solved my problem, so I will close this issue. Thank you for your kind reply :-)
Hello! Thank you for your excellent work! I wanted to try the scripts you provided and downloaded the relevant checkpoints following your tutorial. But when I ran
python pretrain.py --verbose --batch_size=32 --molecule_type=SMILES --epochs=2
to start pre-training, the following error occurred. How should I fix it? I'm using a server with an NVIDIA H800, which has `cuda==12.1` and `pytorch==2.1.2`.
Thanks again 🙏