Training and Running Inference Problems

amelie-iska commented 5 months ago

Assuming I have downloaded the box folder with the data, tokenizers, and trained_model folders, could you please provide an example of how to run evaluation on box/data/qed/chembl_selfies_eval.txt and how to run inference on a single selfies example properly? I am finding it quite difficult to understand the desired folder and file structure for the pretrained checkpoints when running the evaluation script.

For example:

(rt) asr50@LZ16-ASR50-DSA:/mnt/e/users/asr50/vs_code_projects/small_molecules/regression-transformer$ python scripts/ \
--output_dir ./box/trained_models/qed \
--eval_file ./box/data/qed/chembl_selfies_eval.txt \
--eval_accumulation_steps 2 \
--param_path configs/qed_eval.json
WARNING:terminator.utils:No checkpoints found that contain  in ./box/trained_models/qed.       
WARNING:terminator.utils:No checkpoints found that contain checkpoint in ./box/trained_models/qed.

Additionally, I am finding training runs quite difficult as well. A working training run would probably help with understanding how the checkpoint folders and files are supposed to be structured upon saving. Perhaps you could provide an example command for running a script to train a qed model as well?

I have tried running the following, but get errors about the tokenizer not having a particular attribute:

(rt) asr50@LZ16-ASR50-DSA:/mnt/e/users/asr50/vs_code_projects/small_molecules/regression-transformer$ python scripts/ --output_dir ./new_trained_models \
    --config_name configs/rt_small.json --tokenizer_name ./vocabs/smallmolecules.txt \
    --do_train --do_eval --learning_rate 1e-4 --num_train_epochs 1 --save_total_limit 2 \
    --save_steps 500 --per_gpu_train_batch_size 16 --evaluate_during_training --eval_steps 5 \
    --eval_data_file ./box/data/qed/chembl_selfies_eval.txt --train_data_file ./box/data/qed/chembl_selfies_train.txt \
    --line_by_line --block_size 510 --seed 42 --logging_steps 100 --eval_accumulation_steps 2 \
    --training_config_path training_configs/qed_alternated_cc.json
PyTorch: setting up devices
WARNING:__main__:Process rank: -1, device: cpu, n_gpu: 0, distributed training: False
INFO:__main__:Training/evaluation parameters CustomTrainingArguments(output_dir='./new_trained_models', overwrite_output_dir=False, do_train=True, do_eval=True, do_predict=False, evaluate_during_training=True, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=16, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, learning_rate=0.0001, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Jan21_16-28-06_LZ16-ASR50-DSA', logging_first_step=False, logging_steps=100, save_steps=500, save_total_limit=2, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=5, past_index=-1, run_name=None, disable_tqdm=False, remove_unused_columns=True, eval_accumulation_steps=2, training_config_path='training_configs/qed_alternated_cc.json')
loading configuration file configs/rt_small.json
/home/asr50/miniconda3/envs/rt/lib/python3.7/site-packages/transformers/ FutureWarning: This config doesn't use attention memories, a core feature of XLNet. Consider setting `men_len` to a non-zero value, for example `xlnet = XLNetLMHeadModel.from_pretrained('xlnet-base-cased'', mem_len=1024)`, for accurate training performance as well as an order of magnitude faster inference. Starting from version 3.5.0, the default parameter will be 1024, following the implementation in
Model config XLNetConfig {
  "architectures": [
  "attn_type": "bi",
  "bi_data": false,
  "bos_token_id": 14,
  "clamp_len": -1,
  "d_head": 16,
  "d_inner": 1024,
  "d_model": 256,
  "dropout": 0.2,
  "end_n_top": 5,
  "eos_token_id": 14,
  "ff_activation": "gelu",
  "initializer_range": 0.02,
  "language": "selfies",
  "layer_norm_eps": 1e-12,
  "mem_len": null,
  "model_type": "xlnet",
  "n_head": 16,
  "n_layer": 32,
  "numerical_encodings_dim": 16,
  "numerical_encodings_format": "sum",
  "numerical_encodings_type": "float",
  "pad_token_id": 0,
  "reuse_len": null,
  "same_length": false,
  "start_n_top": 5,
  "summary_activation": "tanh",
  "summary_last_dropout": 0.1,
  "summary_type": "last",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 250
  "untie_r": true,
  "use_numerical_encodings": true,
  "vmax": 1.0,
  "vocab_size": 507

Model name './vocabs/smallmolecules.txt' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-w-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, TurkuNLP/bert-base-finnish-cased-v1, TurkuNLP/bert-base-finnish-uncased-v1, wietsedv/bert-base-dutch-cased). Assuming './vocabs/smallmolecules.txt' is a path, a model identifier, or url to a directory containing tokenizer files.
Calling ExpressionBertTokenizer.from_pretrained() with the path to a single file or url is deprecated
loading file ./vocabs/smallmolecules.txt
INFO:__main__:Training new model from scratch
/home/asr50/miniconda3/envs/rt/lib/python3.7/site-packages/transformers/ FutureWarning: The class `AutoModelWithLMHead` is deprecated and will be removed in a future version. Please use `AutoModelForCausalLM` for causal language models, `AutoModelForMaskedLM` for masked language models and `AutoModelForSeq2SeqLM` for encoder-decoder models.
INFO:__main__:PyTorch version: 1.13.1
/home/asr50/miniconda3/envs/rt/lib/python3.7/site-packages/transformers/ FutureWarning: The `max_len` attribute has been deprecated and will be removed in a future version, use `model_max_length` instead.
Creating features from dataset file at ./box/data/qed/chembl_selfies_train.txt
Creating features from dataset file at ./box/data/qed/chembl_selfies_eval.txt
INFO:__main__:Dataset sizes 1395602, 1000.
INFO:__main__:Number of parameters 27508219 of type <class 'transformers.modeling_xlnet.XLNetLMHeadModel'>
INFO:__main__:Training with alternate tasks
/home/asr50/miniconda3/envs/rt/lib/python3.7/site-packages/transformers/ FutureWarning: Passing `prediction_loss_only` as a keyword argument is deprecated and won't be possible in a future version. Use `args.prediction_loss_only` instead.
You are instantiating a Trainer but Tensorboard is not installed. You should consider installing it.
You are instantiating a Trainer but W&B is not installed. To use wandb logging, run `pip install wandb; wandb login` see
INFO:terminator.trainer:Verbose evaluation True
INFO:terminator.trainer:Attempting to use numerical encodings.
Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.
Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.
Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.
INFO:terminator.trainer:***** Running training *****
INFO:terminator.trainer:Model device cpu
INFO:terminator.trainer:  Num examples = 1395602
INFO:terminator.trainer:  Num Epochs = 1
INFO:terminator.trainer:  Instantaneous batch size per device = 8
INFO:terminator.trainer:  Total train batch size (w. parallel, distributed & accumulation) = 16
INFO:terminator.trainer:  Gradient Accumulation steps = 1
INFO:terminator.trainer:  Total optimization steps = 87226
Epoch:   0%|                                                             | 0/1 [00:00<?, ?it/sWARNING:terminator.trainer:Loading alternative collator for evaluation.:20<114:40:45,  4.73s/it]
INFO:terminator.trainer:***** Running Evaluation *****
INFO:terminator.trainer:  Num examples = 1000
INFO:terminator.trainer:  Batch size = 8
Evaluation: 100%|████████████████████████████████████████████| 125/125 [01:09<00:00,  1.79it/s]
{'eval_loss': 4.566575050354004, 'epoch': 5.7322358012519205e-05, 'step': 5}
INFO:terminator.trainer:Evaluation {'eval_loss': 4.566575050354004, 'epoch': 5.7322358012519205e-05}
You are instantiating a Trainer but Tensorboard is not installed. You should consider installing it.
You are instantiating a Trainer but W&B is not installed. To use wandb logging, run `pip install wandb; wandb login` see
INFO:terminator.trainer:Verbose evaluation True
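
The step count, the fractional epoch and the time estimate in the log above can be cross-checked with a few lines of arithmetic. This is only a sanity check on the logged numbers, assuming the last batch is kept (the log shows dataloader_drop_last=False) and taking the 4.73 s/it rate from the progress bar at face value; the variable names are chosen for illustration.

```python
import math

num_examples = 1_395_602   # "Num examples" in the training log
total_batch_size = 16      # "Total train batch size" in the training log
num_epochs = 1             # "Num Epochs" in the training log
seconds_per_step = 4.73    # rate shown in the progress bar (CPU run)

# With drop_last disabled, the final partial batch still counts as a step.
steps_per_epoch = math.ceil(num_examples / total_batch_size)
total_steps = steps_per_epoch * num_epochs

# Fractional epoch reported at the first evaluation (eval_steps=5).
epoch_at_step_5 = 5 / steps_per_epoch

# Rough wall-clock estimate for the training steps alone.
estimated_hours = total_steps * seconds_per_step / 3600

print(total_steps)
print(epoch_at_step_5)
print(round(estimated_hours, 1))
```

The step count matches "Total optimization steps = 87226", and the estimate of roughly 115 hours agrees with the remaining time shown in the progress bar. The evaluation passes triggered every 5 steps are not included in that estimate.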
jannisborn commented 5 months ago

Hi @amelie-iska,

thanks for your interest in the work. In general, this repo is not actively maintained in favor of the RT's availability in GT4SD. Therefore, unless you want to exactly reproduce experiments from the paper, we always recommend using GT4SD.

Please install GT4SD from source; training an RT model can then be done from the CLI with gt4sd-trainer .... The GT4SD repo has several examples of this in the main README and the examples folder, as well as closed issues that describe the procedure. Please note that GT4SD uses a slightly updated version of the RT's code, which is available in this repo under the gt4sd branch.

Regarding the training code: Did you just run the example from the README here? If yes, it might indeed be a bug. But

Regarding your current evaluation code: it is failing because of the path to the checkpoint, which either does not exist or is not explicit. The model therefore tries to find the checkpoint recursively and, since it cannot find one, enters a recursive loop.
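
One way to check what is actually inside the folder passed as --output_dir is to list the checkpoint folders by hand. The sketch below is not the repo's own lookup (that lives in terminator.utils and may behave differently). It assumes the common Hugging Face Trainer layout of checkpoint-<step> subfolders that each hold a config.json, and the function name find_checkpoints is made up for illustration.

```python
from pathlib import Path
from typing import List


def find_checkpoints(model_dir: str, marker: str = "checkpoint") -> List[Path]:
    """List subfolders of model_dir whose name contains marker and that hold a config.json.

    Assumption: checkpoints follow the checkpoint-<step> folder convention.
    """
    root = Path(model_dir)
    if not root.is_dir():
        return []

    hits = [
        p for p in root.iterdir()
        if p.is_dir() and marker in p.name and (p / "config.json").is_file()
    ]

    def step(p: Path) -> int:
        # checkpoint-500 -> 500; names without a trailing number sort first.
        tail = p.name.rsplit("-", 1)[-1]
        return int(tail) if tail.isdigit() else -1

    return sorted(hits, key=step)


if __name__ == "__main__":
    for ckpt in find_checkpoints("./box/trained_models/qed"):
        print(ckpt)
```

If this prints nothing for ./box/trained_models/qed, the folder probably does not directly contain a checkpoint folder in that layout, which would be consistent with the "No checkpoints found" warnings.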

amelie-iska commented 5 months ago

I am consistently having issues setting up the conda environment for GT4SD. I've tried creating the environment directly from the gpu_conda.yml file but this does not work. I've tried installing the dependencies individually by hand starting with a python=3.8 conda environment, and installing pytorch using conda/mamba with:

mamba install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

After installing requirements.txt and gpu_requirements.txt there is always an issue with

pip install git+


pip install git+

in the version control system requirements file. When attempting to run gt4sd-inference --help the process is aborted and I get the following:

(gt4sd-py38b) asr50@LZ16-ASR50-DSA:/mnt/e/users/asr50/vs_code_projects/small_molecules/gt4sd-core$ gt4sd-inference --help
2024-01-22 12:43:50.175484: I tensorflow/core/platform/] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-22 12:43:50.268301: I tensorflow/core/util/] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-01-22 12:43:50.291889: E tensorflow/stream_executor/cuda/] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-22 12:43:50.743712: W tensorflow/stream_executor/platform/default/] Could not load dynamic library ''; dlerror: cannot open shared object file: No such file or directory
2024-01-22 12:43:50.743864: W tensorflow/stream_executor/platform/default/] Could not load dynamic library ''; dlerror: cannot open shared object file: No such file or directory
2024-01-22 12:43:50.743885: W tensorflow/compiler/tf2tensorrt/utils/] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
/home/asr50/miniconda3/envs/gt4sd-py38b/lib/python3.8/site-packages/torchvision/io/ UserWarning: Failed to load image Python extension: '/home/asr50/miniconda3/envs/gt4sd-py38b/lib/python3.8/site-packages/torchvision/ undefined symbol: _ZN3c104impl8GPUTrace13gpuTraceStateE'If you don't plan on using image functionality from ``, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
(gt4sd-py38b) asr50@LZ16-ASR50-DSA:/mnt/e/users/asr50/vs_code_projects/small_molecules/gt4sd-core$
jannisborn commented 5 months ago

It's surprising that you can't manage to set up a GT4SD env. Which OS do you have? Be aware that M1 Macs are not supported:

Also, when creating the env:

git clone
cd gt4sd-core/
conda env create -f conda_gpu.yml 
conda activate gt4sd

have you tried replacing pip install gt4sd with pip install -e ., i.e., the developer setup?

amelie-iska commented 5 months ago

Hmm, I'm getting new errors today. I am using WSL on a Windows machine. I was able to follow the instructions above just fine and get the GPU conda environment working. However, now I'm getting errors when running the following help command for inference:

(gt4sd) asr50@LZ16-ASR50-DSA:/mnt/e/users/asr50/vs_code_projects/small_molecules/gt4sd-core$ gt4sd-inference --help
2024-01-23 13:05:32.154405: I tensorflow/core/platform/] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-23 13:05:32.261593: I tensorflow/core/util/] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-01-23 13:05:32.282343: E tensorflow/stream_executor/cuda/] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-23 13:05:32.711339: W tensorflow/stream_executor/platform/default/] Could not load dynamic library ''; dlerror: cannot open shared object file: No such file or directory
2024-01-23 13:05:32.711422: W tensorflow/stream_executor/platform/default/] Could not load dynamic library ''; dlerror: cannot open shared object file: No such file or directory
2024-01-23 13:05:32.711429: W tensorflow/compiler/tf2tensorrt/utils/] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
jannisborn commented 5 months ago

Yes, I've seen this one before; it has to do with the installation in editable mode. If you go to the gt4sd folder and run:

pip install .

then I'm pretty sure it will work afterwards.

amelie-iska commented 5 months ago
(gt4sd) asr50@LZ16-ASR50-DSA:/mnt/e/users/asr50/vs_code_projects/small_molecules/gt4sd-core/src/gt4sd$ pip install .
ERROR: Directory '.' is not installable. Neither '' nor 'pyproject.toml' found.
(gt4sd) asr50@LZ16-ASR50-DSA:/mnt/e/users/asr50/vs_code_projects/small_molecules/gt4sd-core/src/gt4sd$
jannisborn commented 5 months ago

cd ../.. && pip install .
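
pip only treats a directory as installable if it contains a setup.py or a pyproject.toml, which is why the command has to be run from the repository root rather than from src/gt4sd. As a small illustration, the sketch below walks upwards from a starting folder until it finds one of those files; the helper name find_project_root is hypothetical and not part of pip or GT4SD.

```python
from pathlib import Path
from typing import Optional

# Files that make a directory installable with "pip install ."
MARKERS = ("pyproject.toml", "setup.py")


def find_project_root(start: str = ".") -> Optional[Path]:
    """Return the closest ancestor of start (including start) that holds a marker file."""
    current = Path(start).resolve()
    for candidate in (current, *current.parents):
        if any((candidate / marker).is_file() for marker in MARKERS):
            return candidate
    return None


if __name__ == "__main__":
    # Run from anywhere inside the cloned repo to see where "pip install ." belongs.
    print(find_project_root("."))
```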

amelie-iska commented 5 months ago

New error now.

(gt4sd) asr50@LZ16-ASR50-DSA:/mnt/e/users/asr50/vs_code_projects/small_molecules/gt4sd-core$ gt4sd-inference --help
2024-01-23 13:28:36.894862: I tensorflow/core/platform/] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.    
2024-01-23 13:28:36.999321: I tensorflow/core/util/] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-01-23 13:28:37.020478: E tensorflow/stream_executor/cuda/] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-23 13:28:37.372029: W tensorflow/stream_executor/platform/default/] Could not load dynamic library ''; dlerror: cannot open shared object file: No such file or directory
2024-01-23 13:28:37.372109: W tensorflow/stream_executor/platform/default/] Could not load dynamic library ''; dlerror: cannot open shared object file: No such file or directory
2024-01-23 13:28:37.372128: W tensorflow/compiler/tf2tensorrt/utils/] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
jannisborn commented 5 months ago

This is related to gt4sd-molformer, what do you get with:

pip freeze | grep fast
pip freeze | grep mol
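
The same check can be done from inside Python, which avoids any doubt about which environment pip belongs to. This is a minimal sketch using the standard library's importlib.metadata (Python 3.8+); the function name installed_matching is made up for illustration. Unlike grep on pip freeze output, it matches only the package name, not the install path or URL.

```python
from importlib import metadata
from typing import Dict


def installed_matching(substring: str) -> Dict[str, str]:
    """Map package name -> version for installed packages whose name contains substring."""
    needle = substring.lower()
    found: Dict[str, str] = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        # Broken installs can lack a name; skip those.
        if name and needle in name.lower():
            found[name] = dist.version
    return found


if __name__ == "__main__":
    for fragment in ("fast", "mol"):
        print(fragment, installed_matching(fragment))
```

An empty result for "fast" would suggest that no package with "fast" in its name is installed in the active environment.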

amelie-iska commented 5 months ago
(gt4sd) asr50@LZ16-ASR50-DSA:/mnt/e/users/asr50/vs_code_projects/small_molecules/gt4sd-core$ pip freeze | grep fast
(gt4sd) asr50@LZ16-ASR50-DSA:/mnt/e/users/asr50/vs_code_projects/small_molecules/gt4sd-core$ pip freeze | grep mol
gt4sd @ file:///mnt/e/users/asr50/vs_code_projects/small_molecules/gt4sd-core
guacamol-baselines @ git+
molecule-generation @ git+
(gt4sd) asr50@LZ16-ASR50-DSA:/mnt/e/users/asr50/vs_code_projects/small_molecules/gt4sd-core$
jannisborn commented 5 months ago

The part about the CPU is interesting. It's a problem in the interaction between fast-transformers and pytorch. What does this give you?

import torch
print(torch.__version__)
print(torch.cuda.is_available())
jannisborn commented 5 months ago

Oh, I just read that you are using WSL. I have no experience with Windows/WSL, but I would be surprised if you could use your Windows GPU directly through a standard torch installation. The error above indicates a potential GPU/CPU problem.

amelie-iska commented 5 months ago

Hmm, yeah. I've trained using the GPU and WSL before, so I know it's possible. I was even able to train a Regression Transformer at one point (albeit with errors). The checkpoints saved fine though. I just couldn't get inference to work properly, primarily due to not understanding what folder/file structure to provide when trying to run the script. But you are right, for some reason CUDA doesn't seem available in this environment.

(gt4sd) asr50@LZ16-ASR50-DSA:/mnt/e/users/asr50/vs_code_projects/small_molecules/gt4sd-core$ py
Python 3.8.18 | packaged by conda-forge | (default, Dec 23 2023, 17:21:28) 
[GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.__version__)
>>> print(torch.cuda.is_available())
>>> exit()
(gt4sd) asr50@LZ16-ASR50-DSA:/mnt/e/users/asr50/vs_code_projects/small_molecules/gt4sd-core$
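
A slightly more detailed version of this check can help tell a CPU-only PyTorch build apart from a CUDA build that cannot see the GPU. The sketch below is one possible diagnostic, not something from the GT4SD docs; the function name torch_cuda_report is made up for illustration, and it only uses torch.__version__, torch.version.cuda, torch.cuda.is_available() and torch.cuda.device_count().

```python
import importlib.util
from typing import Any, Dict


def torch_cuda_report() -> Dict[str, Any]:
    """Collect basic facts about the PyTorch install in the active environment."""
    report: Dict[str, Any] = {
        "torch_installed": False,
        "torch_version": None,
        "built_with_cuda": None,
        "cuda_available": False,
        "device_count": 0,
    }
    if importlib.util.find_spec("torch") is None:
        return report  # torch is not importable in this environment

    import torch

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    # None here means a CPU-only build, regardless of the GPU or driver.
    report["built_with_cuda"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in torch_cuda_report().items():
        print(f"{key}: {value}")
```

If built_with_cuda is None, the environment has a CPU-only PyTorch build. If it shows a CUDA version but cuda_available is False, the problem is more likely on the driver or WSL side.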