IBM / regression-transformer

Regression Transformer (2023; Nature Machine Intelligence)
https://www.nature.com/articles/s42256-023-00639-z
MIT License

Hyperparameters for training from scratch #22

Closed TheMatrixMaster closed 3 months ago

TheMatrixMaster commented 3 months ago

Hi, thanks for sharing this great work! I'm trying to train from scratch on a dataset of ~900k protein sequences, and I'm having some trouble getting an intuition for which hyperparameters are reasonable to use. Is the provided example config a good place to start? I couldn't find very detailed information about which hyperparameters were used to train the models in the paper. I'm currently using the configs below:

{
    "architectures": [
      "XLNetLMHeadModel"
    ],
    "attn_type": "bi",
    "bi_data": false,
    "bos_token_id": 14,
    "clamp_len": -1,
    "d_head": 16,
    "d_inner": 1024,
    "d_model": 256,
    "dropout": 0.2,
    "end_n_top": 5,
    "eos_token_id": 14,
    "ff_activation": "gelu",
    "initializer_range": 0.02,
    "language": "AAS",
    "layer_norm_eps": 1e-12,
    "mem_len": null,
    "model_type": "xlnet",
    "n_head": 16,
    "n_layer": 32,
    "numerical_encodings_dim": 16,
    "numerical_encodings_format": "sum",
    "numerical_encodings_type": "float",
    "pad_token_id": 0,
    "reuse_len": null,
    "same_length": false,
    "start_n_top": 5,
    "summary_activation": "tanh",
    "summary_last_dropout": 0.1,
    "summary_type": "last",
    "summary_use_proj": true,
    "task_specific_params": {
      "text-generation": {
        "do_sample": true,
        "max_length": 250
      }
    },
    "untie_r": true,
    "use_numerical_encodings": true,
    "vmax": 1.0,
    "vocab_size": 511
}
{
    "reset_training_loss": true,
    "alternate_tasks": true,
    "cc_loss": true,
    "property_tokens": [
        "<kdpe>"
    ],
    "alternate_steps": 50,
    "checkpoint-str": "best",
    "cg_collator": "vanilla_cg",
    "cg_collator_params": {
        "do_sample": false,
        "property_tokens": [
            "<kdpe>"
        ],
        "plm_probability": 0.4,
        "max_span_length": 12
    }
}

and training with the following flags:

#!/bin/bash

python scripts/run_language_modeling.py \
    --output_dir rt_example \
    --config_name configs/rt_small.json \
    --tokenizer_name ./vocabs/smallmolecules.txt \
    --do_train \
    --do_eval \
    --learning_rate 1e-4 \
    --num_train_epochs 5 \
    --save_total_limit 2 \
    --save_steps 500 \
    --per_gpu_train_batch_size 16 \
    --evaluate_during_training \
    --eval_steps 5 \
    --eval_data_file ./examples/qed_property_example.txt \
    --train_data_file ./examples/qed_property_example.txt \
    --line_by_line \
    --block_size 510 \
    --seed 42 \
    --logging_steps 100 \
    --eval_accumulation_steps 2 \
    --training_config_path training_configs/qed_alternated_cc.json \
    --overwrite_output_dir \
    --no_cuda
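
As a quick sanity check (a rough sketch on my side, not from the repo docs; paths taken from the script above), I compare the vocab_size in the config against the tokenizer vocabulary the script points at:

# Rough sanity check: the config's vocab_size should cover the tokenizer vocabulary.
python -c "import json; print('config vocab_size:', json.load(open('configs/rt_small.json'))['vocab_size'])"
# Token count of the vocab file, assuming one token per line (special tokens may be added on top).
wc -l ./vocabs/smallmolecules.txt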

Could you also provide some recommendations on what hardware to use? Thanks!

jannisborn commented 3 months ago

Hi @TheMatrixMaster,

Thanks for your interest in this work. Training from scratch is discouraged in favor of finetuning with the gt4sd-trainer, as described in the GT4SD README: https://github.com/GT4SD/gt4sd-core/tree/main/examples/regression_transformer

Please use GT4SD: it ships an improved version of the RT that is exposed in this repo only on the gt4sd branch, not on main.
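
A rough sketch of how that switch could look (the pipeline name and flags below are my assumption, not taken from the README; the linked example and the trainer's --help output are the source of truth):

pip install gt4sd
# List the available training pipelines and their arguments.
gt4sd-trainer --help
# Assumed pipeline name for the RT; verify it against the --help output / linked README.
gt4sd-trainer --training_pipeline_name regression-transformer-trainer --help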

Since you are working on protein sequences, please start your finetuning from the stability model, as explained in the GT4SD example linked above. That model was pretrained on a few million protein sequences with a synthetic property (Boman index) and then finetuned on the protein-stability dataset used in the TAPE paper. See the RT paper for details.

About hardware: single-GPU usage should be fine; the code is likely not out-of-the-box compatible with multi-GPU setups. Please be aware that training is not super fast due to the XLNet backbone.
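
For example (a sketch, assuming a CUDA machine and that the flags from your question are saved, without --no_cuda, in a hypothetical run_rt.sh):

# Pin a single GPU and drop --no_cuda from the command above.
export CUDA_VISIBLE_DEVICES=0
bash run_rt.sh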

Parameters: which ones are you looking for an intuition about? In general, please read the docstrings in GT4SD: https://github.com/GT4SD/gt4sd-core/blob/daae05b8846563501c4a10245ec3bfa7c1982e47/src/gt4sd/training_pipelines/regression_transformer/core.py#L40
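
If it helps, a quick way to dump those docstrings locally (a sketch, assuming gt4sd is installed; the module path is taken from the link above):

# Print the RT training-pipeline module source, including its parameter docstrings.
python -c "import inspect, gt4sd.training_pipelines.regression_transformer.core as rt; print(inspect.getsource(rt))"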

Closing as completed, but feel free to reopen or comment in case of more questions.