annambiar / PRoBERTa


Trying to get PPI prediction to run #5

Open JudithBernett opened 1 year ago

JudithBernett commented 1 year ago

Hi everyone, I am trying to get the PPI prediction part of PRoBERTa to run but I need help with the package versions. Can you specify which versions of python, numpy and pytorch you used?

Thank you very much in advance! Best, Judith

Here is my concrete problem:

I first made a conda environment following the README.md:

conda create -n proberta python=3.8
conda activate proberta

pip3 install sentencepiece

git clone https://github.com/imonlius/fairseq.git
cd fairseq
pip3 install --editable . --no-binary cffi
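
For reference, this is how I check which versions pip actually resolved in this environment (the README does not pin torch or numpy, so these are not necessarily the versions the authors used):

python -c "import sys, torch, numpy; print(sys.version); print('torch', torch.__version__); print('numpy', numpy.__version__); print('CUDA available:', torch.cuda.is_available())"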

I want to run pRoBERTa_finetune_ppi.sh

The example command from the README is:

bash pRoBERTa_finetune_ppi.sh ppi 4 ppi_prediction \
        ppi_prediction/split_binarized/robustness_minisplits/0.80/ \
        768 5 12500 312 0.0025 32 64 2 3 \
        pretraining/checkpoint_best.pt \
        no
| Name | Description | Example |
| --- | --- | --- |
| PREFIX | Prefix for the model output files | ppi |
| NUM_GPUS | Number of GPUs to use for finetuning | 4 |
| OUTPUT_DIR | Model output directory | ppi_prediction |
| DATA_DIR | Binarized input data directory | ppi_prediction/split_binarized/robustness_minisplits/1.00 |
| ENCODER_EMBED_DIM | Dimension of embedding generated by the encoders | 768 |
| ENCODER_LAYERS | Number of encoder layers in the model | 5 |
| TOTAL_UPDATES | Total (maximum) number of updates during training | 12500 |
| WARMUP_UPDATES | Total number of LR warm-up updates during training | 3125 |
| PEAK_LEARNING_RATE | Peak learning rate for training | 0.0025 |
| MAX_SENTENCES | Maximum number of sequences in each batch | 32 |
| UPDATE_FREQ | Updates the model every UPDATE_FREQ batches | 64 |
| PATIENCE | Early stop training if valid performance doesn't improve for PATIENCE consecutive validation runs | 3 |
| PRETRAIN_CHECKPOINT | Path to pretrained model checkpoint | pretraining/checkpoint_best.pt |
| RESUME_TRAINING | Whether to resume training from previous finetuned model checkpoints | no |
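
If I read the script correctly, the effective batch size per update is MAX_SENTENCES × UPDATE_FREQ × NUM_GPUS: with the example values that is 32 × 64 × 4 = 8192 sequences, and with the 3 GPUs I use below it is 32 × 64 × 3 = 6144, which matches the BATCH_6144 part of the checkpoint directory name in the log further down.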

I'm downloading pretraining/checkpoint_best.pt and ppi_prediction/split_binarized/robustness_minisplits/1.00 from the specified links.

Problem 1: The command is missing an argument:

An error is thrown: fairseq-train: error: argument --use-cls-token: expected one argument

I fix the command by appending a value for the missing argument, and I set NUM_GPUS to 3 since I only have 3 GPUs:

bash pRoBERTa_finetune_ppi.sh ppi 3 ppi_prediction \
        ppi_prediction/split_binarized/robustness_minisplits/1.00/ \
        768 5 12500 312 0.0025 32 64 2 3 \
        pretraining/checkpoint_best.pt \
        no 1
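
The extra trailing 1 ends up as the value of --use-cls-token (the Namespace dump in the log further down shows use_cls_token='1'); a quick way to check how the script forwards its positional arguments is:

grep -n "use-cls-token" pRoBERTa_finetune_ppi.sh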

Problem 2: Wrong numpy version

The following error is thrown:

AttributeError: module 'numpy' has no attribute 'float'.
`np.float` was a deprecated alias for the builtin `float`. To avoid this error in existing code, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

Solution: I downgrade to numpy 1.20.3

pip uninstall numpy
pip install numpy==1.20.3
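
An alternative I considered but did not test would be to keep the newer numpy and instead patch the deprecated np.float alias in the fairseq fork (sketch only, assumes GNU grep/sed):

grep -rl --include="*.py" 'np\.float\b' fairseq/ | xargs sed -i 's/np\.float\b/float/g'

Downgrading was simpler, so I went with numpy 1.20.3.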

Problem 3: No module named 'apex'

The program gets further this time but stops at a later point with this error:

torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "[...]/PRoBERTa/fairseq/fairseq/optim/fused_lamb.py", line 16, in __init__
    from apex.optimizers import FusedLAMB
ModuleNotFoundError: No module named 'apex'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "[...]/miniconda3/envs/proberta/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "[...]/PRoBERTa/fairseq/fairseq_cli/train.py", line 338, in distributed_main
    main(args, init_distributed=True)
  File "[...]/PRoBERTa/fairseq/fairseq_cli/train.py", line 104, in main
    extra_state, epoch_itr = checkpoint_utils.load_checkpoint(args, trainer)
  File "[...]/PRoBERTa/fairseq/fairseq/checkpoint_utils.py", line 132, in load_checkpoint
    extra_state = trainer.load_checkpoint(
  File "[...]/PRoBERTa/fairseq/fairseq/trainer.py", line 281, in load_checkpoint
    self.lr_step(epoch)
  File "[...]/PRoBERTa/fairseq/fairseq/trainer.py", line 620, in lr_step
    self.lr_scheduler.step(epoch, val_loss)
  File "[...]/PRoBERTa/fairseq/fairseq/trainer.py", line 168, in lr_scheduler
    self._build_optimizer()  # this will initialize self._lr_scheduler
  File "[...]/PRoBERTa/fairseq/fairseq/trainer.py", line 190, in _build_optimizer
    self._optimizer = optim.FP16Optimizer.build_optimizer(self.args, params)
  File "[...]/PRoBERTa/fairseq/fairseq/optim/fp16_optimizer.py", line 263, in build_optimizer
    fp32_optimizer = optim.build_optimizer(args, fp32_params)
  File "[...]/PRoBERTa/fairseq/fairseq/registry.py", line 41, in build_x
    return builder(args, *extra_args, **extra_kwargs)
  File "[...]/PRoBERTa/fairseq/fairseq/optim/fused_lamb.py", line 19, in __init__
    raise ImportError('Please install apex to use LAMB optimizer')
ImportError: Please install apex to use LAMB optimizer

Solution: I install apex according to the instructions in the apex GitHub repository:

  1. The apex README states: To install Apex from source, we recommend using the nightly PyTorch. Our system runs on CUDA 11.8, so we execute:
    pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118
  2. The apex README: For performance and full functionality, we recommend installing Apex with CUDA and C++ extensions via:
    git clone https://github.com/NVIDIA/apex
    cd apex
    # if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key... 
    pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
    # otherwise
    pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./

    I have pip 23.1.2, so I execute the first command.

Problem 4: Apex installation fails

The following error message is displayed:

  ModuleNotFoundError: No module named 'packaging'
  error: subprocess-exited-with-error

  × Preparing metadata (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
  full command: [...]/miniconda3/envs/proberta/bin/python3.8 [...]/miniconda3/envs/proberta/lib/python3.8/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py prepare_metadata_for_build_wheel /tmp/tmpzfst6ax8
  cwd: [...]/PRoBERTa/apex
  Preparing metadata (pyproject.toml) ... error
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

Solution: install packaging separately:

pip3 install packaging

Then I re-execute the install command, and apex-0.1 is installed successfully.
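
As a quick sanity check of my own (not part of the PRoBERTa instructions), I verify that the import that failed in Problem 3 now works:

python -c "from apex.optimizers import FusedLAMB; print('FusedLAMB import OK')"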

I try executing the command again.

Problem 5: Numpy error

The validation data is loaded, the model is initialized, and the training data is loaded, but then this error is thrown:

         pretraining/checkpoint_best.pt         no 1
2023-06-22 15:51:40 | INFO | fairseq.distributed_utils | distributed init (rank 0): tcp://localhost:14799
2023-06-22 15:51:40 | INFO | fairseq.distributed_utils | distributed init (rank 2): tcp://localhost:14799
2023-06-22 15:51:40 | INFO | fairseq.distributed_utils | initialized host gpu02.exbio.wzw.tum.de as rank 2
2023-06-22 15:51:40 | INFO | fairseq.distributed_utils | distributed init (rank 1): tcp://localhost:14799
2023-06-22 15:51:40 | INFO | fairseq.distributed_utils | initialized host gpu02.exbio.wzw.tum.de as rank 1
2023-06-22 15:51:40 | INFO | fairseq.distributed_utils | initialized host gpu02.exbio.wzw.tum.de as rank 0
2023-06-22 15:51:43 | INFO | fairseq_cli.train | Namespace(activation_dropout=0.0, activation_fn='gelu', add_prev_output_tokens=False, all_gather_list_size=16384, arch='roberta_base', attention_dropout=0.1, best_checkpoint_metric='accuracy', bf16=False, bpe='sentencepiece', broadcast_buffers=False, bucket_cap_mb=25, checkpoint_suffix='', classification_head_name='protein_interaction_prediction', clip_norm=25, cpu=False, criterion='sentence_prediction', curriculum=0, data='ppi_prediction/split_binarized/robustness_minisplits/1.00/', data_buffer_size=0, dataset_impl=None, ddp_backend='no_c10d', device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method='tcp://localhost:14799', distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=3, distributed_wrapper='DDP', dropout=0.1, empty_cache_freq=0, encoder_attention_heads=12, encoder_embed_dim=768, encoder_ffn_embed_dim=3072, encoder_layerdrop=0, encoder_layers=5, encoder_layers_to_keep=None, end_learning_rate=0.0, fast_stat_sync=False, find_unused_parameters=False, fix_batches_to_gpus=False, fixed_validation_seed=None, force_anneal=None, fp16=True, fp16_init_scale=128, fp16_no_flatten_grads=True, fp16_scale_tolerance=0.0, fp16_scale_window=None, init_token=0, keep_best_checkpoints=-1, keep_interval_updates=5, keep_last_epochs=-1, lamb_betas='(0.9, 0.999)', lamb_eps=1e-08, localsgd_frequency=3, log_format='simple', log_interval=1000, lr=[0.0025], lr_scheduler='polynomial_decay', max_epoch=0, max_positions=512, max_sentences=32, max_sentences_valid=32, max_tokens=None, max_tokens_valid=None, max_update=12500, maximize_best_checkpoint_metric=True, memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=-1, model_parallel_size=1, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, no_shuffle=False, nprocs_per_node=3, num_classes=2, num_workers=1, optimizer='lamb', optimizer_overrides='{}', patience=3, pooler_activation_fn='tanh', pooler_dropout=0.0, power=1.0, quant_noise_pq=0, quant_noise_pq_block_size=8, quant_noise_scalar=0, quantization_config_path=None, regression_target=False, required_batch_size_multiple=8, reset_dataloader=True, reset_lr_scheduler=False, reset_meters=True, reset_optimizer=True, restore_file='pretraining/checkpoint_best.pt', save_dir='ppi_prediction/ppi.DIM_768.LAYERS_5.UPDATES_12500.WARMUP_312.LR_0.0025.BATCH_6144.PATIENCE_3/checkpoints', save_interval=1, save_interval_updates=100, seed=1, sentence_avg=False, sentencepiece_vocab=None, separator_token=2, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, task='sentence_prediction', tensorboard_logdir='', threshold_loss_scale=None, tokenizer=None, total_num_update=12500, tpu=False, train_subset='train', truncate_sequence=True, update_freq=[64], use_bmuf=False, use_cls_token='1', user_dir=None, valid_subset='valid', validate_interval=1, warmup_updates=312, weight_decay=0.01)
2023-06-22 15:51:43 | INFO | fairseq.tasks.sentence_prediction | [input] dictionary: 10009 types
2023-06-22 15:51:43 | INFO | fairseq.tasks.sentence_prediction | [label] dictionary: 13 types
2023-06-22 15:51:43 | INFO | fairseq.data.data_utils | loaded 60057 examples from: ppi_prediction/split_binarized/robustness_minisplits/1.00/input0/valid
2023-06-22 15:51:43 | INFO | fairseq.data.data_utils | loaded 60057 examples from: ppi_prediction/split_binarized/robustness_minisplits/1.00/input1/valid
2023-06-22 15:51:43 | INFO | fairseq.data.data_utils | loaded 60057 examples from: ppi_prediction/split_binarized/robustness_minisplits/1.00/label/valid
2023-06-22 15:51:43 | INFO | fairseq.tasks.sentence_prediction | Loaded valid with #samples: 60057
2023-06-22 15:51:44 | INFO | fairseq_cli.train | RobertaModel(
  (decoder): RobertaEncoder(
    (sentence_encoder): TransformerSentenceEncoder(
      (embed_tokens): Embedding(10009, 768, padding_idx=1)
      (embed_positions): LearnedPositionalEmbedding(514, 768, padding_idx=1)
      (layers): ModuleList(
        (0-4): 5 x TransformerSentenceEncoderLayer(
          (self_attn): MultiheadAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): FusedLayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): FusedLayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)
        )
      )
      (emb_layer_norm): FusedLayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)
    )
    (lm_head): RobertaLMHead(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (layer_norm): FusedLayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)
    )
  )
  (classification_heads): ModuleDict(
    (protein_interaction_prediction): RobertaClassificationHead(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (dropout): Dropout(p=0.0, inplace=False)
      (out_proj): Linear(in_features=768, out_features=2, bias=True)
    )
  )
)
2023-06-22 15:51:44 | INFO | fairseq_cli.train | model roberta_base, criterion SentencePredictionCriterion
2023-06-22 15:51:44 | INFO | fairseq_cli.train | num. model params: 44716827 (num. trained: 44716827)
2023-06-22 15:51:45 | INFO | fairseq.trainer | detected shared parameter: decoder.sentence_encoder.embed_tokens.weight <- decoder.lm_head.weight
2023-06-22 15:51:45 | INFO | fairseq_cli.train | training on 3 devices (GPUs/TPUs)
2023-06-22 15:51:45 | INFO | fairseq_cli.train | max tokens per GPU = None and max sentences per GPU = 32
2023-06-22 15:51:45 | INFO | fairseq.models.roberta.model | Overwriting classification_heads.protein_interaction_prediction.dense.weight
2023-06-22 15:51:45 | INFO | fairseq.models.roberta.model | Overwriting classification_heads.protein_interaction_prediction.dense.bias
2023-06-22 15:51:45 | INFO | fairseq.models.roberta.model | Overwriting classification_heads.protein_interaction_prediction.out_proj.weight
2023-06-22 15:51:45 | INFO | fairseq.models.roberta.model | Overwriting classification_heads.protein_interaction_prediction.out_proj.bias
2023-06-22 15:51:45 | INFO | fairseq.trainer | loaded checkpoint pretraining/checkpoint_best.pt (epoch 215 @ 0 updates)
2023-06-22 15:51:45 | INFO | fairseq.trainer | loading train data for epoch 1
2023-06-22 15:51:45 | INFO | fairseq.data.data_utils | loaded 480455 examples from: ppi_prediction/split_binarized/robustness_minisplits/1.00/input0/train
2023-06-22 15:51:45 | INFO | fairseq.data.data_utils | loaded 480455 examples from: ppi_prediction/split_binarized/robustness_minisplits/1.00/input1/train
2023-06-22 15:51:45 | INFO | fairseq.data.data_utils | loaded 480455 examples from: ppi_prediction/split_binarized/robustness_minisplits/1.00/label/train
2023-06-22 15:51:45 | INFO | fairseq.tasks.sentence_prediction | Loaded train with #samples: 480455
[...]/PRoBERTa/fairseq/fairseq/trainer.py:90: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  self._grad_norm_buf = torch.cuda.DoubleTensor(self.data_parallel_world_size)
[...]/PRoBERTa/fairseq/fairseq/trainer.py:90: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  self._grad_norm_buf = torch.cuda.DoubleTensor(self.data_parallel_world_size)
[...]/PRoBERTa/fairseq/fairseq/trainer.py:90: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  self._grad_norm_buf = torch.cuda.DoubleTensor(self.data_parallel_world_size)
Traceback (most recent call last):
  File "[...]/miniconda3/envs/proberta/bin/fairseq-train", line 8, in <module>
    sys.exit(cli_main())
  File "[...]/PRoBERTa/fairseq/fairseq_cli/train.py", line 367, in cli_main
    torch.multiprocessing.spawn(
  File "[...]/miniconda3/envs/proberta/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 239, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "[...]/miniconda3/envs/proberta/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "[...]/miniconda3/envs/proberta/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "[...]/miniconda3/envs/proberta/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "[...]/PRoBERTa/fairseq/fairseq_cli/train.py", line 338, in distributed_main
    main(args, init_distributed=True)
  File "[...]/PRoBERTa/fairseq/fairseq_cli/train.py", line 104, in main
    extra_state, epoch_itr = checkpoint_utils.load_checkpoint(args, trainer)
  File "[...]/PRoBERTa/fairseq/fairseq/checkpoint_utils.py", line 156, in load_checkpoint
    epoch_itr = trainer.get_train_iterator(
  File "[...]/PRoBERTa/fairseq/fairseq/trainer.py", line 312, in get_train_iterator
    return self.task.get_batch_iterator(
  File "[...]/PRoBERTa/fairseq/fairseq/tasks/fairseq_task.py", line 176, in get_batch_iterator
    batch_sampler = data_utils.batch_by_size(
  File "[...]/PRoBERTa/fairseq/fairseq/data/data_utils.py", line 220, in batch_by_size
    from fairseq.data.data_utils_fast import batch_by_size_fast
  File "fairseq/data/data_utils_fast.pyx", line 1, in init fairseq.data.data_utils_fast
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

This issue is known: https://github.com/facebookresearch/fairseq/issues/3203, https://stackoverflow.com/questions/65974718/valueerror-numpy-ndarray-size-changed-may-indicate-binary-incompatibility

There, the suggested solution was to upgrade from numpy 1.18 to 1.20.0; however, I am already on numpy 1.20.3.

Downgrading to numpy 1.20.0:

pip uninstall numpy
pip install numpy==1.20.0

The error persists. I cannot upgrade numpy because of Problem 2.
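
One thing I have not tried yet: since fairseq's Cython extension (data_utils_fast) was compiled against whatever numpy was installed at build time, rebuilding the editable fairseq install with numpy already pinned might resolve the ABI mismatch, roughly:

pip uninstall -y fairseq
cd fairseq
pip3 install --editable . --no-binary cffi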

annambiar commented 1 year ago

Hi Judith. We are currently working on migrating PRoBERTa from fairseq to HuggingFace (partly because of issues similar to yours). HuggingFace is also a lot simpler to use. We expect that this change should happen in the next couple weeks. I'll leave another comment here when we've successfully made the change.