Oneflow-Inc / oneflow

OneFlow is a deep learning framework designed to be user-friendly, scalable and efficient.
http://www.oneflow.org
Apache License 2.0

to_consistent pipeline model parallel bug #6783

Open · player1321 opened this issue 3 years ago

player1321 commented 3 years ago

Summary

When implementing pipeline parallelism with to_consistent, different parameters of the same op are placed on different GPUs, which makes the model fail to run. (error screenshot attached)

Code to reproduce bug

```python
class BertForPreTraining(nn.Module):
    def __init__(self, vocab_size, seq_length, hidden_size, hidden_layers, atten_heads,
                 intermediate_size, hidden_act, hidden_dropout_prob,
                 attention_probs_dropout_prob, max_position_embeddings, type_vocab_size,
                 initializer_range=0.02):
        super().__init__()
        self.initializer_range = initializer_range
        self.bert = BertModel(
            vocab_size,
            seq_length,
            hidden_size,
            hidden_layers,
            atten_heads,
            intermediate_size,
            hidden_act,
            hidden_dropout_prob,
            attention_probs_dropout_prob,
            max_position_embeddings,
            type_vocab_size,
        )

        self.cls = BertPreTrainingHeads(hidden_size, vocab_size)
        self.cls.to_consistent(placement=P1, sbp=BROADCAST)
        self.cls.predictions.decoder.to_consistent(placement=P1, sbp=BROADCAST)

        self.init_weights()

    def forward(self, input_ids, token_type_ids, attention_mask):
        sequence_output, pooled_output = self.bert(
            input_ids, token_type_ids, attention_mask
        )
        prediction_scores, seq_relationship_scores = self.cls(
            sequence_output, pooled_output
        )
        return prediction_scores, seq_relationship_scores
```
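
(P1 and BROADCAST are not defined in the snippet above; a minimal sketch of how they would typically be set up, assuming one machine with GPU 1 as the second pipeline stage:)

```python
import oneflow as flow

# Assumed definitions matching the {0: [1]} placement discussed below.
P1 = flow.placement("cuda", {0: [1]})  # machine 0, GPU 1
BROADCAST = flow.sbp.broadcast
```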

System Information

strint commented 3 years ago

All tensors of the same operation must have the same placement to execute properly.

Why do you want to place weight and bias on different devices?

If you want to do pipeline parallelism, you can put two operations on different devices. This is done by placing one operation's input tensors on device set A and the other operation's input tensors on another device set B, as in the sketch below.
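
A minimal sketch of that idea, assuming one machine with two GPUs (the layer sizes and names here are made up for illustration):

```python
import oneflow as flow
import oneflow.nn as nn

P0 = flow.placement("cuda", {0: [0]})  # stage 0 on GPU 0
P1 = flow.placement("cuda", {0: [1]})  # stage 1 on GPU 1
B = flow.sbp.broadcast

stage0 = nn.Linear(16, 16)
stage0.to_consistent(placement=P0, sbp=B)
stage1 = nn.Linear(16, 16)
stage1.to_consistent(placement=P1, sbp=B)

x = flow.randn(4, 16).to_consistent(placement=P0, sbp=B)
h = stage0(x)                             # runs on GPU 0
h = h.to_consistent(placement=P1, sbp=B)  # move the activation to stage 1's device set
y = stage1(h)                             # runs on GPU 1
```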

strint commented 3 years ago

If you want to do pipeline parallelism, you can refer to:

player1321 commented 3 years ago

@strint thanks for your reply. Actually, I want to put the whole module on device {0: [1]} following your tutorial, but I don't know why the weight and bias end up on different devices. There seems to be something wrong with the weight: when I put the module on {0: [1]}, the weight is on {0: [0]}; when I instead put the module on {0: [0]}, the weight is on {0: [1]}; the bias is on the right device in both cases. The code is modified from your hugging_face_competition baseline.

strint commented 3 years ago

What about self.init_weights()? Are there any to_consistent operations on the linear layer's weight?

You can check all the to_consistent operations on Module and Tensor.

player1321 commented 3 years ago

There are no to_consistent operations in self.init_weights(). Will initialization operations change the device? I found output_embeddings.weight = input_embeddings.weight in self.init_weights(); that weight tying is probably the bug. But output_embeddings.weight = output_embeddings.weight.to_consistent(placement=P1, sbp=BROADCAST) does not seem to be the right way to handle it, since it raises TypeError: cannot assign '<class 'oneflow._oneflow_internal.Tensor'>' as parameter 'weight' (nn.Parameter or None expected).
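
A rough, self-contained sketch of what seems to happen with the weight tying (the module names follow the comment above and the shapes are made up; assuming one machine with two GPUs):

```python
import oneflow as flow
import oneflow.nn as nn

P0 = flow.placement("cuda", {0: [0]})
P1 = flow.placement("cuda", {0: [1]})
BROADCAST = flow.sbp.broadcast

input_embeddings = nn.Embedding(100, 8)
input_embeddings.to_consistent(placement=P0, sbp=BROADCAST)

output_embeddings = nn.Linear(8, 100)
output_embeddings.to_consistent(placement=P1, sbp=BROADCAST)

# Weight tying: the decoder now shares the embedding's weight object, so its
# weight keeps the embedding's placement (GPU 0) even though the rest of the
# decoder was moved to P1 (GPU 1).
output_embeddings.weight = input_embeddings.weight
print(output_embeddings.weight.placement)  # the embedding's placement, not P1
```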

strint commented 3 years ago

Try this:

```python
output_embeddings.weight = nn.Parameter(
    output_embeddings.weight.to_consistent(placement=P1, sbp=BROADCAST)
)
```

player1321 commented 3 years ago

Thanks, it worked. But I got another bug: