Oneflow-Inc / oneflow

OneFlow is a deep learning framework designed to be user-friendly, scalable and efficient.
http://www.oneflow.org
Apache License 2.0

to_consistent pipeline model parallel bug #6783

Open · player1321 opened this issue 3 years ago

player1321 commented 3 years ago

Summary

When implementing pipeline parallelism with to_consistent, different parameters of the same op are placed on different GPUs, which makes the model fail to run. (error screenshot attached)

Code to reproduce bug

```python
class BertForPreTraining(nn.Module):
    def __init__(self, vocab_size, seq_length, hidden_size, hidden_layers, atten_heads,
                 intermediate_size, hidden_act, hidden_dropout_prob,
                 attention_probs_dropout_prob, max_position_embeddings, type_vocab_size,
                 initializer_range=0.02):
        super().__init__()
        self.initializer_range = initializer_range
        self.bert = BertModel(
            vocab_size,
            seq_length,
            hidden_size,
            hidden_layers,
            atten_heads,
            intermediate_size,
            hidden_act,
            hidden_dropout_prob,
            attention_probs_dropout_prob,
            max_position_embeddings,
            type_vocab_size,
        )

        self.cls = BertPreTrainingHeads(hidden_size, vocab_size)
        self.cls.to_consistent(placement=P1, sbp=BROADCAST)
        self.cls.predictions.decoder.to_consistent(placement=P1, sbp=BROADCAST)

        self.init_weights()

    def forward(self, input_ids, token_type_ids, attention_mask):
        sequence_output, pooled_output = self.bert(
            input_ids, token_type_ids, attention_mask
        )
        prediction_scores, seq_relationship_scores = self.cls(
            sequence_output, pooled_output
        )
        return prediction_scores, seq_relationship_scores
```
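
(P1 and BROADCAST are not defined in the snippet above; a minimal sketch of how they would typically be set up, assuming one machine with GPU 1 as the second pipeline stage:)

```python
import oneflow as flow

# Assumed definitions matching the {0: [1]} placement discussed below.
P1 = flow.placement("cuda", {0: [1]})  # machine 0, GPU 1
BROADCAST = flow.sbp.broadcast
```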

System Information

strint commented 3 years ago

All tensors of the same operation must have the same placement to execute properly.

Why do you want to place weight and bias on different devices?

If you want to do pipeline parallelism, you can put two operations on different devices. This is done by placing one operation's input tensors on device set A and the other operation's input tensors on another device set B, as in the sketch below.
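
A minimal sketch of that idea, assuming one machine with two GPUs (the layer sizes and names here are made up for illustration):

```python
import oneflow as flow
import oneflow.nn as nn

P0 = flow.placement("cuda", {0: [0]})  # stage 0 on GPU 0
P1 = flow.placement("cuda", {0: [1]})  # stage 1 on GPU 1
B = flow.sbp.broadcast

stage0 = nn.Linear(16, 16)
stage0.to_consistent(placement=P0, sbp=B)
stage1 = nn.Linear(16, 16)
stage1.to_consistent(placement=P1, sbp=B)

x = flow.randn(4, 16).to_consistent(placement=P0, sbp=B)
h = stage0(x)                             # runs on GPU 0
h = h.to_consistent(placement=P1, sbp=B)  # move the activation to stage 1's device set
y = stage1(h)                             # runs on GPU 1
```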

strint commented 3 years ago

If you want to do pipeline parallelism, you can refer to:

player1321 commented 3 years ago

@strint thanks for your reply. Actually, I want to put the whole module on device {0: [1]} following your tutorial, but I don't know why the weight and bias end up on different devices. There seems to be something wrong with the weight: when I put the module on {0: [1]}, the weight is on {0: [0]}; when I instead put the module on {0: [0]}, the weight is on {0: [1]}; the bias is on the right device in both cases. The code is modified from your hugging_face_competition baseline.

strint commented 3 years ago

What about self.init_weights()? Are there any to_consistent operations on the linear layer's weight?

You can check all the to_consistent operations on Module and Tensor.

player1321 commented 3 years ago

There are no to_consistent operations in self.init_weights(). Will initialization operations change the device? I found output_embeddings.weight = input_embeddings.weight in self.init_weights(); that weight tying is probably the bug. But output_embeddings.weight = output_embeddings.weight.to_consistent(placement=P1, sbp=BROADCAST) does not seem to be the right way to handle it, since it raises TypeError: cannot assign '<class 'oneflow._oneflow_internal.Tensor'>' as parameter 'weight' (nn.Parameter or None expected).
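
A rough, self-contained sketch of what seems to happen with the weight tying (the module names follow the comment above and the shapes are made up; assuming one machine with two GPUs):

```python
import oneflow as flow
import oneflow.nn as nn

P0 = flow.placement("cuda", {0: [0]})
P1 = flow.placement("cuda", {0: [1]})
BROADCAST = flow.sbp.broadcast

input_embeddings = nn.Embedding(100, 8)
input_embeddings.to_consistent(placement=P0, sbp=BROADCAST)

output_embeddings = nn.Linear(8, 100)
output_embeddings.to_consistent(placement=P1, sbp=BROADCAST)

# Weight tying: the decoder now shares the embedding's weight object, so its
# weight keeps the embedding's placement (GPU 0) even though the rest of the
# decoder was moved to P1 (GPU 1).
output_embeddings.weight = input_embeddings.weight
print(output_embeddings.weight.placement)  # the embedding's placement, not P1
```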

strint commented 3 years ago

Try this:

```python
output_embeddings.weight = nn.Parameter(
    output_embeddings.weight.to_consistent(placement=P1, sbp=BROADCAST)
)
```

player1321 commented 3 years ago

Thanks, it worked. But I got another bug: