It seems that there exists a DropVar in the microbatch_bound.outvars.
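For reference, a DropVar in JAX is the placeholder a jaxpr uses for an equation output that is dropped, so it carries no real value and has no matching gradient entry. Below is a minimal sketch of how such placeholders could be filtered out before comparing variable lists; this is only an illustration of the idea, not Alpa's actual code, and the helper name is made up.

```python
from jax import core  # DropVar lives in jax.core for the JAX versions Alpa targets

# Hypothetical helper (not part of Alpa): keep only real variables, so that
# e.g. len(real_vars(microbatch_bound.invars)) can be compared with len(gradients).
def real_vars(variables):
    return [v for v in variables if not isinstance(v, core.DropVar)]
```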
In addition, this error happens with an implementation of train_step like this:
```python
def train_step(state, batch):
    def compute_loss(params):
        labels = batch.pop("labels")
        logits = state.apply_fn(**batch, params=params, train=True)[0]
        loss = loss_fn(logits, labels)
        return loss

    grad_fn = alpa.value_and_grad(compute_loss)
    loss, grad = grad_fn(state.params)
    new_state = state.apply_gradients(grads=grad)
    metrics = {"loss": loss, "learning_rate": linear_decay_lr_schedule_fn(state.step)}
    return new_state, metrics
```
but the error does not happen with the following implementation, which returns None instead of the metrics dict:
```python
def train_step(state, batch):
    def compute_loss(params):
        labels = batch.pop("labels")
        logits = state.apply_fn(**batch, params=params, train=True)[0]
        loss = loss_fn(logits, labels)
        return loss

    grad_fn = alpa.value_and_grad(compute_loss)
    loss, grad = grad_fn(state.params)
    new_state = state.apply_gradients(grads=grad)
    metrics = {"loss": loss}
    return new_state, None
```
@ZYHowell Could you take a look?
It seems this assertion was removed in https://github.com/alpa-projects/alpa/pull/681. Could you please try the nightly Alpa?
Closed due to inactivity.
I am running pipeshard parallelism on a ViT-Large model. The relevant code is:
Note that I set num_micro_batches to 1.
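For completeness, here is a minimal sketch of where num_micro_batches is typically set when using Alpa's pipeshard parallelism; this is not the omitted code from the report, just the usual shape of the API.

```python
import alpa

# Assumed configuration matching the report: a single micro-batch.
method = alpa.PipeshardParallel(num_micro_batches=1)

@alpa.parallelize(method=method)
def train_step(state, batch):
    ...  # the train_step body itself is omitted here
```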
After running, the error message below is thrown:
I debugged down to the line where the error occurs and found that the length of gradients is 392, while the length of microbatch_bound.invars is 393. The value stored there is: