AlibabaPAI / torchacc

PyTorch distributed training acceleration framework
Apache License 2.0

xlarun --nproc_per_node=8 YOUR_MODEL.py #20

Open a1342772 opened 4 weeks ago

a1342772 commented 4 weeks ago

I used the container you provided, but the command is not found: `xlarun: command not found`.

a1342772 commented 4 weeks ago

@anw90 @YongCHN

anw90 commented 4 weeks ago

Which document are you referring to? xlarun is now deprecated. You can use torchrun directly, and take a look at the FSDP example: https://torchacc.readthedocs.io/en/latest/dist/fsdp.html#fsdp
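
For example, the original launch command can be run with torchrun instead (assuming the same single node with 8 GPUs): `torchrun --nproc_per_node=8 YOUR_MODEL.py`.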

a1342772 commented 4 weeks ago

Thanks.

a1342772 commented 4 weeks ago

Does it support recommendation scenarios? Our features are column features. Below is our code:

```python
def _run_epoch(self, epoch: int, dataloader: DataLoader, train: bool = True):
    for _iter, (features, labels) in enumerate(dataloader):
        features = {feat_name: torch.as_tensor(data=feat_data, dtype=torch.long, device=self.gpu_id)
                    for feat_name, feat_data in features.items()}
        labels = {label_name: torch.as_tensor(data=label_data, dtype=torch.float, device=self.gpu_id)
                  for label_name, label_data in labels.items()}
        step_type = "Train" if train else "Eval"
        batch_loss = self._run_batch(features, labels, train)
```

```python
def _run_batch(self, features, labels, train: bool = True):
    with torch.set_grad_enabled(train), torch.amp.autocast(device_type="cuda", dtype=torch.float16,
                                                           enabled=self.config.use_amp):
        score = self.model(features)
        loss = self.cal_loss(score, labels)
    if train:
        self.optimizer.zero_grad(set_to_none=True)
        if self.config.use_amp:
            self.scaler.scale(loss).backward()
            if self.config.use_clip_grad:
                torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.config.grad_norm_clip)
            self.scaler.step(self.optimizer)
            self.scaler.update()
        else:
            loss.backward()
            if self.config.use_clip_grad:
                torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.config.grad_norm_clip)
            self.optimizer.step()
    return loss.item()
```

a1342772 commented 4 weeks ago

@anw90 @Yancey1989 Can you help answer this question?

anw90 commented 4 weeks ago

We have not tested torchacc with CTR models before, but you can try it by wrapping your self.model with torchacc.accelerate. This document might be helpful to you: https://torchacc.readthedocs.io/en/latest/dist/dp.html.
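
A minimal sketch of what that wrapping might look like. The toy CTR model and its feature names are placeholders standing in for the trainer code earlier in this thread, and the exact signature of torchacc.accelerate should be confirmed against the linked data-parallel doc:

```python
import torch
import torch.nn as nn
import torchacc  # available in the TorchAcc container

# A toy stand-in for the CTR model above: one embedding table per column
# feature plus a linear scoring head. Purely illustrative, not torchacc API.
class ToyCTRModel(nn.Module):
    def __init__(self, feature_names, vocab_size=1000, dim=16):
        super().__init__()
        self.embeddings = nn.ModuleDict(
            {name: nn.Embedding(vocab_size, dim) for name in feature_names}
        )
        self.head = nn.Linear(dim * len(feature_names), 1)

    def forward(self, features):
        # features: dict of feature name -> LongTensor of ids, shape (batch, num_ids)
        embs = [self.embeddings[name](ids).mean(dim=1) for name, ids in features.items()]
        return self.head(torch.cat(embs, dim=-1))

model = ToyCTRModel(feature_names=["user_id", "item_id"])

# Wrap the model with torchacc before building the optimizer and entering the
# training loop, as suggested above. Assumption: torchacc.accelerate returns
# the accelerated model; see
# https://torchacc.readthedocs.io/en/latest/dist/dp.html for the exact usage
# and the recommended way to move batches to the device.
model = torchacc.accelerate(model)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```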