a1342772 opened 4 weeks ago
@anw90 @YongCHN
Which document are you referring to? xlarun is now deprecated. You can use torchrun directly, and take a look at the FSDP example: https://torchacc.readthedocs.io/en/latest/dist/fsdp.html#fsdp
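For example, a single-node launch might look like this (assuming 8 GPUs and that your training entry point is train.py; adjust the flags to your setup):

```bash
# Launch 8 workers on one node; torchrun ships with PyTorch and replaces xlarun.
torchrun --nproc_per_node=8 train.py
```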
Thanks.
Does it support recommendation scenarios? Our features are column features. Below is our code:

```python
def _run_epoch(self, epoch: int, dataloader: DataLoader, train: bool = True):
    for _iter, (features, labels) in enumerate(dataloader):
        features = {feat_name: torch.as_tensor(data=feat_data, dtype=torch.long, device=self.gpu_id)
                    for feat_name, feat_data in features.items()}
        labels = {label_name: torch.as_tensor(data=label_data, dtype=torch.float, device=self.gpu_id)
                  for label_name, label_data in labels.items()}
        step_type = "Train" if train else "Eval"
        batch_loss = self._run_batch(features, labels, train)
```
```python
def _run_batch(self, features, labels, train: bool = True):
    with torch.set_grad_enabled(train), torch.amp.autocast(device_type="cuda", dtype=torch.float16,
                                                           enabled=self.config.use_amp):
        score = self.model(features)
        loss = self.cal_loss(score, labels)
    if train:
        self.optimizer.zero_grad(set_to_none=True)
        if self.config.use_amp:
            self.scaler.scale(loss).backward()
            if self.config.use_clip_grad:
                # Unscale gradients before clipping; otherwise the norm is
                # computed on scaled gradients and the threshold is wrong.
                self.scaler.unscale_(self.optimizer)
                torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.config.grad_norm_clip)
            self.scaler.step(self.optimizer)
            self.scaler.update()
        else:
            loss.backward()
            if self.config.use_clip_grad:
                torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.config.grad_norm_clip)
            self.optimizer.step()
    return loss.item()
```
@anw90 @Yancey1989 Can you help answer this question?
We have not tested torchacc with CTR models before, but you can try it by wrapping your self.model with torchacc.accelerate. This document might be helpful to you: https://torchacc.readthedocs.io/en/latest/dist/dp.html.
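For example, the model-side change might look like this (a minimal sketch; CTRModel is just a placeholder for your own module, and the exact arguments of torchacc.accelerate are described in the dp.html document above):

```python
import torch
import torchacc


class CTRModel(torch.nn.Module):
    """Placeholder CTR model; substitute your own module."""

    def __init__(self, num_features: int = 16):
        super().__init__()
        self.linear = torch.nn.Linear(num_features, 1)

    def forward(self, x):
        return torch.sigmoid(self.linear(x))


model = CTRModel()

# Wrap the model so torchacc takes over device placement and data parallelism.
# The exact signature of torchacc.accelerate may differ by version; see the
# dp.html document linked above.
model = torchacc.accelerate(model)

# Build the optimizer from the wrapped model's parameters as usual.
optimizer = torch.optim.Adam(model.parameters())
```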
When I run xlarun in the container you provided, I get `xlarun: command not found`.