determined-ai / determined

Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.
https://determined.ai
Apache License 2.0

🤔[question] where to set the `find_unused_parameters=True` #9814

Closed caiduoduo12138 closed 1 month ago

caiduoduo12138 commented 1 month ago

Describe your question

Because I froze some layers in the network, some parameters cannot be updated. So I want to set the `find_unused_parameters` argument of DistributedDataParallel to solve this problem. Here is my YAML:

bind_mounts:
  - host_path: /home/cai/project/determined-0.34.0/examples/object_detection/checkpoints/
    container_path: /mnt/checkpoints/
  - host_path: /home/cai/project/determined-0.34.0/examples/object_detection/dataset/
    container_path: /mnt/dataset/
description: an object detection task
entrypoint: Det:Det
hyperparameters:
  global_batch_size: 16
max_restarts: 0
name: fcos
resources:
  slots_per_trial: 4
scheduling_unit: 1
records_per_epoch: 60000
min_checkpoint_period:
  epochs: 1
min_validation_period:
  epochs: 1
searcher:
  max_length:
    epochs: 12
  metric: mAP
  name: single
  smaller_is_better: false
labels:
- caida

Here is my code:

"""
This example shows how to interact with the Determined PyTorch interface to
build a basic object detection task.

In the `__init__` method, the model and optimizer are wrapped with `wrap_model`
and `wrap_optimizer`. This model is single-input and single-output.

The methods `train_batch` and `evaluate_batch` define the forward pass
for training and evaluation respectively.
"""

from typing import Any, Dict, Sequence, List, Union, cast
import torch
from determined.pytorch import DataLoader, PyTorchTrial, PyTorchTrialContext
from model.fcos import FCOSDetector
from dataset.COCO_dataset import COCODataset
from dataset.augment import Transforms
from eval import COCOGenerator, evaluate_coco

TorchData = Union[Dict[str, torch.Tensor], Sequence[torch.Tensor], torch.Tensor]

class Det(PyTorchTrial):
    def __init__(self, context: PyTorchTrialContext) -> None:
        self.context = context
        self.model = self.context.wrap_model(FCOSDetector(mode="training"))
        self.optimizer = self.context.wrap_optimizer(
            torch.optim.SGD(self.model.parameters(), lr=0.005, momentum=0.9, weight_decay=0.0001)
        )
        self.generator = COCOGenerator("/mnt/dataset/dataset/val_images", '/mnt/dataset/annotations/val.json')

    def build_training_data_loader(self) -> DataLoader:
        transform = Transforms()
        train_data = COCODataset("/mnt/dataset/images", '/mnt/dataset/annotations/train.json', transform=transform)
        return DataLoader(train_data, batch_size=self.context.get_per_slot_batch_size(),  collate_fn=train_data.collate_fn, shuffle=True)

    def build_validation_data_loader(self) -> DataLoader:
        transform = Transforms()
        validation_data = COCODataset("/mnt/dataset/val_images", '/mnt/dataset/annotations/val.json', transform=transform)
        return DataLoader(validation_data, batch_size=self.context.get_per_slot_batch_size(), shuffle=False)

    def train_batch(
        self, batch: TorchData, epoch_idx: int, batch_idx: int
    ) -> Dict[str, torch.Tensor]:
        batch = list(batch)
        # imgs, bboxes, classes = batch
        loss = self.model(batch)
        loss = loss[-1].mean()
        self.context.backward(loss)
        self.context.step_optimizer(self.optimizer)

        return {"loss": loss}

    def evaluate_batch(self, batch: TorchData) -> Dict[str, Any]:
        # imgs, bboxes, classes = batch
        batch = list(batch)
        inference_model = FCOSDetector(mode="inference")
        inference_model.load_state_dict(self.model.state_dict())
        validation_loss = self.model(batch)
        validation_loss = validation_loss[-1].mean()
        mAP = evaluate_coco(self.generator, inference_model)[0]

        return {"validation_loss": validation_loss, "mAP": mAP}


ioga commented 1 month ago

Hello, you should join our community Slack.

You can find an invite link in the header of our website.

We had a thread discussing this issue a few months ago: https://determined-community.slack.com/archives/CV3MTNZ6U/p1711476516177439?thread_ts=1711333054.314639&cid=CV3MTNZ6U
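
For readers without access to that Slack thread, here is a minimal plain-PyTorch sketch of the two usual ways to handle frozen layers under DistributedDataParallel. It is not taken from the thread and does not show how Determined's `wrap_model` builds its DDP wrapper, which this issue leaves unanswered. The toy model and the helpers `freeze`, `trainable_parameters`, and `wrap_ddp` are hypothetical names made up for illustration. The first option relies on DDP, to my understanding, registering only parameters that have `requires_grad=True` at the time the model is wrapped, so freezing before wrapping may make the flag unnecessary. The second option is the `find_unused_parameters` constructor argument of `torch.nn.parallel.DistributedDataParallel`.

```python
import torch
from torch import nn
from torch.nn.parallel import DistributedDataParallel


def freeze(module: nn.Module) -> None:
    """Turn off gradient computation for every parameter of `module`."""
    for param in module.parameters():
        param.requires_grad_(False)


def trainable_parameters(model: nn.Module) -> list:
    """Return only the parameters that still require gradients."""
    return [p for p in model.parameters() if p.requires_grad]


def wrap_ddp(model: nn.Module, find_unused_parameters: bool = True) -> DistributedDataParallel:
    """Option 2 (plain PyTorch only): pass the flag to the DDP constructor.

    Requires torch.distributed.init_process_group() to have been called,
    so it is defined here but not executed in this sketch.
    """
    return DistributedDataParallel(model, find_unused_parameters=find_unused_parameters)


# Hypothetical stand-in for a detector: a "backbone" layer and a "head" layer.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
backbone, head = model[0], model[2]

# Option 1: freeze BEFORE the model is wrapped for distributed training,
# and give the optimizer only the parameters that remain trainable.
freeze(backbone)
optimizer = torch.optim.SGD(trainable_parameters(model), lr=0.005, momentum=0.9)

# One single-process training step to show that only the head gets gradients.
loss = model(torch.randn(2, 8)).mean()
loss.backward()
optimizer.step()
```

In the trial above, option 1 would correspond to freezing layers inside `FCOSDetector` before the call to `self.context.wrap_model(...)`. Note that `find_unused_parameters=True` adds a per-iteration traversal of the autograd graph, so it is usually worth trying the freeze-first approach before reaching for the flag.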