Cysu / open-reid

Open source person re-identification library in python
https://cysu.github.io/open-reid/
MIT License

random seed implementation is wrong #21

Closed zydou closed 6 years ago

zydou commented 7 years ago

I cloned the latest version of open-reid (the latest commit is a1df21b). First, I ran the example code:

python examples/softmax_loss.py -d viper -b 64 -j 2 -a resnet50 --logs-dir logs/softmax-loss/viper-resnet50

The result is:

Mean AP: 15.5%
CMC Scores    allshots      cuhk03  market1501
  top-1           7.1%       12.2%        7.1%
  top-5          23.6%       35.6%       23.6%
  top-10         32.9%       47.3%       32.9%

Then I ran the same code again on the same machine:

python examples/softmax_loss.py -d viper -b 64 -j 2 -a resnet50 --logs-dir logs/softmax-loss/viper-resnet50

The result is:

Mean AP: 15.6%
CMC Scores    allshots      cuhk03  market1501
  top-1           7.9%       13.0%        7.9%
  top-5          20.9%       32.8%       20.9%
  top-10         30.9%       44.8%       30.9%

It's weird that they are different. It seems that these two lines have no effect: https://github.com/Cysu/open-reid/blob/a1df21b00f9d3ecfce1329fef55af11f406c16a8/examples/softmax_loss.py#L71-L72

In the DataLoader, train_transformer uses RandomSizedRectCrop and RandomHorizontalFlip: https://github.com/Cysu/open-reid/blob/a1df21b00f9d3ecfce1329fef55af11f406c16a8/examples/softmax_loss.py#L36-L41

But RandomSizedRectCrop and RandomHorizontalFlip use the Python built-in random module rather than numpy.random: https://github.com/Cysu/open-reid/blob/a1df21b00f9d3ecfce1329fef55af11f406c16a8/reid/utils/data/transforms.py#L19-L42


import random

from PIL import Image


class RandomHorizontalFlip(object):
    """Horizontally flip the given PIL.Image randomly with a probability of 0.5."""

    def __call__(self, img):
        """
        Args:
            img (PIL.Image): Image to be flipped.
        Returns:
            PIL.Image: Randomly flipped image.
        """
        # Uses Python's built-in random module, which np.random.seed() does not seed.
        if random.random() < 0.5:
            return img.transpose(Image.FLIP_LEFT_RIGHT)
        return img

(Note: this is the RandomHorizontalFlip source code from torchvision.transforms.)
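
A quick way to see that the two generators are independent is that seeding numpy.random has no effect on the built-in random module (a minimal check, standard library plus NumPy only):

import random

import numpy as np

# Seeding NumPy's global generator does not touch Python's built-in one:
np.random.seed(0)
print(np.random.random())  # reproducible across runs
print(random.random())     # still differs from run to run

# The built-in random module has to be seeded separately:
random.seed(0)
print(random.random())     # now reproducible across runs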

So in examples/softmax_loss.py, I imported random and changed:

def main(args):
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)

to:

def main(args):
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
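
(As a side note, not part of the original script: depending on the PyTorch version, torch.manual_seed may not seed the CUDA generators, so a fuller seeding helper might look like the sketch below; seed_everything is a hypothetical name.)

import random

import numpy as np
import torch

def seed_everything(seed):
    # Hypothetical helper, not in open-reid: seed all RNGs touched by training.
    random.seed(seed)             # Python's built-in RNG (used by the transforms)
    np.random.seed(seed)          # NumPy global RNG
    torch.manual_seed(seed)       # PyTorch CPU RNG
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)  # PyTorch CUDA RNGs on all devices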

Then I ran the same example code twice. The results were still different. Next, in reid/utils/data/transforms.py, I changed https://github.com/Cysu/open-reid/blob/a1df21b00f9d3ecfce1329fef55af11f406c16a8/reid/utils/data/transforms.py#L26-L29 to:

for attempt in range(10):
    area = img.size[0] * img.size[1]
    target_area = random.uniform(0.64, 1.0) * area
    print(target_area)
    aspect_ratio = random.uniform(2, 3)

Then I ran the example code twice. The target_area values differ between the first and second runs, indicating that random.seed(args.seed) does not take effect. So I rewrote reid/utils/data/transforms.py with numpy.random. The final reid/utils/data/transforms.py is:

from __future__ import absolute_import

import math

import numpy as np
from PIL import Image

from torchvision.transforms import *

class RandomHorizontalFlip(object):
    """Horizontally flip the given PIL.Image randomly with a probability of 0.5."""

    def __call__(self, img):
        """
        Args:
            img (PIL.Image): Image to be flipped.
        Returns:
            PIL.Image: Randomly flipped image.
        """
        if np.random.random() < 0.5:
            return img.transpose(Image.FLIP_LEFT_RIGHT)
        return img

class RectScale(object):
    def __init__(self, height, width, interpolation=Image.BILINEAR):
        self.height = height
        self.width = width
        self.interpolation = interpolation

    def __call__(self, img):
        w, h = img.size
        if h == self.height and w == self.width:
            return img
        return img.resize((self.width, self.height), self.interpolation)

class RandomSizedRectCrop(object):
    def __init__(self, height, width, interpolation=Image.BILINEAR):
        self.height = height
        self.width = width
        self.interpolation = interpolation

    def __call__(self, img):
        for attempt in range(10):
            area = img.size[0] * img.size[1]
            target_area = np.random.uniform(0.64, 1.0) * area
            print(target_area)  # debug: verify the sampled area is identical across runs
            aspect_ratio = np.random.uniform(2, 3)

            h = int(round(math.sqrt(target_area * aspect_ratio)))
            w = int(round(math.sqrt(target_area / aspect_ratio)))

            if w <= img.size[0] and h <= img.size[1]:
                x1 = np.random.randint(0, img.size[0] - w + 1)
                y1 = np.random.randint(0, img.size[1] - h + 1)

                img = img.crop((x1, y1, x1 + w, y1 + h))
                assert(img.size == (w, h))

                return img.resize((self.width, self.height), self.interpolation)

        # Fallback
        scale = RectScale(self.height, self.width,
                          interpolation=self.interpolation)
        return scale(img)

Then I ran the example code twice. The target_area values are now the same between the first and second runs, but the final results (mAP, CMC) are still different. I'm wondering what's wrong with the code. Could you check it and answer my question?

Cysu commented 7 years ago

@zydou Thank you very much for the thorough investigation! I think your modification is correct. I suspect the reason the final performance still differs is that GPU computation is inherently non-deterministic. Could you please try running the experiment on a single CPU core?

Cysu commented 6 years ago

@zydou You could run with the argument -j 0, which will use a single thread.
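
(For reference: when -j is greater than 0, the transforms run inside DataLoader worker processes, each with its own copy of the RNG state. A minimal sketch of per-worker seeding using DataLoader's standard worker_init_fn argument; seed_worker and the loader shown are illustrative, not part of open-reid:)

import random

import numpy as np
from torch.utils.data import DataLoader

def seed_worker(worker_id):
    # Hypothetical helper, not in open-reid: give every worker a fixed,
    # distinct seed so Python and NumPy draws are reproducible across runs.
    base_seed = 1  # assumed to match args.seed
    random.seed(base_seed + worker_id)
    np.random.seed(base_seed + worker_id)

# loader = DataLoader(dataset, batch_size=64, num_workers=2,
#                     worker_init_fn=seed_worker)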

I have tried it myself. When using the GPU, I found that the losses of the first several iterations are the same across different trials, but they became different afterwards and led to different final results. For example, the first trial could be

Epoch: [0][1/27]  Time 2.252 (2.252)  Data 0.029 (0.029)  Loss 5.377 (5.377)  Prec 0.00% (0.00%)
Epoch: [0][2/27]  Time 0.268 (1.260)  Data 0.022 (0.026)  Loss 5.382 (5.379)  Prec 0.00% (0.00%)
Epoch: [0][3/27]  Time 0.224 (0.915)  Data 0.020 (0.024)  Loss 5.432 (5.397)  Prec 0.00% (0.00%)
Epoch: [0][4/27]  Time 0.259 (0.751)  Data 0.020 (0.023)  Loss 5.431 (5.405)  Prec 0.00% (0.00%)
Epoch: [0][5/27]  Time 0.260 (0.652)  Data 0.020 (0.022)  Loss 5.464 (5.417)  Prec 0.00% (0.00%)
Epoch: [0][6/27]  Time 0.258 (0.587)  Data 0.020 (0.022)  Loss 5.553 (5.440)  Prec 0.00% (0.00%)

While the second trial is

Epoch: [0][1/27]  Time 2.229 (2.229)  Data 0.029 (0.029)  Loss 5.377 (5.377)  Prec 0.00% (0.00%)
Epoch: [0][2/27]  Time 0.273 (1.251)  Data 0.022 (0.026)  Loss 5.382 (5.379)  Prec 0.00% (0.00%)
Epoch: [0][3/27]  Time 0.219 (0.907)  Data 0.020 (0.024)  Loss 5.432 (5.397)  Prec 0.00% (0.00%)
Epoch: [0][4/27]  Time 0.261 (0.745)  Data 0.020 (0.023)  Loss 5.431 (5.405)  Prec 0.00% (0.00%)
Epoch: [0][5/27]  Time 0.259 (0.648)  Data 0.020 (0.022)  Loss 5.463 (5.417)  Prec 0.00% (0.00%)
Epoch: [0][6/27]  Time 0.259 (0.583)  Data 0.020 (0.022)  Loss 5.557 (5.440)  Prec 0.00% (0.00%)

But when using the CPU (you may need to remove the .cuda() and DataParallel calls in the code), it always leads to the same results. This verifies that GPU computation is inherently non-deterministic.
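
(Related, though not discussed in this thread: depending on the PyTorch and cuDNN versions, cuDNN can be asked to prefer deterministic algorithms, which reduces but does not always eliminate run-to-run variation on the GPU. A minimal sketch:)

import torch

# Both flags live in torch.backends.cudnn; exact coverage depends on the
# PyTorch / cuDNN versions in use.
torch.backends.cudnn.deterministic = True  # prefer deterministic kernels
torch.backends.cudnn.benchmark = False     # disable auto-tuning, which may pick
                                           # different algorithms across runs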

zydou commented 6 years ago

@Cysu Hi, Tong Xiao. Thanks for your reply! I did a few more experiments below (using the numpy.random version of transforms.py in all of them):

So I don't agree that

GPU computation is inherently non-deterministic.

But I can't explain why this happens. Do you know the reason? Thanks a lot!

Cysu commented 6 years ago

@zydou I mean that some of the CUDA kernels used by cuDNN or the torch C implementation could be non-deterministic. One reason is that floating-point addition is not associative: in Python, 0.7 + 0.2 + 0.1 == 0.7 + 0.1 + 0.2 prints False. This implies that a reduce op executed with multiple threads / processes can be non-deterministic.
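
Concretely, in a Python interpreter the two evaluation orders round differently:

>>> 0.7 + 0.2 + 0.1
0.9999999999999999
>>> 0.7 + 0.1 + 0.2
1.0
>>> 0.7 + 0.2 + 0.1 == 0.7 + 0.1 + 0.2
False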

When the batch size is set to 1, I suspect there is no need to call the reduce op, which would lead to the same result.

zydou commented 6 years ago

@Cysu Thanks a lot!