Sucran commented 5 years ago

Hi, @Ugness, I ran into a RAM memory leak when running network.py and train.py, and it has confused me for a few days. Other PyTorch repos run fine for me. I am running the code on Ubuntu 14.04 with PyTorch 0.4.1, CUDA 8.0, and cuDNN 6.0.

Ugness commented 5 years ago

Yes, I also ran into the memory leak problem. In my case, my GPU has 11 GB of VRAM, and the model uses about 9 GB with batch size 1 for the 3 * 3 convolution version. (The paper recommends batch size 10, but I could not train the model with that option.)

Maybe the 1 * 1 convolution version (branch adjusted) would use less memory.

I think the memory leak comes from my implementation of the local PiCANet. It turns an H * W * C tensor into H * W patches of size 14 * 14 * C. If you have a better idea for implementing the local PiCANet, please comment here or make a pull request.

Since I am not the author of the paper, this code is not the best implementation. I'm sorry for that.
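To get a feel for why one patch per pixel is expensive, here is a rough back-of-the-envelope sketch. The helper name `patch_tensor_bytes` and the layer size (256 channels, 56 x 56) are hypothetical, chosen only for illustration, and float32 storage is assumed; the actual layer sizes in the repo may differ.

```python
def patch_tensor_bytes(n, c, h, w, patch=14, bytes_per_elem=4):
    """Bytes needed to store one (patch x patch x C) patch for every pixel.

    Total elements: N * (H * W) * C * patch * patch.
    """
    return n * h * w * c * patch * patch * bytes_per_elem


def feature_map_bytes(n, c, h, w, bytes_per_elem=4):
    """Bytes needed for the plain (N, C, H, W) feature map."""
    return n * c * h * w * bytes_per_elem


# Hypothetical layer size, for illustration only.
patches = patch_tensor_bytes(n=1, c=256, h=56, w=56)
plain = feature_map_bytes(n=1, c=256, h=56, w=56)
print(f"patch tensor: {patches / 2**20:.1f} MiB, feature map: {plain / 2**20:.1f} MiB")
print(f"blow-up factor: {patches // plain}x")
```

The blow-up factor is always patch * patch, independent of the layer size, and autograd may keep such tensors alive for the backward pass, so the cost is paid at every attention layer.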

Sucran commented 5 years ago

Yes, I noticed the batch size option; it is weird and strange. I have no better idea so far, but I hope for further discussion. This week, I will go through the author's Caffe code and look at the differences between the PyTorch and Caffe implementations, going deeper into the local PiCANet and global PiCANet.

Ugness commented 5 years ago

Can you give me the link to the Caffe implementation? I didn't know about it. Thanks.

Sucran commented 5 years ago

@Ugness https://github.com/nian-liu/PiCANet, with deeplab caffe version.

Ugness commented 5 years ago

Thanks a lot.

Sucran commented 5 years ago

@Ugness I changed the PiCANet config from 'GGLLL' to 'GGGGG' and 'LLLLL'; both have the memory leak problem when running network.py. Have you seen this before? I also found something interesting in the authors' Caffe code: they seem to have implemented their own attpooling function in their own proto and cpp files, which supports their global or local attention function, like conv3d. Can you give me a hint on how you think about the conv3d processing?

Ugness commented 5 years ago

I think that would not work with 'GGGGG' or 'LLLLL'. I just tested with 'GGLLL' and other options may cause some tensor dimension error.

And I will check protocpp ASAP. For my conv3d processing, it is not easy to describe with text only. :( I will describe with text first, but if you need more information to help your understanding, I'll make some images ASAP.

Ugness commented 5 years ago

How Conv3d works?

Assumption

Let's say Image_batch's shape is (N x C x H x W),
and the attention map (of each pixel position) has shape (h x w) (for global, h=H, w=W).
We have (H * W) attention maps.
Each attention map should be applied to its pixel's patch, across every channel.
I think doing this with a for loop would take a lot of time
(I don't know how to write a CUDA-level for loop, and I heard that a default for loop runs on the CPU),
so I tried to use PyTorch's pre-implemented convolution functions.

What's difference between convolution and 'PiCA' process?

Convolution applies the same kernel to every patch (pixel location) and every sample in the batch, but different kernel weights to each channel.
The PiCA process applies the same att_map to every channel, but a different att_map to each patch (pixel location) and each sample in the batch.

PiCA process with Conv3d (Main Idea of method)

On the image side, my idea is to send the batch and location dimensions to dim:1 (channel).

(1, NxHxW, C, 13, 13) -> for F.conv3d, each dimension means (batch, channel, depth, H, W)

On the kernel side, my idea is that each (1,1,7,7) kernel covers (1,1,13,13) by using the F.conv3d dilation option.
Then F.conv3d applies NxHxW kernels to NxHxW patches. This is possible by using the groups option.
Also, F.conv3d slides across the depth dimension (C, dim:2) with the same att_map.
Finally, the output is a (1, NxHxW, C, 1, 1) attention-applied feature map, so I can reshape it to (N, C, H, W).

I used the same idea for the local PiCANet.
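The shape juggling described above can be sketched roughly as follows, for the case of a 7 x 7 attention map with dilation 2 over 13 x 13 patches. This is only an illustration, not the repo's actual code: the function name `local_attend_conv3d` is hypothetical, and building the patches with `F.unfold` is an assumption made here for brevity.

```python
import torch
import torch.nn.functional as F


def local_attend_conv3d(x, att, kernel=7, dilation=2):
    """x: (N, C, H, W) features; att: (N, H*W, kernel, kernel), one map per pixel."""
    n, c, h, w = x.shape
    span = dilation * (kernel - 1) + 1  # 13 for kernel=7, dilation=2
    # One span x span patch around every pixel, zero padded at the border.
    patches = F.unfold(x, kernel_size=span, padding=span // 2)  # (N, C*span*span, H*W)
    patches = patches.view(n, c, span, span, h * w).permute(0, 4, 1, 2, 3)
    # Batch and location go to the channel dim: (batch, channel, depth, H, W).
    patches = patches.reshape(1, n * h * w, c, span, span)
    # One kernel per patch; depth size 1 so the same map is reused for all C channels.
    weight = att.reshape(n * h * w, 1, 1, kernel, kernel)
    out = F.conv3d(patches, weight, groups=n * h * w,
                   dilation=(1, dilation, dilation))  # (1, N*H*W, C, 1, 1)
    return out.view(n, h * w, c).permute(0, 2, 1).reshape(n, c, h, w)


x = torch.randn(2, 3, 5, 4)
att = torch.softmax(torch.randn(2, 5 * 4, 49), dim=-1).view(2, 5 * 4, 7, 7)
print(local_attend_conv3d(x, att).shape)
```

Note that the intermediate `patches` tensor holds N * H * W * C * 13 * 13 values, which is where most of the memory goes in this formulation.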

Ugness commented 5 years ago

[Image: conventional use of PyTorch's Conv3d]
[Image: my use of Conv3d]

Ugness commented 5 years ago

X_X
I thought Caffe was similar to PyTorch, but it isn't.
I tried to read the code, but I can't. The only thing I can see is that they used a for loop.
If they used a for loop to implement PiCANet, a for loop in Python takes a lot of time without CUDA logic. And I don't know how to write a CUDA for loop in Python. T.T

Sucran commented 5 years ago

@Ugness I do not think they use a loop to implement PiCANet. They use im2col and col2im, which correspond to torch.nn.Unfold and torch.nn.Fold in PyTorch. I suppose Conv3d can be expressed as a combination of im2col + matrix multiplication + col2im, but I am still unsure how to implement this and am still working on it. The memory leak problem we hit seems to be caused by F.conv3d; hopefully the next version will fix it.
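A minimal sketch of the im2col idea for the local case, assuming a 7 x 7 attention map with dilation 2 per pixel. The function name `local_attend_unfold` is hypothetical, and this is not necessarily how the Caffe code or any branch of this repo does it. Because every output pixel is written exactly once, no col2im (Fold) step is needed in this particular sketch.

```python
import torch
import torch.nn.functional as F


def local_attend_unfold(x, att, kernel=7, dilation=2):
    """x: (N, C, H, W) features; att: (N, H*W, kernel, kernel), one map per pixel."""
    n, c, h, w = x.shape
    pad = dilation * (kernel - 1) // 2  # keeps the output at H x W
    # im2col: one dilated kernel x kernel neighbourhood per pixel.
    cols = F.unfold(x, kernel_size=kernel, dilation=dilation, padding=pad)
    cols = cols.view(n, c, kernel * kernel, h * w)
    # (N, H*W, k, k) -> (N, 1, k*k, H*W) so the same map broadcasts over channels.
    weights = att.reshape(n, h * w, kernel * kernel).permute(0, 2, 1).unsqueeze(1)
    # Weighted sum over the neighbourhood, then back to an image layout.
    return (cols * weights).sum(dim=2).view(n, c, h, w)


x = torch.randn(2, 3, 5, 4)
att = torch.softmax(torch.randn(2, 5 * 4, 49), dim=-1).view(2, 5 * 4, 7, 7)
print(local_attend_unfold(x, att).shape)
```

Here the unfolded tensor stores only the 7 * 7 sampled positions per pixel instead of a full 13 * 13 patch.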

Ugness commented 5 years ago

Thanks. I will also try to convert the conv3d operation into a combination of matrix multiplications.

Ugness commented 5 years ago

@Sucran I think I can improve my model soon. There was no function like torch.nn.Fold in PyTorch 0.4.0 when I started this project. Now I have found the function I need. Thanks.

Sucran commented 5 years ago

Oh, really? Amazing! @Ugness You are such a genius. Looking forward to your new version. Thanks for your work, again.

Ugness commented 5 years ago

Hi @Sucran, I implemented new logic! You can check it at https://github.com/Ugness/PiCANet-Implementation/tree/Fold_Unfold. Now you can train the PiCANet model with batch size 1 using 3.5 GB of VRAM. I just started my training code, so I'll report the training result around next week!

Looks like it works!

Sucran commented 5 years ago

@Ugness So happy that it works! I checked the Fold_Unfold branch, and the memory leak problem seems to be gone. VRAM usage is also lower, so the batch size can be increased, though not to 10. I will check the channel settings of each layer against the author's Caffe version; maybe some misunderstanding still exists.

Ugness commented 5 years ago

@Sucran Thanks a lot for your interest. It led to a lot of improvement. Training speed also seems to have improved. About the version of the code: the Fold_Unfold version is a branch of origin (3*3 conv), not the Adjusted (1*1 conv) one. I am training this code with 3*3 conv, batch_size 4. I am going to close this issue after reporting the result. If you find errors or need help, please open another issue. :)

Sucran commented 5 years ago

@Ugness Ok. Thanks for your work again. It is my pleasure.

Sucran commented 5 years ago

@Ugness Anything new?

Ugness commented 5 years ago

One of my models got about 88 on F-measure with 200 samples of DUTS-TE, where the model in the paper scored 87. So I am now measuring the score on all of DUTS-TE, for all checkpoints, which takes a while.

I can confirm that the new model (with bigger batch_size) performs much better. I think I can update the repo on Sunday or next Monday.

Ugness commented 5 years ago

I updated and merged branch.

Sucran commented 5 years ago

@Ugness So the result is from the branch of origin (3*3 conv), not the Adjusted (1*1 conv) one? It seems to exceed the performance of the author's version? Does the curve you plotted correspond to training or validation?

Ugness commented 5 years ago

No, it's the adjusted one. I used 1*1 conv. Yes, it seems to perform better. The curve is validation.

I think I need to check all of the code thoroughly. Maybe there is something wrong.

Sucran commented 5 years ago

@Ugness Hi, I am trying to reproduce your result, but I am confused about how to compute the metrics you reported. I have trained model weights, but which file contains the test code?

Ugness commented 5 years ago

You can check the measuring code in pytorch/measure_test.py. It will report the result on tensorboard, and you can download csv from tensorboard.

Sucran commented 5 years ago

Hi @Ugness, have you checked your test code for computing Max F_b and MAE? I think there are problems here. 1) MAE is computed differently from MSE_Loss: it is torch.abs(y_pred - y).mean(). 2) I am not familiar with scikit-learn; maybe its pr_curve computation is more efficient than a handcrafted one, but I got a different result. I referred to the code of AceCoooool/DSS-pytorch, and I think the problem may be here. Using the trained model 36epo_383000step.ckpt, I got a Max F_b of 0.877 with your code but 0.750 with AceCoooool's code.

Ugness commented 5 years ago

Oops. I found that MSE and MAE are not the same. It's my mistake. I'll fix it.
I'll check how scikit-learn measures the F-beta score. I used thresholds to measure the F-beta score; maybe that was wrong.

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html#sklearn.metrics.precision_recall_curve

For example, if threshold=0.7 and the predicted value is 0.8, I set 0.8 to 1, just as when making a PR curve. Then I took the max F score over the whole threshold space. If I had not used thresholds, maybe it would have scored 0.75 as you got. Thank you for your comment. I also think 0.877 is strange, because my attention map was different from the author's.
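To make the MSE/MAE mix-up and the thresholding step concrete, here is a tiny sketch on made-up values; the tensors below are illustrative only and do not come from the model.

```python
import torch

# Made-up predictions and ground-truth mask for four pixels.
pred = torch.tensor([0.8, 0.1, 0.6, 0.3])
mask = torch.tensor([1.0, 0.0, 1.0, 0.0])

mae = torch.abs(pred - mask).mean()   # mean absolute error
mse = ((pred - mask) ** 2).mean()     # mean squared error, what an MSE loss computes

# Binarizing at a single threshold, as done for each point of a PR curve:
# with threshold 0.7, the value 0.8 becomes 1 and everything else becomes 0.
binary = (pred >= 0.7).float()

print(f"MAE={mae.item():.3f}, MSE={mse.item():.3f}, binary={binary.tolist()}")
```

Since the errors are below 1, squaring shrinks them, so reporting MSE in place of MAE makes the number look better than it is.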

Sucran commented 5 years ago

@Ugness I do not think the scikit-learn API provides a correct way to compute max F-beta, but you can refer to Chapter 3.2 of the paper "Salient Object Detection: A Survey". Usually, we have a fixed threshold that sweeps from 0 to 255 for binarizing the saliency map to compute precision and recall. F-beta is computed from the average precision and average recall over all images. Then we pick the maximum as max F-beta.
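The procedure described above could be sketched like this. It is a vectorized illustration for a small stack of images, not code from either repo: the function name `max_f_beta` is hypothetical, `beta_square=0.3` is assumed as the commonly used saliency setting, and the small epsilon is an assumption to avoid 0/0.

```python
import torch


def max_f_beta(preds, masks, beta_square=0.3, num_thresholds=256):
    """preds, masks: (num_images, H, W) floats in [0, 1]; masks are binary.

    Returns (max F-beta, threshold at which it is reached).
    """
    thresholds = torch.linspace(0, 1, num_thresholds)
    # (T, num_images, H, W): binarized predictions, one slice per threshold.
    binary = (preds.unsqueeze(0) >= thresholds.view(-1, 1, 1, 1)).float()
    tp = (binary * masks.unsqueeze(0)).sum(dim=(2, 3))  # (T, num_images)
    eps = 1e-10  # avoids 0/0 when nothing is predicted or labelled positive
    # Per-image precision/recall, then averaged over images for each threshold.
    precision = ((tp + eps) / (binary.sum(dim=(2, 3)) + eps)).mean(dim=1)
    recall = ((tp + eps) / (masks.sum(dim=(1, 2)) + eps)).mean(dim=1)
    f = (1 + beta_square) * precision * recall / (beta_square * precision + recall)
    best = torch.argmax(f)
    return f[best].item(), thresholds[best].item()


masks = torch.zeros(2, 4, 4)
masks[:, :2, :2] = 1
score, threshold = max_f_beta(masks.clone(), masks)
print(score, threshold)
```

For a full test set the (T, num_images, H, W) tensor gets large, so in practice one would accumulate precision and recall batch by batch and only average at the end.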

Ugness commented 5 years ago

Thanks. I'll check the paper.

Ugness commented 5 years ago

I found that my threshold is not same as Survey Chapter 4.2. I am going to re-measure the f score as soon as possible. Thanks.

Ugness commented 5 years ago

I found that sklearn uses finer threshold bins than 0 to 255: it uses every distinct pixel value as a threshold. So I used the evaluation code in the DSS repo and modified it a little because it produced NaN.

        for i in range(256):
            y_temp = (pred >= thlist[i]).float()
            tp = (y_temp * mask).sum()
            # small epsilon avoids 0/0 (NaN) when no pixel is predicted positive
            prec[i] = (tp + 1e-10) / (y_temp.sum() + 1e-10)
            recall[i] = (tp + 1e-10) / (mask.sum() + 1e-10)
        f_score = (1 + beta_square) * prec * recall / (beta_square * prec + recall)
        print(torch.max(f_score))

The upper one is sklearn, and the lower one is the DSS repo's evaluation method. I need to wait a while to get the full result, but there seems to be no big difference between sklearn's and the DSS repo's.

Sucran commented 5 years ago

@Ugness Sorry, I found my test code had some mistake. I will report a new result these days of your model.

Ugness commented 5 years ago

It's okay. Thanks.

Sucran commented 5 years ago

I fixed my test code and got the following results for 36epo_383000step.ckpt on DUTS-TE: average MAE 0.0757, max F-measure 0.7803. With denseCRF, the results are average MAE 0.0639, max F-measure 0.7886. All DUTS-TE test images are resized to 224*224 because of the dimension limit of the attention module. I hope you can check these numbers.

Ugness commented 5 years ago

Can I get a code snippet from your test code? Also, can I get the threshold value at which you got the max F-measure score?

Ugness commented 5 years ago

Hi, I uploaded result csv on https://drive.google.com/drive/u/0/folders/1A9qXGuvtqwSY0mEc5hbC-4b7ix8fLyfA

Sucran commented 5 years ago

the test code:

    def eval_mae(self, y_pred, y):
        return torch.abs(y_pred - y).mean()

    def eval_pr(self, y_pred, y, num=100):
        prec, recall = torch.zeros(num), torch.zeros(num)
        thlist = torch.linspace(0, 1 - 1e-10, num)
        for i in range(num):
            y_temp = (y_pred >= thlist[i]).float()
            tp = (y_temp * y).sum()
            prec[i], recall[i] = tp / (y_temp.sum() + 1e-20), tp / y.sum()
        return prec, recall

    def test(self, use_crf=False):
        if use_crf: from libs.dense_crf import crf
        avg_mae, img_num = 0.0, len(self.test_loader.dataset)
        avg_prec, avg_recall = torch.zeros(100), torch.zeros(100)
        self.net.eval()
        with torch.no_grad():
            for i, data_batch in enumerate(self.test_loader):
                images, labels = data_batch['image'], data_batch['label']
                images, labels = images.to('cuda'), labels.to('cuda')
                shape = labels.size()[2:]
                new_shape = (shape[0] // 32) * 32, (shape[1] // 32) * 32
                inputs = F.interpolate(images, size=new_shape, mode='bilinear', align_corners=True)
                prob_pred = self.net(inputs)
                prob_pred = torch.mean(torch.cat([prob_pred[i] for i in self.net.select], dim=1), dim=1, keepdim=True)
                prob_pred = F.interpolate(prob_pred, size=shape, mode='bilinear', align_corners=True).to('cpu')
                if use_crf:
                    prob_pred = crf(images, prob_pred.numpy(), to_tensor=True)
                labels, prob_pred = labels.to('cpu'), prob_pred.to('cpu')
                mae = self.eval_mae(prob_pred, labels)
                prec, recall = self.eval_pr(prob_pred, labels)
                print("[%d] mae: %.4f" % (i, mae))

                avg_mae += mae
                avg_prec, avg_recall = avg_prec + prec, avg_recall + recall
        avg_mae, avg_prec, avg_recall = avg_mae / img_num, avg_prec / img_num, avg_recall / img_num
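        # F-beta from the averaged precision/recall; as written, beta^2 = 0.3 ** 2 = 0.09
        # (the value commonly used for saliency evaluation is beta^2 = 0.3)
        score = (1 + (0.3) ** 2) * avg_prec * avg_recall / ((0.3) ** 2 * avg_prec + avg_recall)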
        score[score != score] = 0  # delete the nan
        print('average mae: %.4f, max fmeasure: %.4f' % (avg_mae, score.max()))

the crf code ref to AceCoooool/DSS-pytorch

Ugness commented 5 years ago

I think F.interpolate() made the difference. I simply resize every image to 224*224 when loading it from the dataset, without maintaining the aspect ratio. If that resizing method is wrong, I think I need to train the model again with your resizing method. Please let me know what you think. Thanks.

Sucran commented 5 years ago

Yes. For your network, I set the resize transformation to 224*224 in the dataset. This code is for my own network, which accepts any input size. So I think the resizing has no effect when I test your model. The BIG difference is the way precision and recall are computed; you can check it. The MAE is also strange. Did you change your MAE code? Last time you said you had made a mistake there.

Ugness commented 5 years ago

Yes, I corrected MAE. Oh, I see the difference in precision and recall now. Sorry. I'll test it again.

Sucran commented 5 years ago

@Ugness Sorry, I have been busy with my own things these days. Have you tested it again and determined which is the right answer? Actually, I still cannot run the code for a complete training phase, since RAM gets eaten up on my machine. The model I tested was downloaded from your link; I'm afraid it is not the newest one. Can you upload a new model for testing?

Ugness commented 5 years ago

Sorry. I have now started testing my code. I'll report the result ASAP.

Ugness commented 5 years ago

It scored 0.8546 on max F-measure, with MAE 0.05321.

Sucran commented 5 years ago

@Ugness It may be my mistake, could you tell me the model download link and the corresponding code of model definition?

Ugness commented 5 years ago

https://drive.google.com/drive/folders/1A9qXGuvtqwSY0mEc5hbC-4b7ix8fLyfA

I think you tested with the latest model, and there has been no update to the model definition since Oct. 21. I'm still not sure how to calculate the F-score correctly. I'll lay out my procedure; please check it.

For each threshold in linspace(0, 1, 256)
1. get prediction from model, mask from data.
2. calculate precision and recall for each image.
3. make average precision and average recall over all data.
4. calculate f_score with avg precision and avg recall.
5. pick the maximum f_score over all threshold.

Corresponding code of measuring F-score is here.

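for model in models:
    # assumed setup (not shown in the original snippet): reset accumulators per checkpoint
    mae, preds, masks, precs, recalls = 0.0, [], [], [], []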
    for i, batch in enumerate(dataloader):
        img = batch['image'].to(device)
        mask = batch['mask'].to(device)
        with torch.no_grad():
            pred, loss = model(img, mask)
        pred = pred[5].data
        mae += torch.mean(torch.abs(pred - mask))
        pred = pred.requires_grad_(False)
        preds.append(pred)
        masks.append(mask)
        prec, recall = torch.zeros(mask.shape[0], 256), torch.zeros(mask.shape[0], 256)
        pred = pred.squeeze(dim=1).cpu()
        mask = mask.squeeze(dim=1).cpu()
        thlist = torch.linspace(0, 1 - 1e-10, 256)
        for j in range(256):
            y_temp = (pred >= thlist[j]).float()
            tp = (y_temp * mask).sum(dim=-1).sum(dim=-1)
            # small epsilon avoids 0/0 (NaN) when no pixel is predicted positive
            prec[:, j] = (tp + 1e-10) / (y_temp.sum(dim=-1).sum(dim=-1) + 1e-10)
            recall[:, j] = (tp + 1e-10) / (mask.sum(dim=-1).sum(dim=-1) + 1e-10)
        # (batch, threshold)
        precs.append(prec)
        recalls.append(recall)

    prec = torch.cat(precs, dim=0).mean(dim=0)
    recall = torch.cat(recalls, dim=0).mean(dim=0)
    f_score = (1 + beta_square) * prec * recall / (beta_square * prec + recall)
    thlist = torch.linspace(0, 1 - 1e-10, 256)
    writer.add_scalar("Max F_score", torch.max(f_score),
                      global_step=int(model_name.split('epo_')[1].split('step')[0]))
    writer.add_scalar("Max_F_threshold", thlist[torch.argmax(f_score)],
                      global_step=int(model_name.split('epo_')[1].split('step')[0]))

Ugness commented 5 years ago

And about your memory problem, how much VRAM and RAM do you have? Where does the RAM problem occur? RAM or VRAM?

Sucran commented 5 years ago

@Ugness I think the F-score procedure you showed is correct. It is almost the same as what I posted 7 days ago, right? I set the number of thresholds to 100 and you set it to 256, which should not cause much difference, but the result is still 0.854 when you test?
The strangest thing to me is the difference in MAE results. I always got a value of 0.65 but you got 0.54, and we tested with the same code. Oh man, it is weird!

Ugness commented 5 years ago

I also think it is strange. And I have a few questions to compare our results.

1. Did you use all of DUTS-TE for testing?
2. How many images are in your DUTS-TE folder? There were a few mismatched files in DUTS-TE that should be deleted.
3. Can you give me the threshold value you used for the score 0.7803?
4. Did you explicitly round (convert to binary image) the mask images in DUTS-TE?
5. Did you use a GPU for testing? (Did you use cuda mode with a single GPU?)

Thanks for sharing your results. It improves this project a lot.

Sucran commented 5 years ago

1. All of DUTS-TE.
2. I found this problem, but these files do not cause a big difference.
3. No, I did not print this threshold. I will test it again.
4. The mask images are converted into the range [0,1]; this is done automatically.
5. Yes, a single GPU for testing, in cuda mode.

Ugness commented 5 years ago

Thank you for answering. For 3., if you want, can you test with my threshold? I already have it: it is 0.6627.

Ugness / PiCANet-Implementation

Have you met memory leak problem when running model？ #9
