facebookresearch / mae

PyTorch implementation of MAE: https://arxiv.org/abs/2111.06377

[Question] Reconstructed pixels are discontinuous at patch boundaries. #153

Open LeroyChou opened 1 year ago

LeroyChou commented 1 year ago

Hi, thank you for this great work!

I tested the code on my custom dataset and found that the pixels of the reconstructed image (before pasting the visible patches onto it) are not continuous at the patch boundaries, so the image looks like it is covered by a grid (e.g. 14x14 = 196 cells). May I know how to solve this?
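
For context, my visualization roughly follows the repo's demo notebook; here is a minimal sketch of what I do, assuming `model` is the pre-trained MAE and `x` is a normalized input batch (the variable names are my own):

```python
import torch

# x: normalized input images (N, 3, H, W); model: pre-trained MAE
loss, pred, mask = model(x, mask_ratio=0.75)
rec = model.unpatchify(pred)                       # full reconstruction (N, 3, H, W)

# expand the patch-level mask (1 = masked, 0 = visible) to pixel level
p = model.patch_embed.patch_size[0]
mask = mask.unsqueeze(-1).repeat(1, 1, p * p * 3)  # (N, L, p*p*3)
mask = model.unpatchify(mask)                      # (N, 3, H, W)

# `rec` is the image I look at before pasting -- this is where the grid appears
im_paste = x * (1 - mask) + rec * mask             # keep visible pixels from the input
```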

Kind regards.

stha-prashant commented 1 year ago

@LeroyChou, I'm facing the same problem on my custom dataset, did you figure out how to solve it?

daisukelab commented 1 year ago

@LeroyChou @stha-prashant Are you pre-training with --norm_pix_loss option? If so, try without using it.

LeroyChou commented 12 months ago

> @LeroyChou, I'm facing the same problem on my custom dataset, did you figure out how to solve it?

Not really. I tried several methods, but none of them helped. However, if you spend more time on training, the reconstructed images do get better.

LeroyChou commented 12 months ago

> @LeroyChou @stha-prashant Are you pre-training with --norm_pix_loss option? If so, try without using it.

Thank you for replying. But I didn't train with --norm_pix_loss.

daisukelab commented 12 months ago

@LeroyChou Then the cause might be something different.

FYI, let me share my results; I guess yours might look different from ours. We have confirmed MAE reconstruction of audio spectrograms in two of our repos: one doesn't use the option, and the other does.

When not using norm_pix_loss, we see a clean reconstruction. --> https://github.com/nttcslab/msm-mae/blob/main/misc/Note_visualization.ipynb

[image: spectrogram reconstruction without norm_pix_loss]

(The upper row is the input, the lower is the reconstruction. The white squares are visible patches, and the non-white square patches are the reconstructed ones.)

When using norm_pix_loss, we see that patches are discontinuous at their edges, simply because MAE has learned from normalized patches. Please note that each patch is normalized independently, which is what produces the discontinuous edges (a sketch of that normalization follows the image below). --> https://github.com/nttcslab/m2d/blob/master/Note_viz_msm_mae.ipynb

[image: spectrogram reconstruction with norm_pix_loss]
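
For reference, the per-patch target normalization looks roughly like this (a sketch paraphrasing forward_loss in models_mae.py). Because each target patch is shifted and scaled by its own statistics, the model never learns the per-patch mean and scale, so the raw predictions show seams at patch boundaries:

```python
# Paraphrased sketch of forward_loss(imgs, pred, mask) in models_mae.py
target = self.patchify(imgs)                      # (N, L, p*p*3)
if self.norm_pix_loss:
    mean = target.mean(dim=-1, keepdim=True)      # per-patch mean
    var = target.var(dim=-1, keepdim=True)        # per-patch variance
    target = (target - mean) / (var + 1.e-6) ** .5

loss = (pred - target) ** 2
loss = loss.mean(dim=-1)                          # mean loss per patch
loss = (loss * mask).sum() / mask.sum()           # only masked patches contribute
```
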
LeroyChou commented 12 months ago

@daisukelab Your work appears to be quite intriguing!

In fact, our situation differs slightly from yours. Initially, I trained the model using the --norm_pix_loss option, but the reconstructed image was unsatisfactory -- it was filled with colored noise. I did not save the image, but it resembled this:

[image: example of colored noise]

The second time, I trained without the --norm_pix_loss option. The colored noise disappeared, but the discontinuity at the patch boundaries became apparent.

I am not well-versed in the field of audio spectrograms. May I ask whether you have ever encountered colored noise? What is the range of your audio spectrogram data?

daisukelab commented 12 months ago

@LeroyChou No, I have not experienced the colored-noise issue, at least with a model that has finished pre-training. I have also tested on ImageNet (with norm_pix_loss) without running into related issues. My spectrograms are standardized to N(0, 1), with most values in the [-1, 1] range.
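
As an illustration, our standardization is just a dataset-wide affine transform; a minimal sketch with placeholder statistics (not our actual numbers):

```python
import torch

# Hypothetical dataset-level statistics of the log-mel spectrograms,
# precomputed over the whole training set (placeholder values).
DATASET_MEAN = -7.1
DATASET_STD = 4.2

def standardize(lms: torch.Tensor) -> torch.Tensor:
    """Scale a log-mel spectrogram so the dataset becomes roughly N(0, 1)."""
    return (lms - DATASET_MEAN) / DATASET_STD
```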

I vaguely remember that the decoder might output a noise-like image at a very early stage of training.

LeroyChou commented 12 months ago

@daisukelab Thank you for your response. You are correct that the early stages generate noisy images, and as training progresses the noise disappears. It's quite strange that we are experiencing different situations despite using similar models. I have also noticed that others have encountered the same problem with colored noise, as mentioned in this issue: https://github.com/facebookresearch/mae/issues/95. One possibility is that there are significant differences between our raw data, specifically between audio spectrograms and RGB images. Do you have any thoughts?

daisukelab commented 12 months ago

@LeroyChou Nothing comes to mind, though one apparent difference might be that we handle single-channel data instead of the RGB channels of an image. BTW, I found an old reconstruction log from when the model had been trained for one epoch out of 300. It is a blurry image rather than noise. Let me correct my comment above: what I have experienced is that MAE outputs blurry images in the early training epochs rather than noisy ones.

[image: blurry reconstruction after one epoch out of 300]

(Yellow marks the reconstructed patches.)