Vanishing gradient issue in APN

I am trying to re-implement this experiment in PyTorch. However, the weights of the APN (Attention Proposal Network) are not being updated because their gradients are extremely small. I think the issue comes from the logistic function in Eq. (5): its flat regions seem to make the gradients almost zero.
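To illustrate what I mean, here is a minimal sketch in plain Python (not my actual training code). It assumes the logistic function has the form h(x) = 1 / (1 + exp(-k x)) and uses k = 10 as an assumed steepness; the helper name `logistic_grad` and the sample distances are only for illustration. It evaluates the analytic derivative dh/dx = k h(x) (1 - h(x)) at a few distances from a box edge.

```python
import math

def logistic_grad(x, k):
    """Derivative of h(x) = 1 / (1 + exp(-k * x)) with respect to x.

    Uses the equivalent form k * e / (1 + e)^2 with e = exp(-|k x|),
    which is symmetric in x and avoids overflow for large |k x|.
    """
    e = math.exp(-abs(k * x))
    return k * e / (1.0 + e) ** 2

k = 10.0  # assumed steepness of the logistic function
# Distance (in pixels) between a coordinate and the box edge.
distances = (0.0, 0.5, 1.0, 2.0, 5.0)
grads = {d: logistic_grad(d, k) for d in distances}

for d, g in grads.items():
    print(f"distance {d:>4} from edge: dh/dx = {g:.3e}")
```

Under these assumptions the derivative is only noticeable within about one pixel of the edge and decays exponentially beyond that, so most pixels of the mask would contribute almost nothing to the gradient of the box parameters.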

In the paper, the authors pretrain the APN using the features of the last CNN layer. Did you record the performance without this initialization?

Thank you.

Jianlong-Fu / Recurrent-Attention-CNN