Open HumanZhong opened 3 years ago
I've sent an email to the authors and received their reply. To help researchers who want to reimplement the model, here are their responses:
I've also attempted to reimplement the vision part of this model, but achieved a much worse result: my vision model's average precision is about 85+%, while the paper reports 88+%.
If anyone else has tried to reimplement it, or has achieved higher performance, please leave a message here describing how you did it. That would be a great help; big thanks in advance.
I used parallel attention for the vision model, but still can't reach the 88.8% mentioned in the paper. 😅
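For anyone unfamiliar with the term, parallel attention (as used in SRN) decodes all character positions at once by letting a set of learned positional queries attend over the visual feature map, instead of decoding autoregressively. A minimal PyTorch sketch; the dimensions, head count, and max length are my assumptions, not the paper's:

```python
import torch
import torch.nn as nn

class ParallelAttention(nn.Module):
    """All T character slots attend to the visual features in one shot.
    d_model, max_len, and num_heads are illustrative assumptions."""
    def __init__(self, d_model=512, max_len=26, num_heads=8):
        super().__init__()
        # One learned query per output character position.
        self.pos_queries = nn.Parameter(torch.randn(max_len, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, visual_feats):
        # visual_feats: (B, H*W, d_model) flattened feature map
        b = visual_feats.size(0)
        q = self.pos_queries.unsqueeze(0).expand(b, -1, -1)  # (B, T, d_model)
        out, _ = self.attn(q, visual_feats, visual_feats)    # (B, T, d_model)
        return out

feats = torch.randn(2, 8 * 32, 512)  # e.g. an 8x32 feature map, flattened
dec = ParallelAttention()
glyphs = dec(feats)                  # (2, 26, 512), one vector per character slot
```

A linear classifier over `glyphs` would then predict all characters simultaneously.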
What accuracy are you getting now? Do you use a Transformer as the sequence modeling layer, and do you use data augmentation? We'll release our code in the next week or so.
Thanks for your great work. I'm having some trouble re-implementing the vision part of your work, and I'd like to ask for more experimental details, if possible.
I noticed that SRN uses ResNet-50 as its backbone, while in ABINet you chose a much more lightweight backbone with only 5 residual blocks (it looks like a ResNet-18 or even lighter) for feature extraction (according to the footnote in your arXiv paper), and you still achieved comparable results. Could you provide the detailed structure of your ResNet backbone as well as the mini-UNet structure? Also, could you share the configurations of your SV (small vision), MV (medium vision), and LV (large vision) models?
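Since the exact backbone is not spelled out, here is a hypothetical 5-block backbone built from ResNet-18 style basic blocks; all channel widths, strides, and the input size are guesses for illustration, not the authors' configuration:

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """ResNet-18 style basic block: two 3x3 convs plus an identity/projection shortcut."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c_out, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(c_out)
        self.conv2 = nn.Conv2d(c_out, c_out, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(c_out)
        self.short = (nn.Identity() if stride == 1 and c_in == c_out
                      else nn.Sequential(nn.Conv2d(c_in, c_out, 1, stride, bias=False),
                                         nn.BatchNorm2d(c_out)))

    def forward(self, x):
        y = torch.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return torch.relu(y + self.short(x))

# Hypothetical 5-block backbone; widths and strides are assumptions.
backbone = nn.Sequential(
    nn.Conv2d(3, 32, 3, 1, 1, bias=False), nn.BatchNorm2d(32), nn.ReLU(),
    BasicBlock(32, 64, stride=2),
    BasicBlock(64, 128, stride=2),
    BasicBlock(128, 256, stride=2),
    BasicBlock(256, 512, stride=1),
    BasicBlock(512, 512, stride=1),
)

x = torch.randn(2, 3, 32, 128)  # a common scene-text input size (assumed)
feat = backbone(x)              # (2, 512, 4, 16) after three stride-2 stages
```

With three stride-2 stages, a 32x128 input ends up as a 4x16 feature map, which would then be flattened for the attention stage.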
Are the positional encoding and the order embedding (used as Q in the attention) hard-coded or learned? Do different encoding methods affect the performance much?
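For context, the two usual choices are a fixed (hard-coded) sinusoidal encoding and a learned embedding table; a quick sketch of both, where the length 26 and width 512 are assumed values:

```python
import math
import torch
import torch.nn as nn

def sinusoidal_encoding(max_len, d_model):
    """Fixed sine/cosine positional encoding from the original Transformer paper."""
    pos = torch.arange(max_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dims: sine
    pe[:, 1::2] = torch.cos(pos * div)   # odd dims: cosine
    return pe

# Learned alternative: a trainable embedding table of the same shape.
learned_pe = nn.Embedding(26, 512)

fixed = sinusoidal_encoding(26, 512)     # (26, 512), never updated
learned = learned_pe(torch.arange(26))   # (26, 512), updated by backprop
```

The fixed version generalizes to unseen lengths; the learned version can adapt to the data but is tied to the training length.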
Could you provide the detailed parameters of your augmentation methods? And how much does data augmentation affect the performance?
Approximately how long does it take for your model to converge on 4x 1080 Ti GPUs?
Thanks again for your work; looking forward to your reply.