facebookresearch / ParlAI

A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
https://parl.ai
MIT License

Image chat project related #2394

Closed · shubhamagarwal92 closed this 4 years ago

shubhamagarwal92 commented 4 years ago

@klshuster, thank you for merging my PR. I have some questions related to the image_chat project:

  1. Please confirm whether the resnext model used in the baseline paper is resnext101_32x48d_wsl.
  2. I couldn't find Faster R-CNN-related code in the repo (used in All-in-One Image-Grounded Conversational Agents), not even in this PR. Are you planning to release the code/R-CNN bottom-up features? Did you use Pythia for this?
  3. Is it possible to release a zip version of the YFCC100M images used in this project? Or could you verify whether this image (hash: ac80c5633d76c27b352ee6352ddbb3.jpg) exists? I also tried to download it manually (same format as this), but got a 404:

[Screenshot: 404 Not Found error for the image download]

  4. Are you planning to release the code for the follow-up work (All-in-One Image-Grounded Conversational Agents and the dodecathlon paper)? Is this sufficient to reproduce all results?
shubhamagarwal92 commented 4 years ago

Another one (404 not found):

wget https://multimedia-commons.s3-us-west-2.amazonaws.com/data/images/ac8/f9f/ac8f9ff369308ac4d3643d3114c6718b.jpg -P data/yfcc_images/

@klshuster Otherwise, could you please let me know where in the code I can ignore these hashes to reproduce the baseline results? I have been trying to use the released pre-trained model and run it in the eval phase as:

python examples/eval_model.py -mf zoo:image_chat/transresnet_multimodal/model -t image_chat --yfcc_path data/yfcc_images/ -dt valid
klshuster commented 4 years ago

Hi @shubhamagarwal92, my apologies for the delayed response.

  1. The model in the paper is a variation of the resnext101_32x48d_wsl model; however, it is not specifically this one.
  2. We do not currently have plans to release the Faster R-CNN code; we did use the Pythia repository for computing these features (specifically this script).
  3. Unfortunately, we do not have plans to release a zip of the images; indeed, the dataset is somewhat fluid in that images can be taken down at any moment. However, given the small number of missing images, the effect on model performance is minimal.
  4. The code you have linked is the model architecture we used in the dodecathlon paper. We plan to release our pre-trained models in the future, along with the remaining tasks that are not yet in ParlAI.

Regarding your question about ignoring the bad hashes: one way to fix this would be to compile the list of missing hashes and then, in the _setup_data function for the teacher (here), iterate through and remove examples with image hashes for which you do not have the image.
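
For concreteness, a minimal sketch of that filter (assuming each loaded example carries an image_hash field and that images sit flat in --yfcc_path as <hash>.jpg, matching the wget command above; the helper name is illustrative, not ParlAI's actual code):

    import os

    def remove_missing_images(data, yfcc_path):
        # Keep only examples whose image file actually exists locally.
        return [
            ex for ex in data
            if os.path.isfile(os.path.join(yfcc_path, ex['image_hash'] + '.jpg'))
        ]

    # at the end of _setup_data:
    #     self.data = remove_missing_images(self.data, self.opt['yfcc_path'])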

Hope that answers your questions.

dexterju27 commented 4 years ago

Hey @shubhamagarwal92! More info on the Faster R-CNN features: we used the Pythia script with their Visual Genome pre-trained ResNeXt-backbone Faster R-CNN. You can download that model from the Pythia GitHub page (see their v0.4 branch for more information if you need it). The difference in performance between ResNeXt- and ResNet-backbone Faster R-CNN is negligible.

shubhamagarwal92 commented 4 years ago

@klshuster Thank you for your detailed response and all the pointers!

I am guessing that even with resnext101_32x48d_wsl I should be able to reproduce almost the same baseline results? Anyway, I can try different models from here.

For the teacher function in agents, would this trick support examples/interactive.py, eval_model.py, train_model.py, and display_model.py? Currently, everything breaks because of missing hashes, and I am not sure which of these APIs call the teacher. Is there any other place where the data is loaded? Thank you again for pointing me to this code. :)

shubhamagarwal92 commented 4 years ago

@dexterju Many thanks! :)

klshuster commented 4 years ago

You should at the very least get results that are no worse than the ResNet152 results listed here.

That trick should solve most of those issues; if you find you're running into another one, please let me know.

shubhamagarwal92 commented 4 years ago

@klshuster I followed your trick and tried to ignore some hashes in agents.

python examples/eval_model.py -mf zoo:image_chat/transresnet_multimodal/model -t image_chat --yfcc_path data/yfcc_images/ -dt valid

However, I think it is not getting called.

        print("Ignoring hash list now")
        ignore_hash_list = self.get_ignore_hash_list(data_path, ignore_hash_list_filename)
        self.data = self.ignore_hash_json(self.data, ignore_hash_list)

Do you have any suggestions? Could you please verify the args for examples/eval_model.py? Am I missing anything?

UPDATE: Sorry, please ignore this comment. I had two copies of the ParlAI repository, and the parlai package was installed (via python setup.py develop in my conda environment) from a different repo than my current working one, so the script was being called from the other repository.

klshuster commented 4 years ago

Could you please paste the exact error you are getting, and perhaps give more context on where you placed the code above?

shubhamagarwal92 commented 4 years ago

Please ignore the above message. I am able to ignore the hashes in the agent as you suggested and successfully run the code in eval mode as:

python examples/eval_model.py -mf zoo:image_chat/transresnet_multimodal/model -t image_chat --yfcc_path data/yfcc_images/ -dt test

This was able to successfully download the pre-trained model:

[ downloading: http://parl.ai/downloads/_models/image_chat/transresnet_multimodal.tgz to /scratch/shubham/projects/image_chat/pvt/data/models/image_chat/transresnet_multimodal/transresnet_multimodal.tgz ]
Downloading transresnet_multimodal.tgz: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.27G/1.27G [01:33<00:00, 13.5MB/s]
unpacking transresnet_multimodal.tgz

Could you please let me know how to interpret these results compared to Table 2 in the arXiv paper:

[ Finished evaluating tasks ['image_chat'] using datatype test ]
{'exs': 29982, 'accuracy': 0.015342538856647322, 'f1': 0.07288192608310763, 'bleu-4': 0.014628667324294827, 'hits@1': 0.015342538856647322, 'hits@5': 0.06610633046494563, 'hits@10': 0.12224001067307051, 'hits@100': 1.0, 'first_round': {'hits@1/100': 0.01351, 'loss': -1.0, 'med_rank': 48.0}, 'second_round': {'hits@1/100': 0.01571, 'loss': -1.0, 'med_rank': 46.0}, 'third_round+': {'hits@1/100': 0.01681, 'loss': -1.0, 'med_rank': 47.0}}

Thanks again. Sorry for troubling you with all the questions.

klshuster commented 4 years ago

Hope that helps answer your questions.

stephenroller commented 4 years ago

Thanks for asking the questions @shubhamagarwal92, it's great for the community to have this sort of clarification.

Closing since everything looks finished here, but don't hesitate to follow up.

shubhamagarwal92 commented 4 years ago

Hi @stephenroller @klshuster

Could you please let me know the hyperparameters used for the baseline models?

I used the following command to train the model:

python parlai/scripts/train_model.py \
-m projects:image_chat:transresnet_multimodal \
-t image_chat \
--yfcc_path ${YFCC_DIR} \
-bs 512 \
-mf ${MODEL_SAVE_DIR} > ${MODEL_SAVE_DIR}/logs.txt

Please see the logs here: logs.txt

The default model seems to use 2 layers and 2 heads as specified in the attached logs.

I also tried to reproduce the results with the command:

python examples/eval_model.py -mf zoo:image_chat/transresnet_multimodal/model -t image_chat --yfcc_path data/yfcc_images/ -dt test

Is this the reason the results reported in my previous comment were Turn 1 R@1 = 1.3, Turn 2 R@1 = 1.57, and Turn 3 R@1 = 1.6?

Thanks.

klshuster commented 4 years ago

Hi @shubhamagarwal92

The hyperparameters for the reference model can be found in data/models/image_chat/transresnet_multimodal/model.opt; however, I will paste the relevant ones (i.e., those that may differ from their default values) below to save you the trouble of parsing that JSON dict:

--n-layers 4 \
--embedding-size 300 \
--ffn-size 1200 \ 
--relu-dropout 0.2 \
--n-heads 6 \
--n-positions 1000 \
--variant aiayn \
--activation relu \
--truncate 64 \
--hidden-dim 500 \
--num-layers-all 2 \
--learningrate 0.005 \
--additional-layer-dropout 0.2 \
--validation-patience 10 \
--validation-every-n-epochs 1 \
--image-mode resnet152

Additionally, due to a slight bug in ParlAI's parameter setup, you'll need to specify --image-mode resnet152 when evaluating the pre-trained model; that should yield the appropriate results (without it, the model sees no images).
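
For example, the eval command from earlier in this thread becomes:

python examples/eval_model.py -mf zoo:image_chat/transresnet_multimodal/model -t image_chat --yfcc_path data/yfcc_images/ -dt test --image-mode resnet152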

shubhamagarwal92 commented 4 years ago

Hi @klshuster

Thank you for clarifying and sharing the hyperparams! Indeed, after setting the --image-mode flag, I am able to reproduce the baseline results. :)

I want to confirm a few other points:

  1. Is there any difference between using the model file (-mf) from zoo vs models in eval mode, i.e. -mf zoo:image_chat/transresnet_multimodal/model vs -mf models:image_chat/transresnet_multimodal/model?
  2. For train_model.py, am I using the correct flag, -m projects:image_chat:transresnet_multimodal? (I couldn't find an image_chat agent in the parlai.agents folder.)
  3. num_epochs is set to -1 in model.opt. What stopping criterion is used? Do you have an estimate of how many epochs the models were trained for?

Some ParlAI related questions:

  1. Is it possible to use multiple GPUs or specific gpu_ids while training (e.g. GPUs 2 and 3)? I couldn't find any gpu arg in model.opt. If I want to add this argument, what is the right place? Like here?
  2. If I want to extend with my own agent, should it be done in a projects.image_chat.my_model directory, where I can have something like:
    
    # same imports as /projects/image_chat/transresnet_multimodal/transresnet_multimodal.py
    class MyAgent(TransresnetMultimodalAgent):
        ...

and call it as `-m projects:image_chat:my_model`?
klshuster commented 4 years ago

  1. No difference, zoo and models return the exact same thing (it just depends on preference 😄).
  2. Yes, I believe that is the right agent.
  3. As indicated in my previous comment, we set --validation-patience 10 --validation-every-n-epochs 1, i.e., if validation accuracy does not improve for 10 epochs, we stop training.
  4. Here is one place where we add a gpu arg to specify devices; you can add that to the agent where you specified if you like (see the sketch after this list).
  5. You might be better off implementing it either in your own projects directory (e.g. projects:new_project:my_model) or in the parlai agents directory (where you can then just specify -m my_model).
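
A hedged sketch of item 4 (the flag name, default, and device handling are illustrative, not ParlAI's actual code):

    import torch
    from projects.image_chat.transresnet_multimodal.transresnet_multimodal import (
        TransresnetMultimodalAgent,
    )

    class MyAgent(TransresnetMultimodalAgent):
        @classmethod
        def add_cmdline_args(cls, argparser):
            super().add_cmdline_args(argparser)
            agent = argparser.add_argument_group('MyAgent Arguments')
            agent.add_argument('--gpu-id', type=int, default=-1, help='GPU id, -1 for CPU')

        def __init__(self, opt, shared=None):
            # Pin torch to the requested device before the model is built.
            if opt.get('gpu_id', -1) >= 0 and torch.cuda.is_available():
                torch.cuda.set_device(opt['gpu_id'])
            super().__init__(opt, shared)

Alternatively, restricting visible devices at the shell level (e.g. CUDA_VISIBLE_DEVICES=2,3 python parlai/scripts/train_model.py ...) works without touching agent code.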
shubhamagarwal92 commented 4 years ago

Thanks a lot for all the help! :)

shubhamagarwal92 commented 4 years ago

Hi @klshuster,

Hope you are holding up well! Sorry to bother you again.

Even though the pre-trained model reproduces the results in eval mode, the training command still cannot replicate them. PFA both eval_logs.txt and train_logs.txt.

I tried the --image-mode flag as well as all the hyperparams you suggested. Also, PFA the command:

export MODEL_SAVE_DIR=${MODEL_DIR}/reproduce/

python parlai/scripts/train_model.py \
-m projects:image_chat:transresnet_multimodal \
-t image_chat \
--yfcc_path ${YFCC_DIR} \
-bs 256 \
--image-mode resnet152 \
--n-layers 4 \
--embedding-size 300 \
--ffn-size 1200 \
--relu-dropout 0.2 \
--n-heads 6 \
--n-positions 1000 \
--variant aiayn \
--activation relu \
--truncate 64 \
--hidden-dim 500 \
--num-layers-all 2 \
--learningrate 0.005 \
--additional-layer-dropout 0.2 \
--validation-patience 10 \
--validation-every-n-epochs 1 \
-mf ${MODEL_SAVE_DIR}/basic_model > ${MODEL_SAVE_DIR}/train_logs.txt

Could you please suggest what I am missing?

PS. A suggestion for the general ParlAI documentation about the naming convention:

a. If we want to create our own agent in the agents directory, the class must be named exactly MyModelNameAgent in agents/my_model_name/my_model_name.py (with a strict directory structure).

b. However, if we want to create it as projects:new_project:my_model_name, we have to follow the directory structure projects/new_project/my_model_name/my_model_name.py, with the class named exactly MyModelNameAgent.

The loader matches on this exact naming convention here; even capitalization in the class name, such as MYModelNAmeAgent, could mess things up. This should be explicit in the parrot example (see the sketch below).
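
For concreteness, a minimal sketch of the strict convention (a hypothetical agent, just to illustrate the path/class pairing):

    # parlai/agents/my_model_name/my_model_name.py
    from parlai.core.agents import Agent

    class MyModelNameAgent(Agent):
        # The loader camel-cases "my_model_name" and appends "Agent",
        # so this exact class name is what `-m my_model_name` resolves to.
        ...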

Thanks again for your help!

klshuster commented 4 years ago

RE: training...

One important thing to note is that the context and candidate encoders in the pre-trained model were themselves pre-trained (see section 4.1 in the paper for more discussion about the "Dialogue Encoder"). We do not currently have plans to release these specific pre-trained encoders, though there are a number of other pre-trained Transformer encoders you can find in the ParlAI Model zoo.

Your notes on the naming convention are accurate, and the docs should probably reflect them. Note that you can always specify the agent class explicitly by typing -m my_model_name:MyModelNameAgent or -m projects:new_project:my_model_name:MYModelNAmeAgent, which allows you to name your Agent whatever you'd like 😄

shubhamagarwal92 commented 4 years ago

@klshuster

Thanks for your reply. The difference in performance is too stark:

After training the model, results on test set:

'accuracy': 0.0105396571276099, 'f1': 0.05519516050648507, 'bleu-4': 0.01001851816223969, 'hits@1': 0.0105396571276099, 'hits@5': 0.05203121873123874, 'hits@10': 0.10069374958308318, 'hits@100': 1.0

For running it only in evaluation mode (on test):

'accuracy': 0.4058435061036622, 'f1': 0.44580235298830806, 'bleu-4': 0.39437274252183546, 'hits@1': 0.4058435061036622, 'hits@5': 0.6724701487559203, 'hits@10': 0.779100793809619, 'hits@100': 1.0

Hits@5 is 5.2 when training from scratch, compared to 67.2 when directly using the pre-trained model in eval mode. Is there any way to train a model and get results in a decent ballpark, say at least 65 for Hits@5?

Do you think I am still missing an argument needed to reproduce the results when training? Could you suggest a pre-trained encoder from the zoo and specify the flags to pass it to the model?

Thanks.

klshuster commented 4 years ago

A couple of ideas that may help you get better results:

  1. Perhaps try a small sweep over the learning rate; given that your encoders are randomly initialized, it might be good to try larger learning rates.
  2. Try toying a bit with the model size, varying the number of layers/attention heads.
shubhamagarwal92 commented 4 years ago

@klshuster Thanks for the suggestions. I am already trying different hyperparams, but it seems like a big gap to cover just through hyperparameter optimization.

But as you suggested earlier, could you show how to use a pre-trained encoder from the zoo in ParlAI?

klshuster commented 4 years ago

Any transformer-based model in the zoo would work with these encoders; you would just need to massage the state dicts and load them accordingly (I do not have specific steps).
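
A minimal sketch of that state-dict massaging (the checkpoint paths, the 'model' key, and the key prefixes are placeholders; print both state dicts' keys to build the real mapping):

    import torch

    # Load both checkpoints on CPU (paths are placeholders).
    pretrained = torch.load('pretrained_encoder.checkpoint', map_location='cpu')
    target = torch.load('transresnet.checkpoint', map_location='cpu')

    # Remap e.g. 'encoder.layers.0...' -> 'context_encoder.layers.0...'
    remapped = {
        k.replace('encoder.', 'context_encoder.', 1): v
        for k, v in pretrained['model'].items()
        if k.startswith('encoder.')
    }
    target['model'].update(remapped)
    torch.save(target, 'transresnet_warm_start.checkpoint')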