facebookresearch / mmf

A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
https://mmf.sh/

MMF training not saving model #501

Closed gireek closed 4 years ago

gireek commented 4 years ago

Problem edited below

vedanuj commented 4 years ago

Hi, this model is not trained using MMF, so we won't be able to help with this. You can train or use models trained with MMF by following the tutorial or the readme.

vedanuj commented 4 years ago

Follow the prerequisites here : https://github.com/facebookresearch/mmf/tree/master/projects/hateful_memes#prerequisites

gireek commented 4 years ago

I tried the following, from https://github.com/apsdehal/hm_example_mmf:

MMF_USER_DIR="." mmf_run config="configs/experiments/defaults.yaml" model=concat_vl dataset=hateful_memes training.num_workers=0

and also:

mmf_run config="configs/experiments/defaults.yaml" model=concat_vl dataset=hateful_memes training.num_workers=0 run_type=train_val

because your docs here: https://mmf.sh/docs/challenges/hateful_memes_challenge list the run_type=train_val option for training. But training does not happen. The logs in the save folder end up like this:

2020-08-20T17:24:02 | INFO | mmf.train : Total Parameters: 59437822. Trained Parameters: 59437822
2020-08-20T17:24:02 | INFO | mmf.train : Starting training...
2020-08-20T17:24:02 | INFO | mmf.train : Loading fasttext model now from /root/.cache/torch/mmf/wiki.en.bin
2020-08-20T17:28:04 | INFO | mmf.train : Finished loading fasttext model

and the models folder in the save folder is always empty.

vedanuj commented 4 years ago

Can you show the configs/experiments/defaults.yaml that you are using?

gireek commented 4 years ago

It is this one: https://github.com/apsdehal/hm_example_mmf/blob/master/configs/experiments/defaults.yaml

gireek commented 4 years ago

This is what comes on the terminal:

2020-08-20 19:18:34.411776: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Namespace(config_override=None, local_rank=None, opts=['config=configs/experiments/defaults.yaml', 'model=concat_vl', 'dataset=hateful_memes', 'training.num_workers=0', 'run_type=train_val'])
/usr/local/lib/python3.6/dist-packages/mmf/utils/configuration.py:528: UserWarning: Device specified is 'cuda' but cuda is not present. Switching to CPU version.
  + "Switching to CPU version."
Importing user_dir from /content/hm_example_mmf
Overriding option config to configs/experiments/defaults.yaml
Overriding option model to concat_vl
Overriding option datasets to hateful_memes
Overriding option training.num_workers to 0
Overriding option run_type to train_val
Using seed 36408782
Logging to: ./save/logs/train_2020-08-20T19:18:36.log
Downloading extras.tar.gz: 100% 211k/211k [00:00<00:00, 279kB/s]  
169876453it [05:55, 478358.09it/s]
Downloading: "https://download.pytorch.org/models/resnet152-b121ed2d.pth" to /root/.cache/torch/checkpoints/resnet152-b121ed2d.pth
100% 230M/230M [00:03<00:00, 60.8MB/s]

tcmalloc: large alloc 5423251456 bytes == 0x47eb8000 @  0x7f587bdd2887 0x7f57db499593 0x7f57db490897 0x7f57db4913b1 0x7f57db44e883 0x7f57db471b00 0x566ddc 0x50a783 0x50c1f4 0x507f24 0x509202 0x594b01 0x54a17f 0x5517c1 0x5a9eec 0x50a783 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4 0x509015 0x594b01 0x54ac61 0x59fe1e 0x50d596 0x507f24 0x509202 0x594b01
tcmalloc: large alloc 3023249408 bytes == 0x18babc000 @  0x7f587bdd2887 0x7f57db499593 0x7f57db4908db 0x7f57db4913b1 0x7f57db44e883 0x7f57db471b00 0x566ddc 0x50a783 0x50c1f4 0x507f24 0x509202 0x594b01 0x54a17f 0x5517c1 0x5a9eec 0x50a783 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4 0x509015 0x594b01 0x54ac61 0x59fe1e 0x50d596 0x507f24 0x509202 0x594b01
^C

Surprisingly, the '^C' appears automatically on the last line. I am doing this on Colab using a GPU. The other issue is that even when using the GPU it gives this:

/usr/local/lib/python3.6/dist-packages/mmf/utils/configuration.py:528: UserWarning: Device specified is 'cuda' but cuda is not present. Switching to CPU version.

gireek commented 4 years ago

Hi @vedanuj, I have followed every step here: https://github.com/gireek/facebook_mmf/blob/master/mmf.ipynb on both Colab GPU and TPU, but no model gets trained. Please check.

vedanuj commented 4 years ago

Try adding this where you are installing the pip package:

!pip install torch==1.5.0+cu101 torchvision==0.6.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
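
After restarting the runtime, a quick sanity check that the reinstall took effect (this only uses standard torch attributes; the expected version strings assume the exact wheels above):

import torch, torchvision
# Expect 1.5.0+cu101 and 0.6.0+cu101 if the wheels above were installed.
print(torch.__version__, torchvision.__version__)
# Should print True on a GPU runtime; if False, MMF will fall back to CPU.
print(torch.cuda.is_available())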
apsdehal commented 4 years ago

Where exactly are you running this? Please provide us the output of python -m torch.utils.collect_env.

gireek commented 4 years ago

@apsdehal this is on Colab GPU

Here is the output:

Collecting environment information...
PyTorch version: 1.5.0+cu101
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: version 3.12.0

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 418.67
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5

Versions of relevant libraries:
[pip3] numpy==1.18.5
[pip3] torch==1.5.0+cu101
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.5.0
[pip3] torchvision==0.6.0+cu101
[conda] Could not collect
apsdehal commented 4 years ago

Did you change your runtime type to GPU?

apsdehal commented 4 years ago

I have tested it after changing runtime to GPU and making sure torch and torchvision versions are installed as @vedanuj suggested and then restarting the runtime. Also, it seems like with FastText it will likely run out of memory. You can try our own colab for this at https://colab.research.google.com/github/facebookresearch/mmf/blob/notebooks/notebooks/mmf_hm_example.ipynb

gireek commented 4 years ago

Did you change your runtime type to GPU?

Yes I did that.

gireek commented 4 years ago

I have tested it after changing runtime to GPU and making sure torch and torchvision versions are installed as @vedanuj suggested and then restarting the runtime. Also, it seems like with FastText it will likely run out of memory. You can try our own colab for this at https://colab.research.google.com/github/facebookresearch/mmf/blob/notebooks/notebooks/mmf_hm_example.ipynb

When you tried it, did it work for you? I think it might be true that FastText is too big for Colab's RAM. Is there a way to load a lighter version of FastText, something like the smaller GloVe files? Apologies if this is not possible; I am new to NLP.

apsdehal commented 4 years ago

No, it is working fine. It throws a tcmalloc warning but works fine afterwards. See https://colab.research.google.com/drive/1s1Yfc0DmU5cuMVaDUokbAlcRTT0Te9fs?usp=sharing.

gireek commented 4 years ago

@apsdehal it stops after the tcmalloc message for me in Colab on your shared notebook; I have kept the runtime as GPU. In your shared notebook there are also no training logs on the terminal. Do they go into the log file? And secondly, in your case too ./save/models is empty. Where does the model get stored then?

apsdehal commented 4 years ago

Are you sure it stopped? It just doesn't log anything. You can check the logs in the save/train.log file. You won't see my local drive or whether the models are there, since your environment is brand new. It did save the models for me.

gireek commented 4 years ago

Yes @apsdehal, it stops, and the Colab you shared actually has an ls ./save/models cell, so if anything was saved it should have been listed, right? I tried once again with your Colab and saved it here: https://github.com/gireek/facebook_mmf/blob/master/Copy_of_Copy_of_mmf.ipynb

The last line shows that ./save/models is empty. The Jupyter notebook was run as-is on Colab.

gireek commented 4 years ago

After it stops, I checked ./save/train.log and these are the last 3 lines:

2020-08-22T04:56:36 | INFO | mmf.train : Total Parameters: 59437822. Trained Parameters: 59437822
2020-08-22T04:56:36 | INFO | mmf.train : Starting training...
2020-08-22T04:56:36 | INFO | mmf.train : Loading fasttext model now from /root/.cache/torch/mmf/wiki.en.bin

Kindly check this notebook: https://github.com/gireek/facebook_mmf/blob/master/Copy_of_Copy_of_mmf.ipynb

apsdehal commented 4 years ago

I didn’t run the last cell. I only checked the execution. I can share the full execution later with you.

apsdehal commented 4 years ago

Update: running it longer, it runs out of CUDA memory. Btw, the same thing is also available in the tutorial we created: https://colab.research.google.com/github/facebookresearch/mmf/blob/notebooks/notebooks/mmf_hm_example.ipynb. Why don't you try that one?

gireek commented 4 years ago

Doing this now, as given in the notebook you mentioned above:

!MMF_USER_DIR="." mmf_run config="configs/experiments/defaults.yaml" model=concat_vl dataset=hateful_memes training.num_workers=0 training.log_interval=50 training.max_updates=3000 training.batch_size=16 training.evaluation_interval=500

The logs are now:

2020-08-22T05:56:37 | INFO | mmf.train : progress: 50/3000, train/total_loss: 0.6580, train/total_loss/avg: 0.6580, train/hateful_memes/cross_entropy: 0.6580, train/hateful_memes/cross_entropy/avg: 0.6580, max mem: 3657.0, experiment: run, epoch: 1, num_updates: 50, iterations: 50, max_updates: 3000, lr: 0., ups: 0.18, time: 04m 40s 915ms, time_since_start: 16m 25s 254ms, eta: 04h 36m 47s 166ms
2020-08-22T05:57:13 | INFO | mmf.train : progress: 100/3000, train/total_loss: 0.6580, train/total_loss/avg: 0.6805, train/hateful_memes/cross_entropy: 0.6580, train/hateful_memes/cross_entropy/avg: 0.6805, max mem: 3657.0, experiment: run, epoch: 1, num_updates: 100, iterations: 100, max_updates: 3000, lr: 0., ups: 1.39, time: 36s 494ms, time_since_start: 17m 01s 748ms, eta: 35m 20s 892ms
2020-08-22T05:57:49 | INFO | mmf.train : progress: 150/3000, train/total_loss: 0.7029, train/total_loss/avg: 0.6917, train/hateful_memes/cross_entropy: 0.7029, train/hateful_memes/cross_entropy/avg: 0.6917, max mem: 3657.0, experiment: run, epoch: 1, num_updates: 150, iterations: 150, max_updates: 3000, lr: 0., ups: 1.39, time: 36s 172ms, time_since_start: 17m 37s 921ms, eta: 34m 25s 975ms
2020-08-22T05:58:26 | INFO | mmf.train : progress: 200/3000, train/total_loss: 0.6580, train/total_loss/avg: 0.6577, train/hateful_memes/cross_entropy: 0.6580, train/hateful_memes/cross_entropy/avg: 0.6577, max mem: 3657.0, experiment: run, epoch: 1, num_updates: 200, iterations: 200, max_updates: 3000, lr: 0., ups: 1.39, time: 36s 537ms, time_since_start: 18m 14s 459ms, eta: 34m 10s 220ms

I think these training parameters in the command are important; otherwise, training gets out of control in terms of memory requirements.
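
As a rough cross-check of the "max mem" figure in the progress lines, peak GPU memory can be queried directly (standard torch API; whether MMF's reported number uses the same counter is an assumption):

import torch
# Peak GPU memory allocated by tensors in this process, converted to MB.
print(torch.cuda.max_memory_allocated() / (1024 ** 2))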

gireek commented 4 years ago

Update: Model is getting trained and it is saving best.ckpt as well.

In your notebook this is the way to load a pretrained model:

from mmf.models.mmbt import MMBT
model = MMBT.from_pretrained("mmbt.hateful_memes.images")

Is there a way to load my self-trained ckpt file into the model so I can directly use model.classify, as you do in the notebook like this:

image_url = "https://i.imgur.com/tEcsk5q.jpg" #@param {type:"string"}
text = "look how many people love you" #@param {type: "string"}
output = model.classify(image_url, text)
apsdehal commented 4 years ago

Here is working copy of your colab: https://colab.research.google.com/drive/19fV7uQnPNfJ3QmjFzjt5XkQmqIEzUBHJ?usp=sharing I also displayed train.log contents to show that it is indeed training.

apsdehal commented 4 years ago

For your custom model, implement the classify method as implemented here: https://github.com/facebookresearch/mmf/blob/master/mmf/models/interfaces/mmbt.py#L60
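
If you only need to restore the weights from your own run before wiring up such an interface, here is a minimal sketch (assumptions: ./save/best.ckpt stores the state dict under a "model" key, and model is an already-built instance of your concat_vl model):

import torch

ckpt = torch.load("./save/best.ckpt", map_location="cpu")
# Assumption: MMF wraps the weights under "model"; fall back to the raw dict otherwise.
state_dict = ckpt.get("model", ckpt)
model.load_state_dict(state_dict)
model.eval()

The classify step itself would still come from whatever interface you implement along the lines of the MMBT example linked above.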

gireek commented 4 years ago

Here is working copy of your colab: https://colab.research.google.com/drive/19fV7uQnPNfJ3QmjFzjt5XkQmqIEzUBHJ?usp=sharing I also displayed train.log contents to show that it is indeed training.

Yes, training.batch_size was required.