Follow the prerequisites here: https://github.com/facebookresearch/mmf/tree/master/projects/hateful_memes#prerequisites
I tried the following from https://github.com/apsdehal/hm_example_mmf:
MMF_USER_DIR="." mmf_run config="configs/experiments/defaults.yaml" model=concat_vl dataset=hateful_memes training.num_workers=0
also:
mmf_run config="configs/experiments/defaults.yaml" model=concat_vl dataset=hateful_memes training.num_workers=0 run_type=train_val
because your docs here: https://mmf.sh/docs/challenges/hateful_memes_challenge
list the run_type=train_val option for training, but training does not happen.
The logs in the save folder end up like this:
2020-08-20T17:24:02 | INFO | mmf.train : Total Parameters: 59437822. Trained Parameters: 59437822
2020-08-20T17:24:02 | INFO | mmf.train : Starting training...
2020-08-20T17:24:02 | INFO | mmf.train : Loading fasttext model now from /root/.cache/torch/mmf/wiki.en.bin
2020-08-20T17:28:04 | INFO | mmf.train : Finished loading fasttext model
and the models folder in the save folder is always empty.
Can you show the configs/experiments/defaults.yaml
that you are using?
This is what comes on the terminal:
2020-08-20 19:18:34.411776: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Namespace(config_override=None, local_rank=None, opts=['config=configs/experiments/defaults.yaml', 'model=concat_vl', 'dataset=hateful_memes', 'training.num_workers=0', 'run_type=train_val'])
/usr/local/lib/python3.6/dist-packages/mmf/utils/configuration.py:528: UserWarning: Device specified is 'cuda' but cuda is not present. Switching to CPU version.
+ "Switching to CPU version."
Importing user_dir from /content/hm_example_mmf
Overriding option config to configs/experiments/defaults.yaml
Overriding option model to concat_vl
Overriding option datasets to hateful_memes
Overriding option training.num_workers to 0
Overriding option run_type to train_val
Using seed 36408782
Logging to: ./save/logs/train_2020-08-20T19:18:36.log
Downloading extras.tar.gz: 100% 211k/211k [00:00<00:00, 279kB/s]
169876453it [05:55, 478358.09it/s]
Downloading: "https://download.pytorch.org/models/resnet152-b121ed2d.pth" to /root/.cache/torch/checkpoints/resnet152-b121ed2d.pth
100% 230M/230M [00:03<00:00, 60.8MB/s]
tcmalloc: large alloc 5423251456 bytes == 0x47eb8000 @ 0x7f587bdd2887 0x7f57db499593 0x7f57db490897 0x7f57db4913b1 0x7f57db44e883 0x7f57db471b00 0x566ddc 0x50a783 0x50c1f4 0x507f24 0x509202 0x594b01 0x54a17f 0x5517c1 0x5a9eec 0x50a783 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4 0x509015 0x594b01 0x54ac61 0x59fe1e 0x50d596 0x507f24 0x509202 0x594b01
tcmalloc: large alloc 3023249408 bytes == 0x18babc000 @ 0x7f587bdd2887 0x7f57db499593 0x7f57db4908db 0x7f57db4913b1 0x7f57db44e883 0x7f57db471b00 0x566ddc 0x50a783 0x50c1f4 0x507f24 0x509202 0x594b01 0x54a17f 0x5517c1 0x5a9eec 0x50a783 0x50cfd6 0x509918 0x50a64d 0x50c1f4 0x509918 0x50a64d 0x50c1f4 0x509015 0x594b01 0x54ac61 0x59fe1e 0x50d596 0x507f24 0x509202 0x594b01
^C
Surprisingly, the '^C' appears automatically on the last line. I am doing this on Colab using a GPU. The other issue is that even when using the GPU it gives this:
/usr/local/lib/python3.6/dist-packages/mmf/utils/configuration.py:528: UserWarning: Device specified is 'cuda' but cuda is not present. Switching to CPU version.
Hi @vedanuj, I have followed every step here: https://github.com/gireek/facebook_mmf/blob/master/mmf.ipynb on both Colab GPU and TPU, but no model gets trained. Please check.
Try adding this where you are installing the pip package:
!pip install torch==1.5.0+cu101 torchvision==0.6.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
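As a quick sanity check after reinstalling and restarting the runtime, something like the following should confirm that the CUDA build of torch is the one actually being imported:
import torch

print(torch.__version__)          # expect 1.5.0+cu101 after the reinstall
print(torch.version.cuda)         # expect 10.1
print(torch.cuda.is_available())  # should be True on a Colab GPU runtime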
Where exactly are you running this? Please provide us the output of python -m torch.utils.collect_env
@apsdehal this is on Colab GPU
Here is the output:
Collecting environment information...
PyTorch version: 1.5.0+cu101
Is debug build: No
CUDA used to build PyTorch: 10.1
OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: version 3.12.0
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 418.67
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
Versions of relevant libraries:
[pip3] numpy==1.18.5
[pip3] torch==1.5.0+cu101
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.5.0
[pip3] torchvision==0.6.0+cu101
[conda] Could not collect
Did you change your runtime type to GPU?
I have tested it after changing runtime to GPU and making sure torch and torchvision versions are installed as @vedanuj suggested and then restarting the runtime. Also, it seems like with FastText it will likely run out of memory. You can try our own colab for this at https://colab.research.google.com/github/facebookresearch/mmf/blob/notebooks/notebooks/mmf_hm_example.ipynb
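Since loading wiki.en.bin is where the runs above stall, a rough way to check how much RAM the Colab VM has free before that step (psutil ships with Colab; the ~8 GB figure is inferred from the two tcmalloc allocations in the log above):
import psutil

# The two tcmalloc allocations above total roughly 8 GB, so there needs to be
# at least that much headroom before the fastText binary is loaded.
mem = psutil.virtual_memory()
print(f"Available RAM: {mem.available / 1e9:.1f} GB of {mem.total / 1e9:.1f} GB")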
> Did you change your runtime type to GPU?
Yes I did that.
> I have tested it after changing runtime to GPU and making sure torch and torchvision versions are installed as @vedanuj suggested and then restarting the runtime. Also, it seems like with FastText it will likely run out of memory. You can try our own colab for this at https://colab.research.google.com/github/facebookresearch/mmf/blob/notebooks/notebooks/mmf_hm_example.ipynb
When you tried it, did it work for you? I think it might be true that fastText is too big for Colab's RAM. Is there a way to load a lighter version of fastText, maybe? Something like the smaller versions of GloVe files? Apologies if this is not possible, as I am new to NLP.
No, it is working fine. It throws the tcmalloc warnings but works fine afterwards. See https://colab.research.google.com/drive/1s1Yfc0DmU5cuMVaDUokbAlcRTT0Te9fs?usp=sharing.
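On the lighter-fastText question above: MMF itself is not involved here, but the standalone fasttext package can shrink a .bin by reducing the vector dimension. A sketch, assuming the paths from the logs above (the full model still has to be loaded once to do the reduction, and MMF would still need to be pointed at the smaller file):
import fasttext
import fasttext.util

# Loads the full ~8 GB model once, then shrinks the vectors from 300 to 100
# dimensions and writes a much smaller binary next to the original.
ft = fasttext.load_model("/root/.cache/torch/mmf/wiki.en.bin")
fasttext.util.reduce_model(ft, 100)
ft.save_model("/root/.cache/torch/mmf/wiki.en.reduced.bin")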
@apsdehal it stops after the tcmalloc error for me in Colab on your shared notebook; I have kept the runtime as GPU. In your shared notebook there are also no training logs on the terminal. Do they go into the log file? And secondly, in your case too ./save/models is empty. Where does the model get stored then?
Are you sure it stopped? It just doesn't log anything to the terminal. You can check the logs in the save/train.log file. You won't see my local drive or whether the models are there, since your environment is brand new. It did save the models for me.
Yes @apsdehal, it stops, and the Colab you shared actually has an ls ./save/models cell,
so if anything had been saved it would have been listed, right? I tried once again with your Colab and saved it here: https://github.com/gireek/facebook_mmf/blob/master/Copy_of_Copy_of_mmf.ipynb
The last line shows that ./save/models is empty. The Jupyter notebook was run as-is on Colab.
After it stops, I checked ./save/train.log and these are the last 3 lines:
2020-08-22T04:56:36 | INFO | mmf.train : Total Parameters: 59437822. Trained Parameters: 59437822
2020-08-22T04:56:36 | INFO | mmf.train : Starting training...
2020-08-22T04:56:36 | INFO | mmf.train : Loading fasttext model now from /root/.cache/torch/mmf/wiki.en.bin
Kindly check this notebook: https://github.com/gireek/facebook_mmf/blob/master/Copy_of_Copy_of_mmf.ipynb
I didn’t run the last cell. I only checked the execution. I can share the full execution later with you.
Update: when run longer, it runs out of CUDA memory. Btw, the same thing is also available in the tutorial we created: https://colab.research.google.com/github/facebookresearch/mmf/blob/notebooks/notebooks/mmf_hm_example.ipynb. Why don't you try that one?
Doing this now, as given in the notebook you mentioned above:
!MMF_USER_DIR="." mmf_run config="configs/experiments/defaults.yaml" model=concat_vl dataset=hateful_memes training.num_workers=0 training.log_interval=50 training.max_updates=3000 training.batch_size=16 training.evaluation_interval=500
the logs are now:
2020-08-22T05:56:37 | INFO | mmf.train : progress: 50/3000, train/total_loss: 0.6580, train/total_loss/avg: 0.6580, train/hateful_memes/cross_entropy: 0.6580, train/hateful_memes/cross_entropy/avg: 0.6580, max mem: 3657.0, experiment: run, epoch: 1, num_updates: 50, iterations: 50, max_updates: 3000, lr: 0., ups: 0.18, time: 04m 40s 915ms, time_since_start: 16m 25s 254ms, eta: 04h 36m 47s 166ms
2020-08-22T05:57:13 | INFO | mmf.train : progress: 100/3000, train/total_loss: 0.6580, train/total_loss/avg: 0.6805, train/hateful_memes/cross_entropy: 0.6580, train/hateful_memes/cross_entropy/avg: 0.6805, max mem: 3657.0, experiment: run, epoch: 1, num_updates: 100, iterations: 100, max_updates: 3000, lr: 0., ups: 1.39, time: 36s 494ms, time_since_start: 17m 01s 748ms, eta: 35m 20s 892ms
2020-08-22T05:57:49 | INFO | mmf.train : progress: 150/3000, train/total_loss: 0.7029, train/total_loss/avg: 0.6917, train/hateful_memes/cross_entropy: 0.7029, train/hateful_memes/cross_entropy/avg: 0.6917, max mem: 3657.0, experiment: run, epoch: 1, num_updates: 150, iterations: 150, max_updates: 3000, lr: 0., ups: 1.39, time: 36s 172ms, time_since_start: 17m 37s 921ms, eta: 34m 25s 975ms
2020-08-22T05:58:26 | INFO | mmf.train : progress: 200/3000, train/total_loss: 0.6580, train/total_loss/avg: 0.6577, train/hateful_memes/cross_entropy: 0.6580, train/hateful_memes/cross_entropy/avg: 0.6577, max mem: 3657.0, experiment: run, epoch: 1, num_updates: 200, iterations: 200, max_updates: 3000, lr: 0., ups: 1.39, time: 36s 537ms, time_since_start: 18m 14s 459ms, eta: 34m 10s 220ms
I think these training parameters in the command are important; otherwise training gets out of control in terms of memory requirements.
Update: Model is getting trained and it is saving best.ckpt as well.
In your notebook this is the way to load a pretrained model:
from mmf.models.mmbt import MMBT
model = MMBT.from_pretrained("mmbt.hateful_memes.images")
Is there a way to load my self-trained ckpt file into the model so that I can directly use model.classify, as you do in the notebook like this:
image_url = "https://i.imgur.com/tEcsk5q.jpg" #@param {type:"string"}
text = "look how many people love you" #@param {type: "string"}
output = model.classify(image_url, text)
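For context on the snippet above: in the MMF tutorial notebook, classify() returns a dict with a predicted label and a confidence score, so the output is typically consumed along these lines (key names assumed from that notebook):
# Assumes classify() returns {"label": 0 or 1, "confidence": float} as in the
# MMF tutorial notebook; adjust if your MMF version returns something else.
verdict = "hateful" if output["label"] == 1 else "not hateful"
print(f"{verdict} ({output['confidence'] * 100:.1f}% confidence)")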
Here is a working copy of your Colab: https://colab.research.google.com/drive/19fV7uQnPNfJ3QmjFzjt5XkQmqIEzUBHJ?usp=sharing I also displayed the train.log contents to show that it is indeed training.
For your custom model, implement the classify method as implemented here: https://github.com/facebookresearch/mmf/blob/master/mmf/models/interfaces/mmbt.py#L60
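A minimal sketch of the checkpoint-loading half of that, assuming the best.ckpt from the run above and an already-constructed instance of the trained concat_vl model (building the model and its processors is omitted; the "model" key is the usual layout of MMF checkpoints but is worth verifying on your file):
import torch

def load_trained_weights(model, ckpt_path="./save/models/best.ckpt"):
    """Load MMF-saved weights into an already-built model instance."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    # MMF checkpoints typically store the weights under a "model" key;
    # fall back to the raw object if this particular checkpoint is flat.
    state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
    model.load_state_dict(state_dict)
    return model.eval()
A classify() like the one in the linked mmbt interface would then wrap this model with the image and text processors, a forward pass, and a softmax.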
> Here is a working copy of your Colab: https://colab.research.google.com/drive/19fV7uQnPNfJ3QmjFzjt5XkQmqIEzUBHJ?usp=sharing I also displayed the train.log contents to show that it is indeed training.
Yes, training.batch_size was required.
Problem edited below