facebookresearch / mmf

A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
https://mmf.sh/

Error when training the MMF tutorial ConcatBERT model #582

Closed: mmiakashs closed this issue 3 years ago

mmiakashs commented 4 years ago

I was trying to train the ConcatBERT model from the following tutorial: https://mmf.sh/docs/tutorials/concat_bert. However, I am getting the errors below. Could anyone please let me know if I am missing anything?

Traceback (most recent call last):
  File "/home/anaconda3/envs/proj_mm_meme/bin/mmf_run", line 33, in <module>
    sys.exit(load_entry_point('mmf', 'console_scripts', 'mmf_run')())
  File "/home/repo/mmf/mmf_cli/run.py", line 118, in run
    nprocs=config.distributed.world_size,
  File "/home/anaconda3/envs/proj_mm_meme/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/anaconda3/envs/proj_mm_meme/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/home/anaconda3/envs/proj_mm_meme/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/anaconda3/envs/proj_mm_meme/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/repo/mmf/mmf_cli/run.py", line 66, in distributed_main
    main(configuration, init_distributed=True, predict=predict)
  File "/home/repo/mmf/mmf_cli/run.py", line 56, in main
    trainer.train()
  File "/home/repo/mmf/mmf/trainers/mmf_trainer.py", line 111, in train
    self.training_loop()
  File "/home/repo/mmf/mmf/trainers/core/training_loop.py", line 31, in training_loop
    self.run_training_epoch()
  File "/home/repo/mmf/mmf/trainers/core/training_loop.py", line 89, in run_training_epoch
    report = self.run_training_batch(batch, num_batches_for_this_update)
  File "/home/repo/mmf/mmf/trainers/core/training_loop.py", line 164, in run_training_batch
    self._backward(loss)
  File "/home/repo/mmf/mmf/trainers/core/training_loop.py", line 187, in _backward
    loss.backward()
AttributeError: 'float' object has no attribute 'backward'
apsdehal commented 4 years ago

Thanks for reporting the issue. We are looking into it.

apsdehal commented 4 years ago

hi,

I think this is happening because of a typo in the tutorial's experiment config: the key under model_config should be concat_bert.
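
If the key does not match the registered model name, the losses listed in the config are never attached to the model, so the value reaching the trainer's _backward step is presumably a plain Python number rather than a torch.Tensor, which matches the traceback. A minimal sketch (plain PyTorch, not MMF code) of that failure mode:

import torch

# A loss that is a tensor in the autograd graph backpropagates as expected.
tensor_loss = torch.tensor(0.5, requires_grad=True)
tensor_loss.backward()

# A loss that has degraded to a bare Python float has no backward() method,
# which raises exactly the AttributeError reported above.
float_loss = 0.5
float_loss.backward()  # AttributeError: 'float' object has no attribute 'backward'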

Can you try with this experiment config:

includes:
  - configs/datasets/hateful_memes/bert.yaml

model_config:
  concat_bert:
    classifier:
      type: mlp
      params:
        num_layers: 2
    losses:
      - type: cross_entropy

scheduler:
  type: warmup_linear
  params:
    num_warmup_steps: 2000
    num_training_steps: ${training.max_updates}

optimizer:
  type: adam_w
  params:
    lr: 5e-5
    eps: 1e-8

evaluation:
  metrics:
    - accuracy
    - binary_f1
    - roc_auc

training:
  batch_size: 64
  lr_scheduler: true
  max_updates: 22000
  early_stop:
    criteria: hateful_memes/roc_auc
    minimize: false
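
As a quick sanity check (a hypothetical snippet, not part of MMF), you can load the experiment config with OmegaConf and confirm that the losses sit under the concat_bert key that the model actually reads:

from omegaconf import OmegaConf

# The path is a placeholder for wherever you saved the experiment config above.
config = OmegaConf.load("concat_bert_experiment.yaml")

# The key under model_config must match the registered model name; otherwise
# the classifier/losses overrides are not applied to the model, which appears
# to be the cause of the error above.
assert "concat_bert" in config.model_config
assert len(config.model_config.concat_bert.losses) > 0, "no losses configured"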

We will open a PR to fix this in the docs.

mmiakashs commented 3 years ago

@apsdehal Thanks, it is working perfectly now. Sorry, I should have noticed that.