JohnGiorgi / DeCLUTR

The corresponding code for our paper "DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations". Do not hesitate to open an issue if you run into any trouble!
https://aclanthology.org/2021.acl-long.72/
Apache License 2.0

Encounter RuntimeError while running with Apex #60

Closed: subercui closed this issue 4 years ago

subercui commented 4 years ago

Running with Apex via `allennlp train configs/contrastive.jsonnet -s tmp --include-package t2t -o '{"trainer": {"opt_level": "O1"}}'` raises the following exception:

Traceback (most recent call last):
  File "/h/haotian/.conda/envs/t2tCLR/bin/allennlp", line 11, in <module>
    load_entry_point('allennlp', 'console_scripts', 'allennlp')()
  File "/scratch/ssd001/home/haotian/Code/allennlp/allennlp/__main__.py", line 18, in run
    main(prog="allennlp")
  File "/scratch/ssd001/home/haotian/Code/allennlp/allennlp/commands/__init__.py", line 93, in main
    args.func(args)
  File "/scratch/ssd001/home/haotian/Code/allennlp/allennlp/commands/train.py", line 143, in train_model_from_args
    dry_run=args.dry_run,
  File "/scratch/ssd001/home/haotian/Code/allennlp/allennlp/commands/train.py", line 202, in train_model_from_file
    dry_run=dry_run,
  File "/scratch/ssd001/home/haotian/Code/allennlp/allennlp/commands/train.py", line 265, in train_model
    dry_run=dry_run,
  File "/scratch/ssd001/home/haotian/Code/allennlp/allennlp/commands/train.py", line 462, in _train_worker
    metrics = train_loop.run()
  File "/scratch/ssd001/home/haotian/Code/allennlp/allennlp/commands/train.py", line 521, in run
    return self.trainer.train()
  File "/scratch/ssd001/home/haotian/Code/allennlp/allennlp/training/trainer.py", line 687, in train
    train_metrics = self._train_epoch(epoch)
  File "/scratch/ssd001/home/haotian/Code/allennlp/allennlp/training/trainer.py", line 465, in _train_epoch
    batch_outputs = self.batch_outputs(batch, for_training=True)
  File "/scratch/ssd001/home/haotian/Code/allennlp/allennlp/training/trainer.py", line 380, in batch_outputs
    output_dict = self._pytorch_model(**batch)
  File "/h/haotian/.conda/envs/t2tCLR/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/scratch/ssd001/home/haotian/Code/t2t/t2t/models/contrastive_text_encoder.py", line 122, in forward
    contrastive_loss = self._loss(embeddings, labels)
  File "/h/haotian/.conda/envs/t2tCLR/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/h/haotian/.conda/envs/t2tCLR/lib/python3.7/site-packages/pytorch_metric_learning/losses/base_metric_loss_function.py", line 53, in forward
    loss = self.compute_loss(embeddings, labels, indices_tuple)
  File "/h/haotian/.conda/envs/t2tCLR/lib/python3.7/site-packages/pytorch_metric_learning/losses/generic_pair_loss.py", line 40, in compute_loss
    return self.loss_method(mat, labels, indices_tuple)
  File "/h/haotian/.conda/envs/t2tCLR/lib/python3.7/site-packages/pytorch_metric_learning/losses/generic_pair_loss.py", line 59, in pair_based_loss
    return self._compute_loss(pos_pair, neg_pair, indices_tuple)
  File "/h/haotian/.conda/envs/t2tCLR/lib/python3.7/site-packages/pytorch_metric_learning/losses/ntxent_loss.py", line 20, in _compute_loss
    max_val = torch.max(pos_pairs, torch.max(neg_pairs, dim=1, keepdim=True)[0])
RuntimeError: Expected object of scalar type Half but got scalar type Float for argument #2 'other' in call to _th_max
  0%|                                                                                                   | 0/1 [00:01<?, ?it/s]
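
For the record, the failing call is the binary `torch.max` at the bottom of the trace, where one operand is float16 (from Apex O1) and the other float32. A minimal repro of just that dtype mismatch, assuming a PyTorch version from around the time of this issue (~1.4; newer versions type-promote instead of raising):

```python
import torch

pos_pairs = torch.randn(8, 1).half()   # float16, as produced under Apex O1
row_max = torch.randn(8, 1)            # float32, e.g. from a constant Apex never cast

# On PyTorch ~1.4 this raises:
#   RuntimeError: Expected object of scalar type Half but got scalar type Float
#   for argument #2 'other' in call to _th_max
max_val = torch.max(pos_pairs, row_max)
```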
JohnGiorgi commented 4 years ago

Thanks @subercui, this error arises from the PyTorch Metric Learning library. I opened an issue on Apex here but got no response :( Maybe you can open an issue on PyTorch Metric Learning?

subercui commented 4 years ago

Thanks! I'll have a look

JohnGiorgi commented 4 years ago

I found a manual fix that works: install PyTorch Metric Learning from source and, in `NTXentLoss`, change

torch.max(neg_pairs, dim=1, keepdim=True)[0])

to

torch.max(neg_pairs, dim=1, keepdim=True)[0].half())

Still, I think it makes sense to raise this issue on the PyTorch Metric Learning GitHub.
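
In context, that is line 20 of `ntxent_loss.py` from the traceback above; the one-line patch would look roughly like this (a sketch of the edit, not the library's final fix):

```python
# pytorch_metric_learning/losses/ntxent_loss.py, inside NTXentLoss._compute_loss
# Before (fails when pos_pairs is float16 but the row-wise max is float32):
#   max_val = torch.max(pos_pairs, torch.max(neg_pairs, dim=1, keepdim=True)[0])
# After: cast the row-wise max back to half so both operands match.
max_val = torch.max(pos_pairs, torch.max(neg_pairs, dim=1, keepdim=True)[0].half())
```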

KevinMusgrave commented 4 years ago

I think this happens because I create infinity values using Python's float('inf'). I could add an optional half_precision flag to all loss functions and, if it's True, cast all numbers made with float() to PyTorch's half().
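
To illustrate the mechanism (a minimal sketch, not the library's actual code): tensors built from Python floats default to float32, so any such constant that flows into a binary op with the half-precision similarity matrix triggers the Half/Float mismatch above:

```python
import torch

# Constants built from Python floats default to float32:
inf_mask = torch.full((4, 1), float('-inf'))
print(inf_mask.dtype)  # torch.float32

# Under Apex O1 the similarity matrix is float16, so mixing the two
# dtypes in ops like torch.max reproduces the RuntimeError above.
print(torch.randn(4, 4).half().dtype)  # torch.float16
```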

JohnGiorgi commented 4 years ago

Ah, I think you are right. There's a discussion on this HF Transformers PR where they end up writing an assert for a similar scenario:

masked_bias = self.masked_bias.to(w.dtype)
assert masked_bias.item() != -float("inf"), "Make sure `self.masked_bias` is not `-inf` in fp16 mode"
w = torch.where(mask, w, masked_bias)

What about replacing float('inf') with a very large value instead (see here)? That way, amp can handle it automatically and there's no need for the user to specify half_precision (update: upon closer inspection of that issue, I am not sure if this will actually work).
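
One dtype-safe way to express "a very large value" is to derive it from the tensor's own dtype via torch.finfo, the idiom later HF code settled on. A sketch of that approach (mask_for_exp is a hypothetical helper, not part of either library):

```python
import torch

def mask_for_exp(scores: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Fill masked entries with the most negative finite value representable
    in scores' dtype, so torch.exp sends them to (roughly) zero without
    mixing float32 constants into a float16 graph."""
    neg_fill = torch.finfo(scores.dtype).min  # -65504.0 for float16
    return scores.masked_fill(mask, neg_fill)
```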

KevinMusgrave commented 4 years ago

At least for NTXentLoss, setting it to a large negative value (instead of float('-inf')) would be fine, because the purpose is to make particular entries 0 when passed to torch.exp. I'll have to check whether it makes sense for the other places where I use float().
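
A quick check of that reasoning, using float16's most negative finite value (-65504) as the stand-in for -inf:

```python
import torch

big_neg = torch.finfo(torch.float16).min  # -65504.0
x = torch.tensor([0.5, big_neg])          # float32 here; float16 underflows the same way
print(torch.exp(x))                       # tensor([1.6487, 0.0000]): the masked entry is exactly 0
```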

JohnGiorgi commented 4 years ago

Awesome, thanks for weighing in!

KevinMusgrave commented 4 years ago

v0.9.90.dev0 supports half precision:

pip install pytorch-metric-learning==0.9.90.dev0
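
For anyone landing here later, a hedged usage sketch of the released API (NTXentLoss and its temperature argument exist in pytorch-metric-learning; half-precision tensors need a CUDA device):

```python
import torch
from pytorch_metric_learning.losses import NTXentLoss

loss_fn = NTXentLoss(temperature=0.1)

# Half-precision embeddings, as produced under Apex O1 on GPU.
embeddings = torch.randn(8, 128, dtype=torch.float16, device="cuda")
labels = torch.arange(4, device="cuda").repeat(2)  # two "views" per item, paired by label

loss = loss_fn(embeddings, labels)
print(loss.item())
```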
JohnGiorgi commented 4 years ago

@KevinMusgrave Awesome! Thanks a lot.