Akeepers / LEAR

The implementation of our EMNLP 2021 paper "Enhanced Language Representation with Label Knowledge for Span Extraction".

Reproducing the event detection task #5

Closed · yc1999 closed this 2 years ago

yc1999 commented 2 years ago

Hi~

  1. Have you run the event detection code on multiple GPUs? Running on 4 GPUs, I get the following error:

    Training:   0%|          | 0/1159 [00:22<?, ?it/s]
    Traceback (most recent call last):
    File "run_trigger_extraction.py", line 405, in <module>
    main()
    File "run_trigger_extraction.py", line 378, in main
    train(args, model, processor)
    File "run_trigger_extraction.py", line 243, in train
    pred_sub_heads, pred_sub_tails = model(data, add_label_info=add_label_info)
    File "/home/yc21/software/anaconda3/envs/lear/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
    File "/home/yc21/software/anaconda3/envs/lear/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
    File "/home/yc21/software/anaconda3/envs/lear/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
    File "/home/yc21/software/anaconda3/envs/lear/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
    File "/home/yc21/software/anaconda3/envs/lear/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
    RuntimeError: Caught RuntimeError in replica 0 on device 0.
    Original Traceback (most recent call last):
    File "/home/yc21/software/anaconda3/envs/lear/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
    File "/home/yc21/software/anaconda3/envs/lear/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
    File "/home/yc21/project/LEAR/models/model_event.py", line 635, in forward
    fused_results = self.label_fusing_layer(
    File "/home/yc21/software/anaconda3/envs/lear/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
    File "/home/yc21/project/LEAR/utils/model_utils.py", line 320, in forward
    return self.get_fused_feature_with_attn(token_embs, label_embs, input_mask, label_input_mask, return_scores=return_scores)
    File "/home/yc21/project/LEAR/utils/model_utils.py", line 504, in get_fused_feature_with_attn
    scores = torch.matmul(token_feature_fc, label_feature_t).view(
    RuntimeError: shape '[4, 48, 33, -1]' is invalid for input of size 160512
  2. On a single GPU with batch_size=2 and gradient_accumulation_step=16, training runs, but the training loss converges to around 6 and the dev F1 stays at 0. I'm not sure whether this is just a problem on my end or whether others see it too. Has anyone reproduced the reported results?

Thanks, everyone~
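The RuntimeError above can be reproduced in isolation. The shapes below are inferred from the error message (a target of 4 × 48 × 33 × -1 against 160512 elements, which happens to factor as 4 × 38 × 33 × 32) and are therefore a guess at what went wrong: DataParallel scattered the batch, and this replica's chunk was padded to a shorter sequence length than the full-batch maximum baked into the target shape.

```python
import torch

# Reproduce the .view() failure from the traceback. The shapes are inferred
# from the error message and are illustrative: the target shape assumes the
# full batch's max sequence length (48), while this replica's chunk was
# padded only to its own max length (38), so the element counts disagree.
batch, replica_seq_len, num_labels, hidden = 4, 38, 33, 32
scores = torch.randn(batch * replica_seq_len, num_labels * hidden)  # 160512 elements

try:
    scores.view(4, 48, 33, -1)  # full-batch seq len -> wrong element count
except RuntimeError as err:
    print("view failed:", err)

# Padding every batch to one fixed max length keeps shapes consistent
# across DataParallel replicas:
padded = torch.randn(4 * 48, 33 * 32)
print(padded.view(4, 48, 33, -1).shape)  # torch.Size([4, 48, 33, 32])
```

This is why the maintainer's reply below mentions `padding_to_max`: with DP, every sample must be padded to one global max length so each replica's chunk has the same sequence dimension.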

Akeepers commented 2 years ago
  1. As I recall, I wrote the multi-GPU code back when the GPUs were relatively free. It uses DP, which is why padding_to_max is needed. Later there were many experiments and not enough GPUs, so I never rewrote it as DDP; I mainly relied on gradient accumulation instead.

  2. That result is wrong; I suggest checking your setup. In particular, note this from the README: "Note: The thunlp has updated the repo HMEAE recently, which causing the mismatch of data. Make sure you use the earlier version for ED task."
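The gradient-accumulation approach mentioned in point 1 is the standard loop: divide each micro-batch loss by the number of accumulation steps and step the optimizer only every N batches. A minimal sketch with a toy model (all names are illustrative, not the repo's actual training loop):

```python
import torch

# Minimal gradient-accumulation loop (toy model and random data; the real
# script controls this via its gradient_accumulation_step argument).
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accumulation_steps = 4

optimizer.zero_grad()
for step in range(16):
    x, y = torch.randn(2, 8), torch.randn(2, 1)   # micro-batch of 2
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accumulation_steps).backward()        # average over the window
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                          # effective batch = 2 * 4
        optimizer.zero_grad()
```

This is the single-GPU configuration the issue author describes (small batch_size plus a large gradient_accumulation_step to keep the effective batch size).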

HuiResearch commented 2 years ago

The gain on event detection is not large. With the old HMEAE data split, plain BIO tagging already reaches 81+, sometimes even 82.
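"Plain BIO tagging" here frames trigger detection as per-token sequence labeling. A hypothetical illustration (the sentence and labels are invented, and the decoder below is a generic sketch, not HMEAE's code):

```python
# Hypothetical BIO encoding of one trigger span for event detection:
# "arrested" is tagged as the trigger of a Justice.Arrest-Jail event.
tokens = ["Police", "arrested", "the", "suspect", "yesterday"]
labels = ["O", "B-Justice.Arrest-Jail", "O", "O", "O"]

def spans_from_bio(tokens, labels):
    """Decode (start, end, type) trigger spans from BIO labels."""
    spans, start, etype = [], None, None
    for i, lab in enumerate(labels + ["O"]):       # sentinel flushes last span
        if (lab.startswith("B-") or lab == "O") and start is not None:
            spans.append((start, i - 1, etype))    # close the open span
            start, etype = None, None
        if lab.startswith("B-"):
            start, etype = i, lab[2:]
        elif lab.startswith("I-") and start is None:
            start, etype = i, lab[2:]              # tolerate dangling I- tags
    return spans

print(spans_from_bio(tokens, labels))  # [(1, 1, 'Justice.Arrest-Jail')]
```

Span F1 is then computed over these decoded (start, end, type) triples, which is why the data split matters so much when comparing numbers.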

Akeepers commented 2 years ago

@huanghuidmml Hmm, I have never obtained results that high myself.

yc1999 commented 2 years ago

The cause of my second problem was that I had set --task_layer_lr to 2e-4, when it should be 20. The reason is here: https://github.com/Akeepers/LEAR/blob/8ae3ed0ae6fa69a85872395d6cbbbf40a55f1d27/run_trigger_extraction.py#L160-L173
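For context: the linked lines suggest that --task_layer_lr is a multiplier on the base learning rate rather than an absolute rate (20 would be absurd as a raw learning rate), so passing 2e-4 effectively froze the task layers. A sketch of per-group learning rates under that reading (module names are stand-ins, not the repo's actual attributes):

```python
import torch

# Illustrative optimizer parameter groups: task_layer_lr is treated as a
# multiplier on the base learning rate (an assumption inferred from the
# linked snippet; the module names below are placeholders).
encoder = torch.nn.Linear(16, 16)     # stands in for the BERT encoder
task_head = torch.nn.Linear(16, 33)   # stands in for the span-extraction head

base_lr = 2e-5
task_layer_lr = 20                    # a multiplier, not an absolute rate

optimizer = torch.optim.AdamW([
    {"params": encoder.parameters(), "lr": base_lr},
    {"params": task_head.parameters(), "lr": base_lr * task_layer_lr},  # 4e-4
])
print([g["lr"] for g in optimizer.param_groups])  # [2e-05, 0.0004]
```

With task_layer_lr=2e-4 the task head's effective rate would be 4e-9, which matches the symptom reported above (loss stuck around 6, dev F1 at 0).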

Senwang98 commented 2 years ago

@yc1999 Could you tell me how to configure the NER task so that it runs? I'm confused by all the command-line arguments, and I get errors whenever I try.

MoDawn commented 1 year ago

> @yc1999 Could you tell me how to configure the NER task so that it runs? I'm confused by all the command-line arguments, and I get errors whenever I try.

Did you ever get it running? I don't understand these arguments either.