Closed yc1999 closed 2 years ago
I remember writing the multi-GPU code back when the GPUs were relatively free. It uses DP (DataParallel), which is why padding_to_max is needed. Later there were many experiments and not that many GPUs, so I never converted it to DDP and mainly relied on gradient accumulation instead.
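The connection between DP and padding_to_max can be sketched as follows. This is an illustration, not the repo's code: `pad_to_max` and `MAX_LEN` are hypothetical names. `nn.DataParallel` scatters the batch along dim 0, so when downstream code reshapes with hard-coded sizes, every batch must share one fixed sequence length, hence padding to a global max:

```python
import torch

# Hypothetical helper (not from the LEAR repo): right-pad every batch to one
# global MAX_LEN so that all DataParallel replicas see identically shaped
# slices, even when the per-batch longest sequence varies.
MAX_LEN = 128  # "padding_to_max": a single fixed length for all batches

def pad_to_max(token_ids, max_len=MAX_LEN, pad_id=0):
    """Right-pad a list of variable-length token-id sequences to max_len."""
    out = torch.full((len(token_ids), max_len), pad_id, dtype=torch.long)
    for i, seq in enumerate(token_ids):
        out[i, : len(seq)] = torch.tensor(seq[:max_len], dtype=torch.long)
    return out

batch = pad_to_max([[101, 2023, 102], [101, 2003, 1037, 3231, 102]])
print(batch.shape)  # torch.Size([2, 128])
```

With DDP each process builds its own batches, so dynamic (per-batch) padding would be safe; under DP the fixed global length is the simpler guarantee.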
That result looks wrong; I suggest checking it yourself. Pay attention to this remark: "Note: The thunlp has updated the repo HMEAE recently, which causing the mismatch of data. Make sure you use the earlier version for ED task."
The gain on event detection is small: with the old HMEAE data split, plain BIO tagging already reaches an F1 above 81, sometimes even 82.
@huanghuidmml As for this, I have never been able to get results that high myself.
Hi~
- I'd like to ask: has anyone run the event detection code on multiple GPUs? Running with 4 GPUs, I got the error below:
```
Training: 0%| | 0/1159 [00:22<?, ?it/s]
Traceback (most recent call last):
  File "run_trigger_extraction.py", line 405, in <module>
    main()
  File "run_trigger_extraction.py", line 378, in main
    train(args, model, processor)
  File "run_trigger_extraction.py", line 243, in train
    pred_sub_heads, pred_sub_tails = model(data, add_label_info=add_label_info)
  File "/home/yc21/software/anaconda3/envs/lear/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yc21/software/anaconda3/envs/lear/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/yc21/software/anaconda3/envs/lear/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/yc21/software/anaconda3/envs/lear/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/yc21/software/anaconda3/envs/lear/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.

Original Traceback (most recent call last):
  File "/home/yc21/software/anaconda3/envs/lear/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/yc21/software/anaconda3/envs/lear/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yc21/project/LEAR/models/model_event.py", line 635, in forward
    fused_results = self.label_fusing_layer(
  File "/home/yc21/software/anaconda3/envs/lear/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yc21/project/LEAR/utils/model_utils.py", line 320, in forward
    return self.get_fused_feature_with_attn(token_embs, label_embs, input_mask, label_input_mask, return_scores=return_scores)
  File "/home/yc21/project/LEAR/utils/model_utils.py", line 504, in get_fused_feature_with_attn
    scores = torch.matmul(token_feature_fc, label_feature_t).view(
RuntimeError: shape '[4, 48, 33, -1]' is invalid for input of size 160512
```
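The final RuntimeError can be reproduced in isolation. This is a minimal sketch, not the repo's actual tensors: `.view(b, n, l, -1)` fails whenever the element count is not divisible by b*n*l, and 160512 is not divisible by 4*48*33 = 6336:

```python
import torch

# Minimal reproduction of the view error from the traceback. The tensor
# content is irrelevant; only the element count matters.
x = torch.zeros(160512)      # same element count as in the error message
try:
    x.view(4, 48, 33, -1)    # 4 * 48 * 33 = 6336; 160512 / 6336 is not an integer
except RuntimeError as e:
    print(e)                 # shape '[4, 48, 33, -1]' is invalid for input of size 160512
```

Under DataParallel the batch is scattered across replicas, so a view that assumes the full batch size or a globally fixed sequence length can stop matching the per-replica tensor; this is consistent with the author's note above that the DP code path depends on padding_to_max.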
- On a single GPU with batchsize=2 and gradient_accumulation_step=16, it runs, but the training loss converges to 6 and the dev F1 stays at 0. I don't know whether this is just my setup, or whether others have hit the same problem. Has anyone reproduced the reported results?
Thanks, everyone~
The cause of the second problem was my --task_layer_lr argument: I had set it to 2e-4, but it should be 20. Here is why:
https://github.com/Akeepers/LEAR/blob/8ae3ed0ae6fa69a85872395d6cbbbf40a55f1d27/run_trigger_extraction.py#L160-L173
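Judging from the fix (2e-4 wrong, 20 right), --task_layer_lr appears to act as a multiplier on the base learning rate rather than as an absolute rate. A sketch under that assumption, with illustrative module names (`encoder`, `task_head`) that are not from the repo:

```python
import torch

# Assumption (suggested by the linked lines): the task-layer learning rate is
# base_lr * task_layer_lr, so task_layer_lr is a multiplier, not an absolute lr.
base_lr = 1e-5
task_layer_lr = 20  # effective task-layer lr = 20 * 1e-5 = 2e-4

encoder = torch.nn.Linear(8, 8)    # stand-in for the pretrained encoder
task_head = torch.nn.Linear(8, 2)  # stand-in for the task-specific layers

optimizer = torch.optim.AdamW([
    {"params": encoder.parameters(), "lr": base_lr},
    {"params": task_head.parameters(), "lr": base_lr * task_layer_lr},
])
print([g["lr"] for g in optimizer.param_groups])  # [1e-05, 0.0002]
```

Under this reading, passing 2e-4 directly would make the effective task-layer rate 2e-4 * 1e-5 = 2e-9, far too small for the task head to learn anything, which would explain the dev F1 staying at 0.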
@yc1999 Could I ask how to configure the NER task so it runs? The many command-line arguments are confusing, and it errors out when I run it.
Did you ever get it running? I don't understand these arguments either...