Coopercoppers / PFN

EMNLP 2021 - A Partition Filter Network for Joint Entity and Relation Extraction
MIT License

RuntimeError: CUDA error: device-side assert triggered #28

Open wxqyyds opened 5 months ago

wxqyyds commented 5 months ago

Hi, I've recently been trying to combine this method with semi-supervised learning for joint entity and relation extraction. I split the dataset into labeled and unlabeled parts, but during training I keep getting errors such as:

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

This happens when I run only the labeled-training part of the code, and it also happens when I run the whole framework. My labeled-training code is almost unmodified from the original source, so why does this error occur? Part of the code is below (a debugging sketch follows it):

if args.do_train:
    logger.info("------Training------")
    if args.embed_mode == "albert":
        input_size = 4096
    else:
        input_size = 768

    model = PFN(args, input_size, ner2idx, rel2idx)
    model.to(device)

    optimizer = optim.Adam(model.parameters(), lr=args.lr, weight_decay=args.weight_decay)

    if args.eval_metric == "micro":
        metric = micro(rel2idx, ner2idx)
    else:
        metric = macro(rel2idx, ner2idx)

    BCEloss = loss()
    best_result = 0
    triple_best = None
    entity_best = None

    for epoch in range(args.epoch):
        steps, train_loss, loss_unlabeled, loss_labeled = 0, 0, 0, 0
        file_num = 1
        model.train()        
        for labeled_data in tqdm(labeled_batch):
            steps += 1
            optimizer.zero_grad()

            # labeled data
            text = labeled_data[0]
            ner_label = labeled_data[1].to(device)
            re_label = labeled_data[2].to(device)
            mask = labeled_data[-1].to(device)

            ner_pred, re_pred = model(text, mask)
            labeled_loss = BCEloss(ner_pred, ner_label, re_pred, re_label)
            labeled_loss.backward()
            train_loss += labeled_loss.item()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=args.clip)
            optimizer.step()

            if steps % args.steps == 0:
                logger.info("Epoch: {}, step: {} / {}, train_loss = {:.4f}".format
                            (epoch, steps, len(labeled_batch), (train_loss) / steps))

        logger.info("------ Training Set Results ------")
        logger.info("loss : {:.4f}".format((train_loss) / steps))