Hi there 👋. Thanks for stopping by this repo. This project aims to build a universal toolkit for automatically extracting events from documents 📄 (long texts).
The details can be found in our paper: Tong Zhu, Xiaoye Qu, Wenliang Chen, Zhefeng Wang, Baoxing Huai, Nicholas Yuan, Min Zhang. Efficient Document-level Event Extraction via Pseudo-Trigger-aware Pruned Complete Graph. Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence Main Track (IJCAI'22). Pages 4552-4558.
🔥 We have an online demo [here] (available 9:00-17:00 UTC+8).
Currently, this repo contains the PTPCG, Doc2EDAG and GIT models, all of which are designed for document-level event extraction without triggers.
Here are some basic descriptions to help you understand the characteristics of each model:
Make sure you have the following dependencies installed.
gpu-watchmen
# don't forget to install the dee package
$ git clone https://github.com/Spico197/DocEE.git
$ cd DocEE
$ pip install -e .
# or install directly from git
$ pip install git+https://github.com/Spico197/DocEE.git
# ChFinAnn
## You can download Data.zip from the original repo: https://github.com/dolphin-zs/Doc2EDAG
$ unzip Data.zip
$ cd Data
# generate data with doc type (o2o, o2m, m2m) for better evaluation
$ python stat.py
# DuEE-fin
## If you want to do well in the competition, you should check the code and make further modifications,
## since each role may refer to multiple entities in DuEE-fin.
## PTPCG can handle this situation; you only need to check the data preprocessing
## and the `predict_span_role()` method in `event_table.py`.
## We **do not** use such tricks in the paper, to keep the comparison with Doc2EDAG and GIT fair.
$ # download the datasets from https://aistudio.baidu.com/aistudio/competition/detail/65
$ cd Data/DuEEData # paste train.json and dev.json into Data/DuEEData folder and run:
$ python build_data.py
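The doc types that `stat.py` attaches (o2o, o2m, m2m) stand for one-type one-instance per doc, one-type with multiple instances per doc, and multiple types per doc. The sketch below only illustrates that definition; it is not the logic in `stat.py`, and the function name `classify_doc_type` and its list-of-type-names input are assumptions made for this example.

```python
from collections import Counter
from typing import List


def classify_doc_type(event_types: List[str]) -> str:
    """Return 'o2o', 'o2m' or 'm2m' for one document.

    `event_types` is a hypothetical input: one type name per annotated
    event instance in the document, e.g. ["EquityPledge", "EquityPledge"].
    """
    if not event_types:
        raise ValueError("a document needs at least one event instance")
    counts = Counter(event_types)
    if len(counts) > 1:
        return "m2m"  # multiple event types in one document
    if len(event_types) > 1:
        return "o2m"  # one event type, several instances
    return "o2o"  # one event type, one instance


if __name__ == "__main__":
    print(classify_doc_type(["EquityPledge"]))                  # o2o
    print(classify_doc_type(["EquityPledge", "EquityPledge"]))  # o2m
    print(classify_doc_type(["EquityPledge", "EquityFreeze"]))  # m2m
```

Splitting the evaluation by these three groups makes it easier to see whether a model struggles mainly with multi-instance or multi-type documents.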
Doc2EDAG and GIT are already integrated in this repo, and more models are planned to be added.
If you want to reproduce the PTPCG results, or run other trials, please follow the instructions below.
Before running any bash script, please ensure `bert_model` has been correctly set.
Tip: At least 4 NVIDIA V100 GPUs (at least 16GB each) are required to run Doc2EDAG models.
# run on ChFinAnn dataset
$ nohup bash scripts/run_doc2edag.sh 1>Logs/Doc2EDAG_reproduction.log 2>&1 &
$ tail -f Logs/Doc2EDAG_reproduction.log
# run on DuEE-fin dataset without trigger
$ nohup bash scripts/run_doc2edag_dueefin.sh 1>Logs/Doc2EDAG_DuEE_fin.log 2>&1 &
$ tail -f Logs/Doc2EDAG_DuEE_fin.log
# run on DuEE-fin dataset with trigger
$ nohup bash scripts/run_doc2edag_dueefin_withtgg.sh 1>Logs/Doc2EDAG_DuEE_fin_with_trigger.log 2>&1 &
$ tail -f Logs/Doc2EDAG_DuEE_fin_with_trigger.log
Tip: At least 4 NVIDIA V100 GPUs (32GB each) are required to run GIT models.
# run on ChFinAnn dataset
$ nohup bash scripts/run_git.sh 1>Logs/GIT_reproduction.log 2>&1 &
$ tail -f Logs/GIT_reproduction.log
# run on DuEE-fin dataset without trigger
$ nohup bash scripts/run_git_dueefin.sh 1>Logs/GIT_DuEE_fin.log 2>&1 &
$ tail -f Logs/GIT_DuEE_fin.log
# run on DuEE-fin dataset with trigger
$ nohup bash scripts/run_git_dueefin_withtgg.sh 1>Logs/GIT_DuEE_fin_with_trigger.log 2>&1 &
$ tail -f Logs/GIT_DuEE_fin_with_trigger.log
Tip: At least one 1080Ti card (at least 9GB) is required to run PTPCG.
Default: |R| = 1, which means only the first (pseudo) trigger is selected.
# run on ChFinAnn dataset (to reproduce |R|=1 results in Table 1 of the PTPCG paper)
$ nohup bash scripts/run_ptpcg.sh 1>Logs/PTPCG_R1_reproduction.log 2>&1 &
$ tail -f Logs/PTPCG_R1_reproduction.log
# run on DuEE-fin dataset without annotated trigger (to reproduce |R|=1, Tgg=× results in Table 3 of the PTPCG paper)
$ nohup bash scripts/run_ptpcg_dueefin.sh 1>Logs/PTPCG_P1-DuEE_fin.log 2>&1 &
$ tail -f Logs/PTPCG_P1-DuEE_fin.log
# run on DuEE-fin dataset with annotated trigger and without pseudo trigger (to reproduce |R|=0, Tgg=√ results in Table 3 of the PTPCG paper)
$ nohup bash scripts/run_ptpcg_dueefin_withtgg.sh 1>Logs/PTPCG_T1-DuEE_fin.log 2>&1 &
$ tail -f Logs/PTPCG_T1-DuEE_fin.log
# run on DuEE-fin dataset with annotated trigger and one pseudo trigger (to reproduce |R|=1, Tgg=√ results in Table 3 of the PTPCG paper)
$ nohup bash scripts/run_ptpcg_dueefin_withtgg_withptgg.sh 1>Logs/PTPCG_P1T1-DuEE_fin.log 2>&1 &
$ tail -f Logs/PTPCG_P1T1-DuEE_fin.log
| #PseudoTgg | Setting | Log | Task Dump |
|---|---|---|---|
| 1 | 189Cloud | 189Cloud | 189Cloud |
Explanations of the PTPCG hyperparameters in the executable script:
# whether to use max clique decoding strategy, brute-force if set to False
max_clique_decode = True
# number of triggers during training; to treat all arguments as pseudo triggers, set a higher number such as `10`
num_triggers = 1
# number of triggers during evaluation; set to `-1` to treat all arguments as pseudo triggers
eval_num_triggers = 1
# put additional pseudo triggers into the graph, make full use of the pseudo triggers
with_left_trigger = True
# make the trigger graph directed
directed_trigger_graph = True
# run mode is used in `dee/tasks/dee_task.py/DEETaskSetting`
run_mode = 'full'
# at least one combination (see paper for more information)
at_least_one_comb = True
# whether to include regex matched entities
include_complementary_ents = True
# event schemas, check `dee/event_types` for all supported schemas
event_type_template = 'zheng2019_trigger_graph'
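To give a feel for what a max clique decoding strategy involves, here is a generic Bron-Kerbosch sketch that enumerates the maximal cliques of a small undirected graph. It is an illustration under stated assumptions, not the repository's decoder: the function name `maximal_cliques`, the adjacency-dict input and the toy graph are all made up for this example, and the real code may represent and prune the graph differently.

```python
from typing import Dict, List, Set


def maximal_cliques(adj: Dict[int, Set[int]]) -> List[Set[int]]:
    """Enumerate all maximal cliques of an undirected graph (Bron-Kerbosch with pivoting).

    `adj` maps every node to the set of its neighbours (hypothetical input format).
    """
    cliques: List[Set[int]] = []

    def expand(r: Set[int], p: Set[int], x: Set[int]) -> None:
        # r: current clique, p: candidates that can extend it, x: already explored nodes
        if not p and not x:
            cliques.append(set(r))
            return
        # choose the pivot with the most neighbours among the candidates to skip branches
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p.remove(v)
            x.add(v)

    expand(set(), set(adj), set())
    return cliques


if __name__ == "__main__":
    # toy graph: nodes 0, 1, 2 form a triangle; nodes 3 and 4 share one edge
    toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4}, 4: {3}}
    for clique in maximal_cliques(toy):
        print(sorted(clique))
```

In this toy graph each clique can be read as one candidate argument combination, which avoids enumerating every subset of nodes as a brute-force search would.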
Please check `Data/trigger.py` for more details.
In general, you should first convert your data into an acceptable format (like `typed_train.json` after building ChFinAnn).
Then, you can run the command below to generate event schemas with pseudo triggers and importance scores:
$ cd Data
$ python trigger.py <max number of pseudo triggers>
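As a rough picture of how importance scores could be turned into pseudo triggers, the sketch below ranks roles by score and keeps the first few, treating `-1` as "keep every role" in the same spirit as `eval_num_triggers`. This is not what `Data/trigger.py` does internally: the function name `select_pseudo_triggers`, the role names and the scores are all hypothetical values chosen for illustration.

```python
from typing import Dict, List


def select_pseudo_triggers(importance: Dict[str, float], num_triggers: int) -> List[str]:
    """Return the `num_triggers` roles with the highest importance score.

    `num_triggers == -1` keeps every role (assumed convention for this sketch).
    """
    if num_triggers < -1:
        raise ValueError("num_triggers must be -1 or a non-negative integer")
    # sorted() is stable, so roles with equal scores keep their original order
    ranked = sorted(importance, key=lambda role: importance[role], reverse=True)
    if num_triggers == -1:
        return ranked
    return ranked[:num_triggers]


if __name__ == "__main__":
    # made-up scores for three roles of one event type
    scores = {"Pledgee": 0.7, "Pledger": 0.9, "StartDate": 0.4}
    print(select_pseudo_triggers(scores, 1))   # ['Pledger']
    print(select_pseudo_triggers(scores, -1))  # ['Pledger', 'Pledgee', 'StartDate']
```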
`dee` has evolved into a toolkit package; make sure to install it first: `pip install -e .`
Generate the `typed_(train|dev|test).json` files first via `cd Data && python stat.py` after unzipping `Data.zip` into the `Data` folder.
[...] `--parallel_decorate` flag after `python run_dee_task.py`.
Comments marked `tzhu` were added by Tong Zhu to help readers understand the code; they are not in the original Doc2EDAG repo.
Use `dueefin_post_process.py` for further post-processing to meet the format requirements.
What is `teacher_prob` doing?
`teacher_prob` is the probability of using `gold_span`. If `teacher_prob == 0.7`, there is a 70% probability of using `gold_span` during training. `teacher_prob` decreases as training proceeds.
[...] `Greedy` method.
[...] `dee/tasks/dee_task.py/DEETask.predict_one()` (convenient online serving interface).
[...] `inference.py`. Change the settings on lines 8, 9 and 12, then run `CUDA_VISIBLE_DEVICES="<cuda device, could be empty to use cpu>" python inference.py` for a quick start.
The `dee_task.predict_one` API in `inference.py` is ONLY meant for simple inference. Because of some inconsistencies in processing the raw data (mostly the sentence segmentation behaviour), it is not suitable for reproducing the experimental results on ChFinAnn. To reproduce the results reported in our paper, please follow the instructions here.
What are `o2o`, `o2m` and `m2m`?
They stand for one-type one-instance per doc, one-type with multiple instances per doc, and multiple types per doc.
[...] `Exps/<task_name>/Output/dee_eval.(dev|test).(pred|gold)_span.<model_name>.<epoch>.json`, what do those mean?
See the Evaluation section of the documents, or refer to #7.
This work has been accepted to IJCAI'22. Please cite the paper if you use PTPCG or this repository in your research. Thank you very much 😉
@inproceedings{ijcai2022p632,
title = {Efficient Document-level Event Extraction via Pseudo-Trigger-aware Pruned Complete Graph},
author = {Zhu, Tong and Qu, Xiaoye and Chen, Wenliang and Wang, Zhefeng and Huai, Baoxing and Yuan, Nicholas and Zhang, Min},
booktitle = {Proceedings of the Thirty-First International Joint Conference on
Artificial Intelligence, {IJCAI-22}},
publisher = {International Joint Conferences on Artificial Intelligence Organization},
editor = {Lud De Raedt},
pages = {4552--4558},
year = {2022},
month = {7},
note = {Main Track},
doi = {10.24963/ijcai.2022/632},
url = {https://doi.org/10.24963/ijcai.2022/632},
}
MIT Licence
[...] `doc_type` bug (#60)
[...] `List[str]` in `dee_task.predict_one` in case of any misunderstanding. Change behaviour of `event_role_embed` in `dee/models/deppn.py/SetPred4DEE.forward()`
[...] `DEPPNModel` (beta), change `luge_*` templates into `dueefin_*`, add `OtherType` as default `common_fields` in `dueefin_(w|wo)_tgg` templates, add `isort` tool to help formatting
[...] `LSTMMTL2EDAGModel`, `EventTableForIndependentTypeCombination`, `DEEMultiStepTriggeringFeatureConverter` and `DEEMultiStepTriggeringFeature`, which are redundant. Update test cases via `zheng2019_trigger_graph` schema. Code is formatted by `black`.
This repo is still under development. If you find any bugs, don't hesitate to drop us an issue.
Thanks~