aistairc / DeepEventMine

DeepEventMine: End-to-end Neural Nested Event Extraction from Biomedical Texts
Apache License 2.0

1. DeepEventMine

A deep learning model to predict named entities, triggers, and nested events from biomedical texts.

DeepEventMine: End-to-end Neural Nested Event Extraction from Biomedical Texts, Bioinformatics, 2020.

1.1. Features

1.2. Tasks

  1. cg: Cancer Genetics (CG), 2013
  2. ge11: GENIA Event Extraction (GENIA), 2011
  3. ge13: GENIA Event Extraction (GENIA), 2013
  4. id: Infectious Diseases (ID), 2011
  5. epi: Epigenetics and Post-translational Modifications (EPI), 2011
  6. pc: Pathway Curation (PC), 2013
  7. mlee: Multi-Level Event Extraction (MLEE)

1.3. Our trained models and scores

2. Preparation

2.1. Requirements

virtualenv -p python3 pytorch-env
source pytorch-env/bin/activate
export CUDA_VISIBLE_DEVICES=0
CUDA_PATH=/usr/local/cuda pip install torch==1.1.0 torchvision==0.3.0
sh install.sh

2.2. BERT

sh download.sh bert

2.3. DeepEventMine

sh download.sh deepeventmine [task]

2.4. Brat

sh download.sh brat

3. Predict (BioNLP tasks)

3.1. Prepare data

  1. Download corpora
    • Download the original data sets from the BioNLP shared tasks.
    • [task] = cg, pc, ge11, etc.
      sh download.sh bionlp [task]
  2. Preprocess data
    • Tokenize texts and prepare data for prediction
      sh preprocess.sh bionlp
  3. Generate configs
    • If using GPU: [gpu] = 0, otherwise: [gpu] = -1
    • [task] = cg, pc, etc.
      sh run.sh config [task] [gpu]

3.2. Predict

  1. For development and test sets (given gold entities)
    • CG task: [task] = cg
    • PC task: [task] = pc
    • Similarly for: ge11, ge13, epi, id, mlee
    sh run.sh predict [task] gold dev
    sh run.sh predict [task] gold test

3.3. Evaluate

  1. Retrieve the original offsets and create zip format

    sh run.sh offset [task] gold dev
    sh run.sh offset [task] gold test
  2. Submit the zipped file to the shared task evaluation sites:

  3. Evaluate events
    sh run.sh eval [task] gold dev sp
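
The commands in section 3 always run in the same order for a given task, so it can help to generate them in one place. The following is a minimal sketch, not part of the repository: `bionlp_commands` and `TASKS` are hypothetical names, and the script only builds the command lines exactly as they are written in this README without executing them. The manual submission step in 3.3 is left out.

```python
import shlex

# Task identifiers as listed in section 1.2 of this README.
TASKS = ("cg", "ge11", "ge13", "id", "epi", "pc", "mlee")


def bionlp_commands(task, use_gpu=True):
    """Return the section 3 commands for one task, in README order.

    Hypothetical helper: it only assembles argument lists and runs nothing.
    """
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {TASKS}")
    gpu = 0 if use_gpu else -1  # README: [gpu] = 0 with a GPU, -1 otherwise
    commands = [
        ["sh", "download.sh", "bionlp", task],      # 3.1 step 1
        ["sh", "preprocess.sh", "bionlp"],          # 3.1 step 2
        ["sh", "run.sh", "config", task, str(gpu)], # 3.1 step 3
    ]
    # 3.2 (predict) and 3.3 step 1 (offset), each on the dev and test sets.
    for step in ("predict", "offset"):
        for split in ("dev", "test"):
            commands.append(["sh", "run.sh", step, task, "gold", split])
    # 3.3 step 3: evaluate events on the development set.
    commands.append(["sh", "run.sh", "eval", task, "gold", "dev", "sp"])
    return commands


if __name__ == "__main__":
    for cmd in bionlp_commands("cg", use_gpu=False):
        print(shlex.join(cmd))
```

To actually run the steps, each list could be passed to `subprocess.run(cmd, check=True)` from the repository root, so that the first failing step stops the sequence.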

4. End-to-end

4.1. Input: a single PMID or PMCID

T24 Organism 1248 1254  bovine
T25 Gene_or_gene_product 1255 1259  u-PA
T55 Positive_regulation 1107 1116   increased
T57 Localization 1170 1179  migration
T58 Negative_regulation 1260 1267   blocked
...

T23 Gene_or_gene_product 1184 1188  u-PA
T56 Positive_regulation 1157 1166   increases
E9  Positive_regulation:T56 Theme:T23

T26 Gene_or_gene_product 1320 1325  c-src
T62 Gene_expression 1326 1336   expression
E10 Gene_expression:T62 Theme:T26

T61 Positive_regulation 1310 1319   increased
E24 Positive_regulation:T61 Theme:E10
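
The output above uses brat standoff annotations: `T` lines bind a type to a character span, and `E` lines link a trigger to its arguments, which may themselves be events (E24 takes E10 as its Theme). The sketch below shows one way to read such lines and unfold a nested event. It is an illustration only: `parse_brat` and `render` are hypothetical helpers, not part of DeepEventMine, and the parser assumes contiguous spans and whitespace-separated fields.

```python
def parse_brat(lines):
    """Split brat standoff lines into text-bound annotations (T*) and events (E*)."""
    spans, events = {}, {}
    for raw in lines:
        parts = raw.split()
        if len(parts) < 2:
            continue  # skip blank lines and elisions such as "..."
        ann_id = parts[0]
        if ann_id.startswith("T") and len(parts) >= 5:
            # T<id> <type> <start> <end> <covered text>
            spans[ann_id] = {
                "type": parts[1],
                "start": int(parts[2]),
                "end": int(parts[3]),
                "text": " ".join(parts[4:]),
            }
        elif ann_id.startswith("E"):
            # E<id> <type>:<trigger id> <role>:<argument id> ...
            ev_type, trigger = parts[1].split(":", 1)
            args = [tuple(p.split(":", 1)) for p in parts[2:]]
            events[ann_id] = {"type": ev_type, "trigger": trigger, "args": args}
    return spans, events


def render(ann_id, spans, events):
    """Render an annotation as text, recursing into events used as arguments."""
    if ann_id in spans:
        return spans[ann_id]["text"]
    event = events[ann_id]
    inner = ", ".join(
        f"{role}={render(arg, spans, events)}" for role, arg in event["args"]
    )
    return f"{event['type']}[{spans[event['trigger']]['text']}]({inner})"


# Lines copied from the example output above.
SAMPLE = """\
T26 Gene_or_gene_product 1320 1325  c-src
T62 Gene_expression 1326 1336   expression
E10 Gene_expression:T62 Theme:T26
T61 Positive_regulation 1310 1319   increased
E24 Positive_regulation:T61 Theme:E10
""".splitlines()

spans, events = parse_brat(SAMPLE)
print(render("E24", spans, events))
# prints: Positive_regulation[increased](Theme=Gene_expression[expression](Theme=c-src))
```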



## 4.2. Input: a list of PMIDs

- Choose an arbitrary name for your raw text data, for example "my-pubmed"
- Prepare a list of PMIDs and PMCIDs in the path

```bash
data/my-pubmed/pmid.txt
```

```bash
sh pubmed.sh e2e pmids my-pubmed cg 0
```

## 4.3. Input: raw text files

- Choose an arbitrary name for your raw text data, for example "my-pubmed"
- Prepare your raw text files in the path

```bash
data/my-pubmed/text/PMID-*.txt
data/my-pubmed/text/PMC-*.txt
```

```bash
sh pubmed.sh e2e rawtext my-pubmed cg 0
```

# 5. Predict for new data (step-by-step)

- Input: your own raw text or PubMed ID
- Output: predicted entities and events in brat format

## 5.1. Raw text

- Choose an arbitrary name for your raw text data, for example "my-pubmed"
- Prepare your own raw text in the following path

```bash
data/my-pubmed/text/PMID-*.txt
data/my-pubmed/text/PMC-*.txt
```

## 5.2. PubMed ID

- Alternatively, you can fetch the raw text automatically from a PubMed ID or PMC ID

### Get raw text

1. PubMed ID list
    - To get the full text for a PMC ID, the text must be available in ePub format (in the current version).
    - Prepare your list of PubMed IDs and PMC IDs in the path

    ```bash
    data/my-pubmed/pmid.txt
    ```

    - Get the text for the listed IDs

    ```bash
    sh pubmed.sh pmids my-pubmed
    ```

2. PubMed ID
    - You can also get the text by passing a PubMed ID or PMC ID directly

    ```bash
    sh pubmed.sh pmid 1370299
    sh pubmed.sh pmcid PMC4353630
    ```

### Preprocess

```bash
sh pubmed.sh preprocess my-pubmed
```

## 5.3. Predict

1. Generate config
    - Generate the config for prediction
    - The data name to predict: my-pubmed
    - The trained model used for prediction: cg (or pc, ge11, etc.)
    - If you use a GPU, [gpu]=0; otherwise [gpu]=-1

    ```bash
    sh pubmed.sh config my-pubmed cg 0
    ```

2. Predict

    ```bash
    sh pubmed.sh predict my-pubmed
    ```

3. Retrieve the original offsets

    ```bash
    sh pubmed.sh offset my-pubmed
    ```

    - Check the output in

    ```bash
    experiments/my-pubmed/results/ev-last/my-pubmed-brat
    ```

# 6. Visualization

## 6.1. Prepare data

- Copy the predicted data into the brat folder to visualize
- For the raw text prediction:

```bash
sh pubmed.sh brat my-pubmed cg
```

- Or for the shared task:

```bash
sh run.sh brat [task] gold dev
sh run.sh brat [task] gold test
```

## 6.2. Visualize

- The data to visualize is located in

```bash
brat/brat-v1.3_Crunchy_Frog/data/my-pubmed-brat
brat/brat-v1.3_Crunchy_Frog/data/[task]-brat
```

# 7. Acknowledgements

This work is based on results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO). This work is also supported by PRISM (Public/Private R&D Investment Strategic Expansion PrograM).

# 8. Citation

```bibtex
@article{10.1093/bioinformatics/btaa540,
    author = {Trieu, Hai-Long and Tran, Thy Thy and Duong, Khoa N A and Nguyen, Anh and Miwa, Makoto and Ananiadou, Sophia},
    title = "{DeepEventMine: End-to-end Neural Nested Event Extraction from Biomedical Texts}",
    journal = {Bioinformatics},
    year = {2020},
    month = {06},
    issn = {1367-4803},
    doi = {10.1093/bioinformatics/btaa540},
    url = {https://doi.org/10.1093/bioinformatics/btaa540},
    note = {btaa540},
    eprint = {https://academic.oup.com/bioinformatics/article-pdf/doi/10.1093/bioinformatics/btaa540/33399046/btaa540.pdf},
}
```