Manually annotated corpus for DFI extraction

The DFI (drug-food interaction) corpus is the largest manually annotated corpus of its kind, consisting of 2271 abstracts of biomedical articles from PubMed, built for developing NLP models that extract DFI information. We introduce this manually annotated corpus for extracting DFI information from abstracts of biomedical articles and propose the ‘DFI key-sentence’ as a target entity for DFI extraction. To the best of our knowledge, it is the first manually annotated dataset for extracting DFIs from biomedical articles, and the largest and most comprehensive dataset for drug interaction extraction, including drug-drug interaction (DDI) datasets.

Distribution of evidence-level and named entities of the `DFI corpus`

Table 1. Distribution of the annotated evidence-levels in the `DFI corpus`. Values are numbers of abstracts, with the percentage of the column total in parentheses.

| Evidence-level | Training | Development | Test |
|---|---|---|---|
| 'clinical trial' | 116 (7.30) | 33 (7.24) | 16 (7.08) |
| 'observational study' | 78 (4.91) | 23 (5.04) | 11 (4.87) |
| 'case report' | 30 (1.89) | 9 (1.97) | 4 (1.77) |
| 'in-vivo study' | 547 (34.42) | 157 (34.42) | 78 (34.51) |
| 'in-vitro study' | 477 (30.02) | 137 (30.04) | 68 (30.09) |
| 'bioanalysis' | 91 (5.73) | 26 (5.70) | 13 (5.75) |
| 'others' | 250 (15.73) | 71 (15.57) | 36 (15.93) |
| total | 1589 (100.0) | 456 (100.0) | 226 (100.0) |
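
The percentages in Table 1 can be reproduced from the raw counts. The following is a minimal sketch, assuming each percentage is the count divided by the sum of its column; the names `train_counts` and `percentages` are illustrative and not part of the repository.

```python
# Training-split counts copied from Table 1 (number of abstracts per evidence level).
train_counts = {
    "clinical trial": 116,
    "observational study": 78,
    "case report": 30,
    "in-vivo study": 547,
    "in-vitro study": 477,
    "bioanalysis": 91,
    "others": 250,
}


def percentages(counts):
    """Return each count as a percentage of the column total, rounded to 2 places."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


train_total = sum(train_counts.values())
train_pct = percentages(train_counts)

for label, pct in train_pct.items():
    print(f"{label}: {train_counts[label]} ({pct:.2f})")
print(f"total: {train_total}")
```

Summing the three columns the same way gives 1589, 456, and 226 abstracts, which together account for the 2271 abstracts in the corpus.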

Table 2. Distribution of the annotated entity types in the `DFI corpus`. Values are counts, with the percentage of the total row in parentheses.

| Entity type | Training | Development | Test |
|---|---|---|---|
| 'drug' | 5632 (1.46) | 1669 (1.48) | 787 (1.43) |
| 'food' | 9384 (2.44) | 2621 (2.33) | 1348 (2.45) |
| 'food component' | 902 (0.23) | 377 (0.34) | 63 (0.11) |
| 'ambiguous' | 452 (0.12) | 118 (0.10) | 153 (0.28) |
| 'well known target' | 6065 (1.58) | 1723 (1.53) | 679 (1.24) |
| 'drug metabolizer' | 697 (0.18) | 176 (0.16) | 125 (0.23) |
| 'drug transporter' | 288 (0.07) | 113 (0.10) | 14 (0.03) |
| total | 384965 (100.0) | 112485 (100.0) | 54921 (100.0) |

Performance of BERT models trained on `DFI corpus`

Table 3. Performance score (F1) of BERT models for DFI extraction tasks

| Classification tasks | Base-BERT | BioBERT | PubMedBERT | |
|---|---|---|---|---|
| Key-sentence classification | 82.0 | 82.6 | 85.1 | 81.4 |
| Evidence level annotation, weighted F1 score | 70.6 | 72.8 | 70.4 | 67.3 |
| Evidence level annotation, macro F1 score | 61.9 | 65.6 | 63.1 | 53.6 |
| Named entity recognition, weighted F1 score | 80.0 | 83.1 | 83.8 | 79.3 |
| Named entity recognition, macro F1 score | 83.1 | 85.2 | 86.1 | 82.3 |
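
Table 3 reports both weighted and macro F1 scores. The sketch below illustrates the difference under the standard definitions: macro F1 is the unweighted mean of the per-class F1 scores, while weighted F1 averages them by the number of true instances of each class. It is a plain-Python illustration with made-up labels, not the evaluation code behind the table, and the README does not state how the reported scores were computed (for example, whether named entity recognition was scored per token or per entity).

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Compute the F1 score of every label that occurs in y_true or y_pred."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by the number of true instances per class."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Hypothetical toy labels: a frequent class predicted well, a rare class missed.
y_true = ["in-vivo study"] * 4 + ["case report"]
y_pred = ["in-vivo study"] * 5

toy_macro = macro_f1(y_true, y_pred)
toy_weighted = weighted_f1(y_true, y_pred)
print(f"macro F1 = {toy_macro:.3f}, weighted F1 = {toy_weighted:.3f}")
```

In this toy case the missed rare class pulls macro F1 well below weighted F1, the same direction as the gap in the evidence level annotation rows, where rare evidence levels such as 'case report' have few examples.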

ccadd-snu / corpus-for-DFI-extraction
