Question about preprocessing of GENIA dataset

dwadden / dygiepp

Span-based system for named entity, relation, and event extraction.

MIT License


Issue #119, closed by serenalotreck 1 year ago

serenalotreck commented 1 year ago

Apologies if this is already described somewhere and I missed it.

The original GENIA dataset described in the Kim et al. 2003 paper contained 47 entity types. The processed version used with DyGIE++ has 5. Was that a result of pre-processing, and if so, why were the other types excluded?

Thanks!

serenalotreck commented 1 year ago

Aha, found the explanation in the Labeling Gaps Between Words paper: "We made the same modifications as described by Finkel and Manning (2009) by collapsing all DNA, RNA, and protein subtypes into DNA, RNA, and protein, keeping cell line and cell type, and removing other mention types, resulting in 5 mention types".
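
For anyone who wants to see what that collapsing looks like in practice, here is a minimal sketch of the mapping described in the quote. It is not the actual DyGIE++ preprocessing script. The function name `collapse_genia_label` is hypothetical, and the label strings (a `G#` prefix followed by names such as `DNA_domain_or_region`) are an assumption about how the raw GENIA annotations are spelled.

```python
from typing import Optional

# Types kept unchanged, per the quoted description.
KEPT_AS_IS = {"cell_line", "cell_type"}
# Families whose subtypes are collapsed into a single label.
COLLAPSED_PREFIXES = ("DNA", "RNA", "protein")


def collapse_genia_label(label: str) -> Optional[str]:
    """Map a fine-grained GENIA label to one of the 5 kept types, or None."""
    # Assumption: raw labels carry a "G#" namespace prefix; strip it if present.
    name = label[2:] if label.startswith("G#") else label
    if name in KEPT_AS_IS:
        return name
    for prefix in COLLAPSED_PREFIXES:
        # Match the bare family name or any of its subtypes, e.g. "DNA_...".
        if name == prefix or name.startswith(prefix + "_"):
            return prefix
    # Every other mention type is dropped.
    return None


examples = [
    "G#DNA_domain_or_region",
    "G#protein_molecule",
    "G#RNA_molecule",
    "G#cell_line",
    "G#cell_type",
    "G#other_name",
]
collapsed = {label: collapse_genia_label(label) for label in examples}
```

Mentions that map to `None` would simply be removed from the annotations, which leaves the 5 types: DNA, RNA, protein, cell_line, and cell_type.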

dwadden commented 1 year ago

Glad you got it figured out!