X-lab2017 / open-perf

Benchmark suite for large-scale socio-technical datasets in open collaboration
MIT License
11 stars 18 forks

[OSS101] Task 3: Construction of Dataset and Method Implementation for Named Entity Recognition in the Open Source Community #59

Open PureNatural opened 5 months ago

PureNatural commented 5 months ago

Description

This task aims to build a named entity recognition (NER) dataset for the open-source community and to implement corresponding methods. By collecting and annotating text from the open-source community, especially content that contains named entities, you will create a dataset for training and evaluating NER models. You will also explore and implement various NER methods, such as rule-based, statistical, or deep learning approaches, to improve the performance and applicability of the models.

The relevant code and dataset for this task need to be provided in the repository.
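As a rough illustration of the rule-based end of that range, here is a minimal regex tagger sketch. The entity types (`REPO`, `VERSION`, `LICENSE`), the patterns, and the `tag_entities` helper are all assumptions made for this example; the task does not prescribe them.

```python
import re

# Hypothetical entity types and patterns, chosen only for illustration.
PATTERNS = [
    ("REPO", re.compile(r"\b[A-Za-z0-9][A-Za-z0-9-]*/[A-Za-z0-9._-]+\b")),
    ("VERSION", re.compile(r"\bv?\d+\.\d+(?:\.\d+)?\b")),
    ("LICENSE", re.compile(r"\b(?:MIT|Apache-2\.0|GPL-3\.0|BSD-3-Clause)\b")),
]


def tag_entities(text):
    """Return non-overlapping entity spans found by the regex rules."""
    candidates = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            candidates.append((match.start(), match.end(), label))

    # Earliest start first; for equal starts, prefer the longer match.
    candidates.sort(key=lambda c: (c[0], -(c[1] - c[0])))

    entities, last_end = [], 0
    for start, end, label in candidates:
        if start >= last_end:  # skip spans that overlap an accepted one
            entities.append(
                {"start": start, "end": end, "label": label, "text": text[start:end]}
            )
            last_end = end
    return entities
```

Calling `tag_entities("X-lab2017/open-perf v1.2.0 is released under the MIT license.")` yields one `REPO`, one `VERSION`, and one `LICENSE` span. Rules like these are cheap but brittle (the `REPO` pattern also fires on phrases such as "and/or"), so they are probably best treated as a baseline or a pre-annotation aid rather than a final method.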

YeexiaoZheng commented 4 months ago

Do we need to collect the dataset for this openperf project ourselves?

YeexiaoZheng commented 4 months ago

Or can we use any open-source dataset as our experimental dataset?

PureNatural commented 4 months ago

> Or can we use any open-source dataset as our experimental dataset?

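You need to build the dataset yourself, using text data from the open-source ecosystem. For example, you could collect the README documents of open-source repositories and extract the entities they contain. The description of each repository can be retrieved through the official GitHub REST API, and GitHub event log data is available from https://www.gharchive.org/, which includes the comment text for all issues, PRs, and commits.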
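A minimal sketch of how those two sources could be read with only the Python standard library. The helper names and the `User-Agent` string are made up for this example; the endpoints (`/repos/{owner}/{repo}`, `/repos/{owner}/{repo}/readme`) and the hourly `data.gharchive.org` dumps are the public ones, but unauthenticated REST calls are rate-limited, so a real crawl would likely need a token.

```python
import base64
import gzip
import io
import json
import urllib.request

GITHUB_API = "https://api.github.com"
# Hourly dump, e.g. hour = "2024-01-01-15" (the hour is not zero-padded).
GHARCHIVE_URL = "https://data.gharchive.org/{hour}.json.gz"

# Event types whose payload carries a free-form comment body.
COMMENT_EVENTS = {
    "IssueCommentEvent",
    "CommitCommentEvent",
    "PullRequestReviewCommentEvent",
}


def _get(url):
    """Download a URL and return the raw response bytes."""
    req = urllib.request.Request(url, headers={"User-Agent": "oss-ner-sketch"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()


def fetch_repo_description(owner, repo):
    """Return the short description of a repository ('' if it has none)."""
    data = json.loads(_get(f"{GITHUB_API}/repos/{owner}/{repo}"))
    return data.get("description") or ""


def fetch_readme(owner, repo):
    """Return the README text; the API delivers it base64-encoded."""
    data = json.loads(_get(f"{GITHUB_API}/repos/{owner}/{repo}/readme"))
    return base64.b64decode(data["content"]).decode("utf-8", errors="replace")


def iter_comment_texts(lines):
    """Yield (repo_name, comment_body) from GH Archive JSON lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("type") in COMMENT_EVENTS:
            body = event.get("payload", {}).get("comment", {}).get("body")
            if body:
                yield event.get("repo", {}).get("name", ""), body


def fetch_gharchive_comments(hour):
    """Download one hourly GH Archive file and yield its comment texts."""
    raw = _get(GHARCHIVE_URL.format(hour=hour))
    with gzip.open(io.BytesIO(raw), "rt", encoding="utf-8") as fh:
        yield from iter_comment_texts(fh)
```

For instance, `fetch_readme("X-lab2017", "open-perf")` would return this repository's README text, and `fetch_gharchive_comments("2024-01-01-15")` would stream the comments from one hour of activity. Issue and PR bodies live under other event types (`IssuesEvent`, `PullRequestEvent`) and would need a similar extractor.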

The final dataset should look similar to the figure below:

image

You can define the entity types yourself. In short, any named entity task set in the open-source community context is acceptable.
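Since the screenshot is not reproduced in this text, the exact target layout is unknown. One common choice for NER datasets is token-level BIO tagging, and the sketch below assumes that format. The `spans_to_bio` helper, the whitespace tokenization, and the `FRAMEWORK`/`VERSION` labels are assumptions for illustration only.

```python
import re


def spans_to_bio(text, spans):
    """Convert character-level spans to (token, BIO tag) pairs.

    `spans` is a list of non-overlapping (start, end, label) offsets.
    Tokens are split on whitespace, which is a simplification.
    """
    rows = []
    for token in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            # A token belongs to a span if their character ranges overlap.
            if token.start() < end and token.end() > start:
                prefix = "B-" if token.start() <= start else "I-"
                tag = prefix + label
                break
        rows.append((token.group(), tag))
    return rows


def to_conll(rows):
    """Render (token, tag) pairs as tab-separated lines, one token per line."""
    return "\n".join(f"{token}\t{tag}" for token, tag in rows)
```

With `text = "Build with Apache Spark 3.5 today"` and spans covering "Apache Spark" as `FRAMEWORK` and "3.5" as `VERSION`, the tokens are tagged `O O B-FRAMEWORK I-FRAMEWORK B-VERSION O`. Keeping the character offsets alongside the BIO file makes it easier to re-tokenize later for a different model.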

PureNatural commented 4 months ago

You can refer to this paper: https://aclanthology.org/2020.acl-main.443/