PureNatural opened this issue 5 months ago (status: Open)
Do we need to obtain the dataset for this openperf project ourselves?
Or can we use any open-source dataset as our experimental dataset?
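You need to build the dataset yourself, using text data from the open-source ecosystem. For example, you can fetch the README files of open-source repositories and extract the entities that appear in them. Each repository's description can be retrieved through GitHub's official REST API, and GitHub event log data is available from https://www.gharchive.org/, which includes the comment text associated with all issues, PRs, and commits.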
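As a rough illustration of the two data sources mentioned above, here is a minimal Python sketch, not an official openperf script. It assumes the GitHub REST endpoint `GET /repos/{owner}/{repo}/readme` (which returns Base64-encoded content) and the hourly `https://data.gharchive.org/YYYY-MM-DD-H.json.gz` dumps published by GH Archive. The helper names (`readme_url`, `decode_readme`, `gharchive_url`, `extract_texts`) and the choice of event types in `TEXT_FIELDS` are my own assumptions for illustration; payload layouts can differ in older archive files.

```python
import base64
import gzip
import json
import urllib.request
from datetime import datetime

GITHUB_API = "https://api.github.com"
GHARCHIVE = "https://data.gharchive.org"


def fetch(url):
    """Download a URL and return the raw bytes (unauthenticated, rate-limited)."""
    req = urllib.request.Request(url, headers={"Accept": "application/vnd.github+json"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()


def readme_url(owner, repo):
    """URL of the REST endpoint that returns a repository's README."""
    return f"{GITHUB_API}/repos/{owner}/{repo}/readme"


def decode_readme(api_response):
    """Decode the Base64 'content' field of a parsed README response."""
    return base64.b64decode(api_response["content"]).decode("utf-8", errors="replace")


def gharchive_url(hour):
    """URL of the GH Archive dump for one hour (the hour is not zero-padded)."""
    return f"{GHARCHIVE}/{hour:%Y-%m-%d}-{hour.hour}.json.gz"


# Assumed mapping: event type -> (payload key, text field) holding free text.
TEXT_FIELDS = {
    "IssuesEvent": ("issue", "body"),
    "IssueCommentEvent": ("comment", "body"),
    "PullRequestEvent": ("pull_request", "body"),
    "PullRequestReviewCommentEvent": ("comment", "body"),
    "CommitCommentEvent": ("comment", "body"),
}


def extract_texts(raw_gz):
    """Return (repo_name, text) pairs from one gzipped GH Archive dump."""
    texts = []
    for line in gzip.decompress(raw_gz).splitlines():
        if not line.strip():
            continue
        event = json.loads(line)  # one JSON event per line
        path = TEXT_FIELDS.get(event.get("type"))
        if path is None:
            continue  # event type carries no text we care about
        obj = event.get("payload", {}).get(path[0]) or {}
        body = obj.get(path[1])
        if body:
            texts.append((event.get("repo", {}).get("name"), body))
    return texts
```

Typical use would be `decode_readme(json.loads(fetch(readme_url(owner, repo))))` for a README and `extract_texts(fetch(gharchive_url(datetime(2024, 1, 5, 7))))` for one hour of events. Unauthenticated GitHub API requests are rate-limited, so a real crawl would likely need an access token.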
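The final dataset should look similar to the figure below:
You can define the entity types yourself. In short, any named entity task set in the open-source community context is acceptable.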
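Since the entity types are left open, one common way to store such annotations is the token-level BIO scheme. The sketch below is only an assumption about a possible format, not the layout shown in the original figure; the labels `LIBRARY` and `LANGUAGE` and the helper names are hypothetical.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end_exclusive, label) token spans into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"        # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"        # remaining tokens of the entity
    return tags


def to_conll(tokens, tags):
    """Render one sentence as tab-separated 'token<TAB>tag' lines."""
    return "\n".join(f"{tok}\t{tag}" for tok, tag in zip(tokens, tags))


tokens = "Install TensorFlow with Python 3".split()
spans = [(1, 2, "LIBRARY"), (3, 5, "LANGUAGE")]  # hypothetical entity types
tags = spans_to_bio(tokens, spans)
print(to_conll(tokens, tags))
```

Storing one token and tag per line keeps the data compatible with most sequence-labeling tools, whichever entity types you end up defining.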
You can refer to this paper: https://aclanthology.org/2020.acl-main.443/
Description
This task aims to construct a named entity recognition (NER) dataset for the open-source community and to implement corresponding methods. Participants will collect and annotate textual data from the open-source community, especially content containing named entities, to create a dataset for training and evaluating NER models. They will also explore and implement various NER methods, such as rule-based, statistical, or deep learning approaches, to improve the performance and applicability of the models.
The relevant code and dataset for this task need to be provided in the repository.
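As one possible starting point for the rule-based approach mentioned in the description, a gazetteer (dictionary lookup) tagger can serve as a simple baseline. This is a minimal sketch under my own assumptions: the gazetteer entries, the labels, and the function name `gazetteer_tag` are hypothetical, and matching is greedy longest-match on lowercased whitespace tokens.

```python
def gazetteer_tag(tokens, gazetteer):
    """Tag tokens with BIO labels using greedy longest-match dictionary lookup.

    gazetteer maps a tuple of lowercase tokens to an entity label.
    """
    max_len = max((len(key) for key in gazetteer), default=0)
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = gazetteer.get(tuple(lowered[i:i + n]))
            if label:
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{label}"
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


# Hypothetical gazetteer; a real one could be built from package indexes or license lists.
gazetteer = {
    ("pytorch",): "LIBRARY",
    ("apache", "license"): "LICENSE",
    ("apache",): "ORG",
}
tokens = "PyTorch is under the Apache License".split()
print(list(zip(tokens, gazetteer_tag(tokens, gazetteer))))
```

A baseline like this gives a reference score against which statistical or deep learning models can be compared on the same annotated data.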