RowitZou / topic-dialog-summ

AAAI-2021 paper: Topic-Oriented Spoken Dialogue Summarization for Customer Service with Saliency-Aware Topic Modeling.
MIT License

Question about the dataset format #12

Closed (windhxs closed this issue 2 years ago)

windhxs commented 2 years ago

Hello! Your README says the JSON files have the following form:

{"session": [
    // Utterance
    {
     // Chinese characters
     "content": ["请", "问", "有", "什", "么", "可", "以", "帮", "您"],
     // Chinese Words
     "word": ["请问", "有", "什么", "可以", "帮", "您"],
     // Role info (Agent)
     "type": "客服"
    },

    {"content": ["我", "想", "退", "货"],
     "word": ["我", "想", "退货"],
     // Role info (Customer)
     "type": "客户"}, 

    ...
 ],
 "summary": ["客", "户", "来", "电", "要", "求", "退", "货", "。", ...]
}

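However, the JSON files I get after extracting the archive, whether downloaded from Baidu Netdisk or from Google Drive, look like this: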

{"session": [{"content": ["17363", "17794"], "word": ["7", "47"], "type": "客户"}, ...}

It looks like the data has already been tokenized. Question 1: is that the case?

Question 2: if the data has already been tokenized, then the second step, running preprocess.py, will tokenize it again. Would that cause a problem?

RowitZou commented 2 years ago

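Because the dataset involves users' private data, it has been converted to IDs. Each ID stands for one Chinese character or word; it is not the output of tokenization. If you process your own data into the same format as our dataset, it will run normally. We sincerely apologize for the inconvenience.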
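
For readers preparing their own data, here is a minimal sketch of building one record in the README layout quoted above ("session" holding utterances with "content", "word", and "type", plus a character-level "summary"). The helper names `make_utterance` and `make_sample` are hypothetical and not part of this repository. The sketch assumes the words are already segmented by a tool of your choice, and it does not cover how preprocess.py expects records to be arranged in a file, which is not stated in this thread.

```python
import json


def make_utterance(words, role):
    """Build one utterance dict from pre-segmented words and a role label.

    Hypothetical helper: `words` is a list of already-segmented words,
    `role` is the speaker label, e.g. "客服" (agent) or "客户" (customer).
    """
    text = "".join(words)
    return {
        "content": list(text),  # one entry per character
        "word": list(words),    # one entry per word
        "type": role,
    }


def make_sample(turns, summary_text):
    """Build one dialogue record with a character-level summary."""
    return {
        "session": [make_utterance(words, role) for words, role in turns],
        "summary": list(summary_text),
    }


sample = make_sample(
    [
        (["请问", "有", "什么", "可以", "帮", "您"], "客服"),
        (["我", "想", "退货"], "客户"),
    ],
    "客户来电要求退货。",
)

# ensure_ascii=False keeps the Chinese text readable in the serialized output.
line = json.dumps(sample, ensure_ascii=False)
print(line)
```

The released files use numeric IDs in place of the characters and words, but the fields are the same, so plain-text records built this way follow the same structure.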

windhxs commented 2 years ago

Understood, thanks for the reply.