Data format required for training

san7988 commented 3 years ago

Hi,

I'm trying to train an English model on the SAMSum dataset. As far as I understand, the SAMSum data first has to be converted to the format the code supports, which is:

{"session": [
    // Utterance
    {
     // Chinese characters
     "content": ["请", "问", "有", "什", "么", "可", "以", "帮", "您"],
     // Chinese Words
     "word": ["请问", "有", "什么", "可以", "帮", "您"],
     // Role info (Agent)
     "type": "客服"
    },
    {"content": ["我", "想", "退", "货"],
     "word": ["我", "想", "退货"],
     // Role info (Customer)
     "type": "客户"}, 

    ...
 ],
 "summary": ["客", "户", "来", "电", "要", "求", "退", "货", "。", ...]
}

Could you explain what content and word refer to? Each record in the SAMSum data has this format:

{
 "id": <unique_id>,
 "summary": "summary text",
 "dialogue": "A: some text\r\nB: some text\r\nA: some other text"
}

Thanks

RowitZou commented 3 years ago

content is used for BERT encoding and is processed into BPE tokens. word is used for the topic models and is processed into bag-of-words representations. For English text, content and word are identical: both hold the regular English words of the utterance.

Here is the example:

{"session": [
    // Utterance
    {
     "content": ["some", "text"],
     "word": ["some", "text"],
     // Role info
     "type": "A"
    },
    {"content": ["some", "text"],
     "word": ["some", "text"],
     // Role info
     "type": "B"}, 

    ...
 ],
 "summary": ["summary", "text"]
}
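
For anyone doing the same conversion, here is a minimal Python sketch of how a SAMSum record could be mapped to the structure above. It is not part of this repository: the function name samsum_to_session is hypothetical, plain whitespace splitting stands in for whatever tokenizer you prefer, and lines without a "speaker:" prefix are assumed to continue the previous turn. How the converted records should be written to disk is not covered here.

```python
import json


def samsum_to_session(record):
    """Map one SAMSum record to the session/summary structure (hypothetical helper)."""
    session = []
    # splitlines() handles both "\r\n" and "\n" turn separators.
    for line in record["dialogue"].splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first ":" only, so colons inside the utterance survive.
        speaker, sep, text = line.partition(":")
        if not sep:
            # Assumption: a line without a speaker prefix continues the previous turn.
            if session:
                extra = line.split()
                session[-1]["content"].extend(extra)
                session[-1]["word"].extend(extra)
            continue
        # Assumption: whitespace tokenization; punctuation stays attached to words.
        tokens = text.split()
        session.append({
            "content": tokens,       # same tokens for English text
            "word": list(tokens),    # separate list object, same values
            "type": speaker.strip(),  # role info, here the speaker name
        })
    return {"session": session, "summary": record["summary"].split()}


example = {
    "id": "0",
    "summary": "summary text",
    "dialogue": "A: some text\r\nB: some text\r\nA: some other text",
}
converted = samsum_to_session(example)
print(json.dumps(converted, indent=1))
```

Whitespace splitting keeps the sketch dependency-free; if punctuation should be separated from words, swap in a proper word tokenizer at the two split() calls.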

Nemophilist-art commented 4 days ago

Can this type field be set to something else, or left out?

RowitZou / topic-dialog-summ

Data format required for training #4