Closed san7988 closed 2 years ago
content is used for BERT encoding and will be processed into BPE tokens. word is used for topic models and will be processed into bag-of-words representations. For English texts, content and word are actually the same: both hold regular English words.
Here is the example:
{
  "session": [
    // Utterance
    {
      "content": ["some", "text"],
      "word": ["some", "text"],
      // Role info
      "type": "A"
    },
    {
      "content": ["some", "text"],
      "word": ["some", "text"],
      // Role info
      "type": "B"
    },
    ...
  ],
  "summary": ["summary", "text"]
}
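To make the conversion concrete, here is a minimal sketch of turning one SAMSum record into the session format above. It assumes each line of the SAMSum dialogue field looks like "Speaker: utterance text" and uses plain whitespace tokenization; the speaker-to-type letter mapping (A, B, ...) and the function name convert_record are assumptions for illustration, not part of the repo.

```python
import json

def convert_record(record):
    """Convert one SAMSum record ({"dialogue": ..., "summary": ...})
    into the {"session": [...], "summary": [...]} format."""
    session = []
    roles = {}
    for line in record["dialogue"].splitlines():
        if ":" not in line:
            continue
        speaker, text = line.split(":", 1)
        speaker = speaker.strip()
        # Assign a role letter per speaker: first seen -> "A", next -> "B", ...
        if speaker not in roles:
            roles[speaker] = chr(ord("A") + len(roles))
        tokens = text.strip().split()
        session.append({
            "content": tokens,   # for BERT encoding (BPE-tokenized later)
            "word": tokens,      # for the topic model; identical for English
            "type": roles[speaker],
        })
    return {"session": session, "summary": record["summary"].split()}

record = {
    "dialogue": "Amanda: see you tomorrow\nJerry: sure, see you",
    "summary": "Amanda and Jerry will meet tomorrow.",
}
print(json.dumps(convert_record(record), indent=2))
```

Note that content and word are filled with the same token list here, per the explanation above; a tokenizer other than str.split may be needed to match the repo's preprocessing.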
Can type be set to something else, or omitted?
Hi,
I'm trying to train an English model using the SAMSum dataset. Based on my understanding so far, we'll have to convert the SAMSum data into the format that the code supports, which is:
Can you please help me understand what content and word refer to? Each record in the SAMSum data is in the format:
Thanks