[data] Samsum dataset

dohnlee commented 3 years ago

This is the SAMSum corpus, one of the paired datasets available for the abstractive summarization task.

dohnlee commented 3 years ago

dialogue dataset (dialogue - summary pairs)
most dialogues have 2 speakers
train : 14732
val : 818
test : 819

dohnlee commented 3 years ago

Run python data.py

This creates the data/samsum_corpus directory

data
└── samsum_corpus
    ├── README.txt
    ├── licence.txt
    ├── test.json
    ├── train.json
    └── val.json

The download already comes split into train, val, and test files, so the experiment code does not need to split the data itself.
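A minimal sketch of reading the three split files from the directory above. It assumes each file is a JSON array of records with `id`, `summary`, and `dialogue` keys, as the data sample suggests; `load_split` and `load_samsum` are hypothetical helper names, not functions from data.py.

```python
import json
from pathlib import Path


def load_split(path):
    """Load one split file (assumed to be a JSON array of records)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def load_samsum(data_dir="data/samsum_corpus"):
    """Return a dict mapping split name to its list of records.

    The file names train.json, val.json, and test.json follow the
    directory listing in this issue.
    """
    data_dir = Path(data_dir)
    return {
        name: load_split(data_dir / f"{name}.json")
        for name in ("train", "val", "test")
    }
```

After running data.py, `splits = load_samsum()` followed by `len(splits["train"])` is a quick way to check the split sizes against the counts listed earlier in this thread.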

data sample

{'id': '13818513',
 'summary': 'Amanda baked cookies and will bring Jerry some tomorrow.',
 'dialogue': "Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)"}
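
The `dialogue` field is a single string with one turn per line, separated by `\r\n`, and each turn prefixed by the speaker name. A rough sketch of splitting it into (speaker, utterance) pairs, assuming every turn follows the `Name: text` pattern seen in the sample; `split_turns` is a hypothetical helper, not part of the repository.

```python
def split_turns(dialogue):
    """Split a SAMSum-style dialogue string into (speaker, utterance) pairs.

    Assumes each line looks like "Speaker: utterance". Lines without that
    prefix are kept with speaker set to None.
    """
    turns = []
    # splitlines() handles "\r\n" as well as "\n"
    for line in dialogue.splitlines():
        if not line.strip():
            continue
        speaker, sep, utterance = line.partition(": ")
        if sep:
            turns.append((speaker.strip(), utterance.strip()))
        else:
            turns.append((None, line.strip()))
    return turns


sample = (
    "Amanda: I baked  cookies. Do you want some?\r\n"
    "Jerry: Sure!\r\n"
    "Amanda: I'll bring you tomorrow :-)"
)
turns = split_turns(sample)
speakers = {speaker for speaker, _ in turns}
```

For the sample record, `turns` holds three pairs and `speakers` contains Amanda and Jerry, which matches the note that most dialogues have 2 speakers.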

fluent-python-study / project-meat-chatbot-ml

[data] Samsum dataset #3