How to create a corpus like cornell movie format? #2

domerin0 / neural-chatbot

A chatbot based on seq2seq architecture done with tensorflow.


How to create a corpus like cornell movie format? #2 #21

Closed mrmrn closed 6 years ago

mrmrn commented 7 years ago

How can I create a corpus like the Cornell movie corpus? I have a bunch of questions and answers in a MySQL database and I want to use them in this code.

domerin0 commented 7 years ago

You will want to export your data into text files. Each text file represents a conversation. Each line in the conversation is a response to the previous line.
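As a rough illustration of that layout, here is a minimal Python sketch. It assumes the rows have already been fetched from MySQL as `(conversation_id, turn_index, text)` tuples; that row shape, the `write_conversations` name, and the `<conversation_id>.txt` file naming are all hypothetical and not something this repository prescribes.

```python
import os
from collections import defaultdict


def write_conversations(rows, out_dir):
    """Write one UTF-8 text file per conversation, one turn per line.

    rows: iterable of (conversation_id, turn_index, text) tuples.
    This row shape is an assumption about how the data was exported.
    """
    conversations = defaultdict(list)
    for conv_id, turn_index, text in rows:
        conversations[conv_id].append((turn_index, text))

    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for conv_id, turns in conversations.items():
        turns.sort()  # order the turns by turn_index
        path = os.path.join(out_dir, f"{conv_id}.txt")  # hypothetical naming
        with open(path, "w", encoding="utf-8") as f:
            for _, text in turns:
                # Collapse internal whitespace so each turn stays on one line.
                f.write(" ".join(text.split()) + "\n")
        paths.append(path)
    return paths
```

Writing with `encoding="utf-8"` keeps non-Latin text such as Persian intact regardless of the platform default encoding.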

If you can think of a better way to represent the data let me know. I am updating the sample.py over the next few days as well, so note that any models you train can't currently be sampled (temporary for now).

mrmrn commented 7 years ago

Thank you for your response. I have about 100,000 questions in Persian, and according to your answer I must create 100,000 txt files in this format:

 fkljhdasklfhsdafsa.
sadfhsadfjsakfjsa.
 fasdkjfsadkfjsadf.

The first line is the question. The second is the first answer. The third is the reply to the first answer, and the fourth line is the reply to line 3.

Sometimes a file may have 4 lines, and sometimes just 2 lines.

Is this OK? I hope this code will work without errors on a Persian dataset. If it works, this will be the most complete chatbot in the Persian language so far.

domerin0 commented 7 years ago

@mrmrn, sorry I missed your response to my comment, and am just seeing this now. You are correct, for single question-answer pairings that is exactly how to format it.

I'm not sure. It should work with Persian, but it may not, since the tokenizer makes assumptions about how tokens are delimited. The code is meant to be language agnostic.
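To make the delimiter concern concrete, here is a small Python sketch of a Unicode-aware tokenizer. It is not this repository's tokenizer, and the `tokenize` name and the pattern are assumptions for illustration only. It treats the zero-width non-joiner (U+200C), which Persian uses inside words, as part of a word, and splits off punctuation such as the Arabic comma and question mark.

```python
import re

# Hypothetical pattern, not taken from this repository:
# a run of word characters (plus the zero-width non-joiner U+200C),
# or a single non-word, non-space character such as punctuation.
TOKEN_RE = re.compile(r"[\w\u200c]+|[^\w\s]")


def tokenize(sentence):
    """Split a sentence into word and punctuation tokens."""
    return TOKEN_RE.findall(sentence)
```

A tokenizer that only splits on ASCII punctuation or breaks at U+200C would cut some Persian words in two, so it is worth running a few sample sentences through whatever tokenizer is actually used before training.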

mrmrn commented 6 years ago

thank you very much