Closed by mrmrn 6 years ago
You will want to export your data into text files, one file per conversation. Each line in a file is a response to the line before it.
If you can think of a better way to represent the data, let me know. I am also updating sample.py over the next few days, so note that any models you train can't be sampled for now (this is temporary).
Thank you for your response. I have about 100,000 questions in Persian, and according to your answer I must create 100,000 txt files in this format:
fkljhdasklfhsdafsa.
sadfhsadfjsakfjsa.
fasdkjfsadkfjsadf.
The first line is the question, the second is the first answer, the third is a reply to the first answer, and the fourth is a reply to line 3.
Sometimes there may be 4 lines, and sometimes just 2.
Is this OK? I hope this code will work without errors on a Persian dataset. If it does, this will be the most complete Persian-language chatbot so far.
@mrmrn, sorry I missed your response to my comment, and am just seeing this now. You are correct, for single question-answer pairings that is exactly how to format it.
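To make the layout concrete, here is a minimal sketch of writing conversations out as one text file per conversation with one turn per line. The function name `write_conversations`, the `conversation_000000.txt` naming scheme, and the choice to skip conversations with fewer than two lines are all assumptions for illustration and are not part of this repository.

```python
from pathlib import Path


def write_conversations(conversations, out_dir):
    """Write each conversation to its own UTF-8 text file, one turn per line.

    `conversations` is a list of lists of strings (hypothetical input shape).
    Returns the list of paths that were written.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for index, turns in enumerate(conversations):
        # Collapse internal whitespace so every turn stays on a single line.
        lines = [" ".join(turn.split()) for turn in turns]
        lines = [line for line in lines if line]
        if len(lines) < 2:
            continue  # assumption: a lone line has no reply, so skip it
        target = out_path / f"conversation_{index:06d}.txt"
        target.write_text("\n".join(lines) + "\n", encoding="utf-8")
        written.append(target)
    return written
```

Writing with an explicit `encoding="utf-8"` matters for Persian text, since the platform default encoding may not be UTF-8.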
I'm not sure about Persian. It should work, but it may not, since the tokenizer makes assumptions about how tokens are delimited. The code is meant to be language agnostic.
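One way to check the delimiter concern before training is to run a few Persian sentences through simple tokenizers and look at the result. The sketch below is not this repository's tokenizer; it only compares two generic strategies (plain whitespace splitting and a `\w+` regex) on a word containing a zero-width non-joiner (U+200C), which Persian uses inside many words. The function names are made up for illustration.

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner, used inside many Persian words


def whitespace_tokens(text):
    """Split on whitespace only."""
    return text.split()


def word_char_tokens(text):
    """Keep runs of Unicode word characters, dropping everything else."""
    return re.findall(r"\w+", text)


# "I want" in Persian; the second word contains a ZWNJ.
sample = "من می" + ZWNJ + "خواهم"
print(whitespace_tokens(sample))
print(word_char_tokens(sample))
```

Whitespace splitting keeps the second word whole, while the `\w+` pattern breaks it at the ZWNJ, so a tokenizer built on word-character matching may produce more fragments for Persian than for English.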
Thank you very much.
How can I create a corpus like the Cornell movie corpus? I have a bunch of questions and answers in MySQL format and I want to use them in this code.
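A possible starting point, sketched under assumptions: the thread does not describe the MySQL schema, so the code below assumes each row is a `(conversation_id, position, text)` tuple, for example as returned by `cursor.fetchall()` from any Python DB-API driver. The column layout and the function name `rows_to_conversations` are hypothetical. It groups the rows into ordered conversations, which can then be written out one file per conversation, one turn per line.

```python
from collections import defaultdict


def rows_to_conversations(rows):
    """Group (conversation_id, position, text) rows into ordered lists of turns.

    The three-column row shape is an assumption about the database export.
    """
    grouped = defaultdict(list)
    for conversation_id, position, text in rows:
        grouped[conversation_id].append((position, text))

    conversations = {}
    for conversation_id, turns in grouped.items():
        # Order turns by their position within the conversation.
        turns.sort(key=lambda item: item[0])
        # Collapse newlines and extra spaces so each turn fits on one line.
        conversations[conversation_id] = [" ".join(text.split()) for _, text in turns]
    return conversations
```

If the table only stores question and answer pairs, each pair becomes a two-line conversation with the question at position 1 and the answer at position 2.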