Open · nonstopfor opened this issue 4 years ago
Currently the data is extracted through ParlAI using ParlAIExtractor. If you run the standalone script (online_dialog_eval/data.py) then you'll see the data format the code expects.
Could you give a complete pipeline? Suppose I have original train, valid, and test data (just some raw dialogs). What are the steps I need to fine-tune MaUDE and run inference?
Could you tell me which function in ParlAIExtractor is used to read data from the file? Because finding the data format in more than 1000 lines of code in data.py is really hard work...
Also, when computing backtranslation and corruption files, what should the data format be?
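(Not the repo's actual code, but to illustrate what a "corruption" negative sample typically means in this setting: a toy sketch that shuffles the word order of a true response. The function name and strategy here are assumptions; MaUDE's own corruption transformations may differ.)

```python
import random

def corrupt_response(response: str, seed: int = 0) -> str:
    """Toy corruption: shuffle the word order of a true response to
    produce a syntactically broken negative sample. Illustrative only;
    the repo's actual corruption strategies may differ."""
    rng = random.Random(seed)
    words = response.split()
    rng.shuffle(words)
    return " ".join(words)

# The corrupted string keeps the same words but (usually) a broken order.
print(corrupt_response("i like to read books on weekends"))
```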
The `extract_interactions` method of ParlAIExtractor is used to build the data. I would suggest reading the ParlAI docs to understand how the data is internally represented, as this repo is heavily dependent on it (as in, we don't have a standard input/output file format).
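For readers unfamiliar with ParlAI: it represents dialog data as episodes of turn dictionaries with fields like `text`, `labels`, and `episode_done`. Below is a minimal sketch of that shape and of flattening an episode into (context, response) pairs; the `flatten_episode` helper is hypothetical, not the repo's actual code.

```python
# A minimal episode in ParlAI's internal representation: each turn has the
# other speaker's 'text', the gold 'labels', and 'episode_done' marks the end.
episode = [
    {"text": "hi , how are you ?", "labels": ["i am good , thanks !"], "episode_done": False},
    {"text": "what do you do for fun ?", "labels": ["i like to read ."], "episode_done": True},
]

def flatten_episode(episode):
    """Hypothetical helper: flatten an episode into (context, response)
    pairs, the shape a dialog evaluator typically consumes."""
    pairs, context = [], []
    for turn in episode:
        context.append(turn["text"])
        response = turn["labels"][0]
        pairs.append((" ".join(context), response))
        context.append(response)
    return pairs

for ctx, resp in flatten_episode(episode):
    print(resp)
```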
@nonstopfor I just released all of the data used and processed for the PersonaChat dialogs (backtranslation / corruption), which is linked in this readme. You can see the data format from these files.
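One quick way to see the format is to open a released .csv with the standard library and inspect its columns. The column names in this miniature sample are guesses purely for illustration; check the actual files from the readme for the real schema.

```python
import csv
import io

# Hypothetical miniature of one of the released CSVs; the column names
# here are assumptions, not the repo's actual schema.
sample = io.StringIO(
    "context,true_response,backtranslated_response\n"
    "hi how are you,i am fine thanks,i am well thank you\n"
)
rows = list(csv.DictReader(sample))
print(rows[0].keys())            # column names = the data format
print(rows[0]["true_response"])  # -> i am fine thanks
```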
In this directory, which is the original data file? And where do these .csv files come from?
I want to fine-tune MaUDE on my own data and use the fine-tuned model for inference, but I don't know the right data format (for both training and test data). Does anyone know?