Open · nonstopfor opened this issue 4 years ago
Currently the data is extracted through ParlAI using ParlAIExtractor. If you run the standalone script (online_dialog_eval/data.py) then you'll see the data format the code expects.
Could you give a complete pipeline? Suppose I have original train, valid, and test data (just some raw dialogs). What are the steps I need to fine-tune MaUDE and run inference?
Could you tell me which function in ParlAIExtractor is used to read data from the file? Because finding the data format in more than 1000 lines of code in data.py is really hard work...
Also, when computing backtranslation and corruption files, what should the data format be?
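(Not the repo's actual code, but to illustrate what a "corruption" negative sample typically means in this setting: a toy sketch that shuffles the word order of a true response. The function name and strategy here are assumptions; MaUDE's own corruption transformations may differ.)

```python
import random

def corrupt_response(response: str, seed: int = 0) -> str:
    """Toy corruption: shuffle the word order of a true response to
    produce a syntactically broken negative sample. Illustrative only;
    the repo's actual corruption strategies may differ."""
    rng = random.Random(seed)
    words = response.split()
    rng.shuffle(words)
    return " ".join(words)

# The corrupted string keeps the same words but (usually) a broken order.
print(corrupt_response("i like to read books on weekends"))
```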
The `extract_interactions` method of ParlAIExtractor is used to build the data. I would suggest reading the ParlAI docs to understand how the data is internally represented, as this repo is heavily dependent on it (as in, we don't have a standard input/output file format).
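For readers unfamiliar with ParlAI: it represents dialog data as episodes of turn dictionaries with fields like `text`, `labels`, and `episode_done`. Below is a minimal sketch of that shape and of flattening an episode into (context, response) pairs; the `flatten_episode` helper is hypothetical, not the repo's actual code.

```python
# A minimal episode in ParlAI's internal representation: each turn has the
# other speaker's 'text', the gold 'labels', and 'episode_done' marks the end.
episode = [
    {"text": "hi , how are you ?", "labels": ["i am good , thanks !"], "episode_done": False},
    {"text": "what do you do for fun ?", "labels": ["i like to read ."], "episode_done": True},
]

def flatten_episode(episode):
    """Hypothetical helper: flatten an episode into (context, response)
    pairs, the shape a dialog evaluator typically consumes."""
    pairs, context = [], []
    for turn in episode:
        context.append(turn["text"])
        response = turn["labels"][0]
        pairs.append((" ".join(context), response))
        context.append(response)
    return pairs

for ctx, resp in flatten_episode(episode):
    print(resp)
```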
@nonstopfor I just released all of the data used and processed for the PersonaChat dialogs (backtranslation / corruption), which is linked in this readme. You can see the data format from these files.
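One quick way to see the format is to open a released .csv with the standard library and inspect its columns. The column names in this miniature sample are guesses purely for illustration; check the actual files from the readme for the real schema.

```python
import csv
import io

# Hypothetical miniature of one of the released CSVs; the column names
# here are assumptions, not the repo's actual schema.
sample = io.StringIO(
    "context,true_response,backtranslated_response\n"
    "hi how are you,i am fine thanks,i am well thank you\n"
)
rows = list(csv.DictReader(sample))
print(rows[0].keys())            # column names = the data format
print(rows[0]["true_response"])  # -> i am fine thanks
```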
In this directory, which is the original data file? And where do these .csv files come from?
I want to fine-tune MaUDE on my own data and use the fine-tuned model for inference, but I don't know the right data format (for both training and test data). Does anyone know?