facebookresearch / tart

Code and model release for the paper "Task-aware Retrieval with Instructions" by Asai et al.

denoising.py #4

Open gunny97 opened 1 year ago

gunny97 commented 1 year ago

Hello, thank you for sharing your awesome work. I have a question related to arguments in denoising.py.

python denoising.py \
    --task_name TASK_NAME \
    --train_file PATH_TO_TRAIN_FILE \
    --test_input_file output_dir/qa_data.json \
    --model_name_or_path PATH_TO_DENOISING_MODEL_NAME \
    --output_dir PATH_TO_OUTPUT_DIR \
    --do_predict \
    --evaluation_strategy steps \
    --max_seq_length 512 --overwrite_cache --top_k 30 \
    --instruction_file berri_instructions.tsv  # only for creating tart-dual training data.

First, does 'task_name' define the task of the dataset, such as 'summarization' or 'sentence phrase', or the name of the dataset, such as 'AGNews' or 'Altlex'? Also, I need help understanding what to pass for 'train_file', 'model_name_or_path', and 'output_dir'. I do not understand which path the 'train_file' argument expects. Does 'output_dir' refer to the 'output_dir' passed to passage_retrieval.py, or to a new path for the denoised results? Can I pass a pre-trained Contriever model as 'model_name_or_path'? According to the paper, ms-marco-MiniLM-L-12-v2 was used, but the source code in 'denoising.py' calls EncT5, which confused me. I would appreciate it if you could clarify these points.

AkariAsai commented 1 year ago

Sorry for getting back to you late! Thank you so much for your patience and interest.

First, does 'task_name' define the task of the dataset, such as 'summarization' or 'sentence phrase', or the name of the dataset, such as 'AGNews' or 'Altlex'?

The task_name field is not actually used in the current version of the script, so you can simply remove the related statements, as mentioned in this issue. I will fix the script.

Also, I needed help understanding what to write on 'train_file,' 'model_name_or_path,' and 'output_dir.'

I will update the help text of the preprocessing function. model_name_or_path is the path to the denoising cross-encoder model, and output_dir is the directory where the denoising predictions will be stored. train_file isn't actually used, and I will clean up the denoising script.
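To make the argument roles above concrete, here is a minimal sketch that assembles the command line from the question. The helper name build_denoising_command and its legacy_args parameter are hypothetical and not part of the repository; only flags that appear in the command above are used. Whether --task_name and --train_file can be dropped depends on whether the script still declares them as required, so the sketch passes them only when asked to.

```python
import shlex


def build_denoising_command(model_path, test_input_file, output_dir,
                            instruction_file=None, legacy_args=None):
    """Assemble the denoising.py command as a list of arguments.

    Hypothetical helper: it only arranges flags quoted in this thread.
    """
    cmd = [
        "python", "denoising.py",
        "--test_input_file", test_input_file,
        # Path to the denoising cross-encoder model.
        "--model_name_or_path", model_path,
        # Directory where the denoising predictions will be stored.
        "--output_dir", output_dir,
        "--do_predict",
        "--evaluation_strategy", "steps",
        "--max_seq_length", "512",
        "--overwrite_cache",
        "--top_k", "30",
    ]
    if instruction_file is not None:
        # Per the original command, only needed for tart-dual training data.
        cmd += ["--instruction_file", instruction_file]
    # Flags described as unused (task_name, train_file); pass placeholder
    # values here if the script still insists on receiving them.
    for flag, value in (legacy_args or {}).items():
        cmd += [f"--{flag}", value]
    return cmd


cmd = build_denoising_command(
    model_path="PATH_TO_DENOISING_MODEL_NAME",
    test_input_file="output_dir/qa_data.json",
    output_dir="PATH_TO_OUTPUT_DIR",
    instruction_file="berri_instructions.tsv",
    legacy_args={"task_name": "TASK_NAME", "train_file": "PATH_TO_TRAIN_FILE"},
)
print(shlex.join(cmd))
```

Leaving legacy_args empty yields the shorter command, which should only be expected to work once the unused arguments have been removed from the script itself.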

Can I pass a pre-trained Contriever model as 'model_name_or_path'? According to the paper, ms-marco-MiniLM-L-12-v2 was used, but the source code in 'denoising.py' calls EncT5, which confused me.

We use both models for denoising: TART-full is used to create the denoised data for the dual-encoder, while MiniLM is used to create the denoised data for TART-full. The current script is a refactored version of the original code and supports only TART-full-based denoising. I will update the preprocessing script.
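The pairing described in this reply can be summarized as a small lookup. This is only an illustrative sketch: the names DENOISER_FOR_TARGET, SUPPORTED_DENOISERS, and denoiser_for are hypothetical, "TART-dual" stands for the dual-encoder model, and the MiniLM checkpoint name is the one quoted in the question rather than something the script defines.

```python
# Which model produces the denoised training data for which target model,
# as described in the reply above.
DENOISER_FOR_TARGET = {
    "TART-dual": "TART-full",
    "TART-full": "ms-marco-MiniLM-L-12-v2",
}

# Per the reply, the refactored script supports only TART-full-based denoising.
SUPPORTED_DENOISERS = {"TART-full"}


def denoiser_for(target):
    """Return (denoiser name, whether the current script supports it)."""
    denoiser = DENOISER_FOR_TARGET[target]
    return denoiser, denoiser in SUPPORTED_DENOISERS


for target in DENOISER_FOR_TARGET:
    name, supported = denoiser_for(target)
    print(f"{target}: denoised with {name} (supported by current script: {supported})")
```

In other words, the current script covers the case of creating training data for the dual-encoder, while MiniLM-based denoising for TART-full would need the original, unrefactored code path.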