denoising.py - Githubissues

facebookresearch / tart

Code and model release for the paper "Task-aware Retrieval with Instructions" by Asai et al.

Hello, thank you for sharing your awesome work. I have a question related to arguments in denoising.py.

python denoising.py \
    --task_name TASK_NAME \
    --train_file PATH_TO_TRAIN_FILE \
    --test_input_file output_dir/qa_data.json \
    --model_name_or_path PATH_TO_DENOISING_MODEL_NAME \
    --output_dir PATH_TO_OUTPUT_DIR \
    --do_predict \
    --evaluation_strategy steps \
    --max_seq_length 512 --overwrite_cache --top_k 30 \
    --instruction_file berri_instructions.tsv  # only for creating tart-dual training data.

First, does 'task_name' refer to the task of the dataset, such as 'summarization' or 'sentence phrase', or to the name of the dataset, such as 'AGNews' or 'Altlex'? I also need help understanding what to pass for 'train_file', 'model_name_or_path', and 'output_dir'. I do not understand which path 'train_file' should point to. Does 'output_dir' refer to the 'output_dir' passed to passage_retrieval.py, or to a new path for the denoised results? Can I pass a pre-trained Contriever model as 'model_name_or_path'? According to the paper, ms-marco-MiniLM-L-12-v2 was used, but the source code in 'denoising.py' calls EncT5, which confused me. I would appreciate any clarification.

Sorry for getting back to you late! Thank you so much for your patience and interest.

> First, does 'task_name' refer to the task of the dataset, such as 'summarization' or 'sentence phrase', or to the name of the dataset, such as 'AGNews' or 'Altlex'?

The task_name field is not actually used in the current version of the script, so you can simply remove the related lines, as mentioned in this issue. I will fix the script.

> I also need help understanding what to pass for 'train_file', 'model_name_or_path', and 'output_dir'.

I will update the help text of the preprocessing function. model_name_or_path is the denoising cross-encoder model, and output_dir is the directory where the denoising predictions will be stored. train_file isn't actually used, and I will clean up the denoising script.

> Can I pass a pre-trained Contriever model as 'model_name_or_path'? According to the paper, ms-marco-MiniLM-L-12-v2 was used, but the source code in 'denoising.py' calls EncT5, which confused me.

We use both models for denoising (i.e., TART-full is used to create denoised data for dual-encoder, while miniLM is used to create denoised data for TART-full). The current script is a refactored version of the original code and only supports the TART-full-based denoising. I will update the preprocessing script.
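
For readers unfamiliar with cross-encoder denoising, here is a minimal sketch of the general idea, not the code in denoising.py. It assumes that the top_k retrieved passages for each query are scored by a cross-encoder, that high-scoring passages are kept as positives, and that low-scoring ones are kept as hard negatives. The function names, the two thresholds, and the toy token-overlap scorer (a stand-in for TART-full or MiniLM) are all hypothetical and chosen only so that the example runs without downloading a model.

```python
from typing import Callable, Dict, List

# A scorer takes (query, passage) and returns a relevance score.
# In practice this would wrap a cross-encoder; here it is a pluggable callable.
Scorer = Callable[[str, str], float]


def toy_overlap_scorer(query: str, passage: str) -> float:
    """Hypothetical stand-in for a cross-encoder:
    the fraction of query tokens that also appear in the passage."""
    q_tokens = set(query.lower().split())
    p_tokens = set(passage.lower().split())
    if not q_tokens:
        return 0.0
    return len(q_tokens & p_tokens) / len(q_tokens)


def denoise(
    query: str,
    passages: List[str],
    scorer: Scorer,
    top_k: int = 30,             # assumed meaning: number of retrieved passages to score
    pos_threshold: float = 0.8,  # hypothetical cutoff for positives
    neg_threshold: float = 0.2,  # hypothetical cutoff for hard negatives
) -> Dict[str, List[str]]:
    """Split the top_k retrieved passages into positives and hard negatives.
    Passages with scores between the two thresholds are treated as ambiguous and dropped."""
    scored = [(p, scorer(query, p)) for p in passages[:top_k]]
    positives = [p for p, s in scored if s >= pos_threshold]
    hard_negatives = [p for p, s in scored if s <= neg_threshold]
    return {"positives": positives, "hard_negatives": hard_negatives}


query = "capital of france"
passages = [
    "Paris is the capital of France",
    "Berlin is the capital of Germany",
    "Bananas are rich in potassium",
]
result = denoise(query, passages, toy_overlap_scorer)
print(result)
```

With the toy scorer, the first passage is kept as a positive, the third as a hard negative, and the second (partial overlap) is dropped as ambiguous. Replacing toy_overlap_scorer with a real cross-encoder would only require a callable with the same (query, passage) signature.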

denoising.py #4