Open aqsa27 opened 2 years ago
Hi @aqsa27, sorry for the delay. Could you please try this quick toy example:
wget https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/fiqa.zip
unzip fiqa.zip
export dataset="fiqa"
python -m gpl.train \
--path_to_generated_data "generated/$dataset" \
--base_ckpt "distilbert-base-uncased" \
--gpl_score_function "dot" \
--batch_size_gpl 4 \
--gpl_steps 100 \
--new_size 10 \
--queries_per_passage 1 \
--output_dir "output/$dataset" \
--evaluation_data "./$dataset" \
--evaluation_output "evaluation/$dataset" \
--generator "BeIR/query-gen-msmarco-t5-base-v1" \
--retrievers "msmarco-distilbert-base-v3" "msmarco-MiniLM-L-6-v3" \
--retriever_score_functions "cos_sim" "cos_sim" \
--cross_encoder "cross-encoder/ms-marco-MiniLM-L-6-v2" \
--qgen_prefix "qgen"
I have just tried this (building the env from scratch) and it works. Please keep the same data format and argument format as in the example (e.g. make sure --path_to_generated_data gets a directory instead of a file).
@aqsa27 your target corpus should be called 'corpus.jsonl' to start with. Also, you need to have the folder 'generated/CustomDB'. The times I got that "mnrl positional" error, it was either because I had the BeIR format but the file was not specifically called corpus.jsonl, or because the folder structure was off.
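The layout described above (a directory such as generated/CustomDB that contains a file named exactly corpus.jsonl) can be set up with a small script. This is only a sketch: prepare_generated_dir is a hypothetical helper and not part of gpl, and the field names "_id", "title" and "text" are my assumption about the BeIR corpus format, so adjust them if your data loader expects something else.

```python
import json
from pathlib import Path

# Assumed BeIR corpus fields: "_id" and "text" are treated as mandatory here,
# and "title" is filled with an empty string when absent.
MANDATORY_FIELDS = ("_id", "text")


def prepare_generated_dir(source_corpus, generated_root, dataset_name):
    """Copy a JSON Lines corpus to <generated_root>/<dataset_name>/corpus.jsonl.

    Hypothetical helper (not part of gpl). Raises ValueError if a record
    lacks one of the assumed mandatory fields.
    """
    records = []
    with open(source_corpus, encoding="utf-8") as src:
        for line_no, line in enumerate(src, start=1):
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            missing = [f for f in MANDATORY_FIELDS if f not in record]
            if missing:
                raise ValueError(f"line {line_no}: missing fields {missing}")
            record.setdefault("title", "")
            records.append(record)

    # Create the directory only after the whole corpus has been validated.
    target_dir = Path(generated_root) / dataset_name
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "corpus.jsonl"
    with open(target, "w", encoding="utf-8") as dst:
        for record in records:
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
    return target
```

Calling prepare_generated_dir("my_docs.jsonl", "generated", "CustomDB") would then give you a directory to pass as path_to_generated_data, i.e. "generated/CustomDB" rather than the path of the file itself.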
Thanks for sharing your experience, @ahadda5, and for pointing out the misleading part! I will add some assertions to give hints and make this clearer.
Now the mnrl_* arguments are set to None by default, so one will no longer be bothered by MNRL (i.e. the baseline QGen) issues: https://github.com/UKPLab/gpl/pull/12
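To illustrate why default values of None make the "missing 2 required positional arguments" error go away, here is a toy reproduction. Both functions are hypothetical stand-ins written for this example; they only mimic the shape of the signature and are not gpl's actual train().

```python
def train_without_defaults(path_to_generated_data, mnrl_output_dir, mnrl_evaluation_output):
    """Toy stand-in: every parameter is required."""
    return path_to_generated_data, mnrl_output_dir, mnrl_evaluation_output


def train_with_defaults(path_to_generated_data, mnrl_output_dir=None, mnrl_evaluation_output=None):
    """Toy stand-in: the mnrl_* parameters are optional."""
    return path_to_generated_data, mnrl_output_dir, mnrl_evaluation_output


# Omitting the mnrl_* arguments raises a TypeError when they have no defaults.
try:
    train_without_defaults(path_to_generated_data="generated/fiqa")
except TypeError as err:
    error_message = str(err)
else:
    error_message = None

# With None defaults, the same call succeeds.
result = train_with_defaults(path_to_generated_data="generated/fiqa")

print(error_message)
print(result)
```

In this toy version the first call fails for the same reason as in the report, and the second call simply leaves the two mnrl_* values as None.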
Does anyone have a working example they could share that includes their folder structure?
Hi @christopherfeld, I have created a Google Colab notebook showing how to run this toy example. Please have a look here: https://colab.research.google.com/drive/1Wis4WugIvpnSAc7F7HGBkB38lGvNHTtX?usp=sharing. I hope this helps :)
Hi,
I have created a custom corpus.jsonl in the format as instructed. I was able to install the gpl library successfully on a Mac.
I use the following piece of code:
import gpl

dataset = 'fiqa'
gpl.train(
    path_to_generated_data=f"generated/{dataset}",
    base_ckpt="distilbert-base-uncased",
    # base_ckpt='GPL/msmarco-distilbert-margin-mse',
)
I have changed the following path:
path_to_generated_data=f"generated/{dataset}", where I am adding the path to my custom data, corpus.jsonl.
When I run this file, I get the following error:
train() missing 2 required positional arguments: "mnrl_output_dir" and "mnrl_evaluation_output"
My purpose here is to do domain adaptation for questions in the form of sentences, for a semantic search task.
Please let me know what the exact steps are to train on custom data.