ArvinZhuang / DSI-QG

The official repository for "Bridging the Gap Between Indexing and Retrieval for Differentiable Search Index with Query Generation", Shengyao Zhuang, Houxing Ren, Linjun Shou, Jian Pei, Ming Gong, Guido Zuccon and Daxin Jiang.
MIT License

About the data split of NQ 320K #15

Open hi-i-m-GTooth opened 4 months ago

hi-i-m-GTooth commented 4 months ago

Hi, Dr. Zhuang.

Thanks again for your contribution. I've successfully run some experiments on MS MARCO with DSI-QG. Next, I plan to run experiments on the NQ dataset. The only thing I want to confirm is:

NQ's Hugging Face page says the data split is train: 307373 | dev: 7830. These values are quite close to the numbers you mention in your work: "The NQ 320k dataset has ≈307k training query-document pairs and ≈8k dev query-document pairs."

May I treat 307373 as ≈307k and 7830 as ≈8k? (By the way, I am also curious about where the number dev: 6980 in the MSMARCO-100K dataset comes from.) Thanks for your confirmation in advance!
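
For reference, here is a minimal sketch of checking those split sizes from the Hugging Face dataset metadata without downloading the data (assuming the natural_questions dataset card is the one referred to above; the exact loading call may vary with the datasets version):

    # Minimal sketch: inspect NQ split sizes from the Hugging Face metadata only.
    # The dataset id "natural_questions" is an assumption about the card in question.
    from datasets import load_dataset_builder

    builder = load_dataset_builder("natural_questions")
    for name, split in builder.info.splits.items():
        print(name, split.num_examples)  # expected: train 307373, validation 7830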

ArvinZhuang commented 4 months ago

To be honest, I'm also not quite sure why they call it NQ320k 😄. For MS MARCO, the 6980-query dev set is actually a subset of the full dev set (the full dev set is very big). This subset is used for MS MARCO leaderboard submissions, so people just use it for papers. Check out this web page: https://microsoft.github.io/msmarco/Datasets.html

hi-i-m-GTooth commented 4 months ago

Oh, I see. I guess it's just a rounded number XD. So may I confirm that you use the same split as NQ's Hugging Face page describes?

And for MS MARCO: thanks for the explanation; it helped me understand your data processing further! But I am still curious about how you chose the number 6980. Thanks again; you respond as fast as usual!

ArvinZhuang commented 4 months ago

yeah, I just use the NQ dataset from huggingface.

I did not decide on the number 6980; the official MS MARCO passage dev-small set has 6980 queries, so I just used all of them by default.

hi-i-m-GTooth commented 4 months ago

Thanks for the confirmation!! But I only found dev rather than dev-small; did you mean you keep the dev rows that also appear in train-small? (screenshot)

ArvinZhuang commented 4 months ago

Is it not in the Queries link? I don't remember exactly where to find it, but it is the official dev set. Maybe you can use ir_datasets to download it: https://ir-datasets.com/msmarco-passage.html#msmarco-passage
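
For reference, a minimal sketch of loading the official dev-small queries via ir_datasets (the dataset id is the one listed on the linked page; the rest is illustrative):

    # Minimal sketch: count and iterate the MS MARCO passage dev-small queries.
    import ir_datasets

    dataset = ir_datasets.load("msmarco-passage/dev/small")
    print(dataset.queries_count())        # should report 6980
    for query in dataset.queries_iter():  # GenericQuery(query_id, text)
        print(query.query_id, query.text)
        break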

ArvinZhuang commented 4 months ago

It is probably in this link: collectionandqueries.tar.gz

hi-i-m-GTooth commented 4 months ago

Thanks for the patient explanation and the links. I finally saw the 6.8K in the links XD

hi-i-m-GTooth commented 3 months ago

Hi, Dr. Zhuang.

Sorry to bother you again, and I hope you are doing well!

Based on your information, I set --train_num to 307373 and --eval_num to 7830 and tried to reproduce the experiment on NQ-320K. The preprocessing code is modified from your old repo (I modified it so that it follows the logic/format of the new preprocessing code).

However, I couldn't reach the 82.36 score for t5-base; I only reached 58.24. I've checked the training pairs and they look normal with the QG checkpoint mentioned in #12 . I also noticed that a document in NQ can be longer than max_seq_len for a transformer-based encoder. I would like to know whether you did any other preprocessing on the training and dev data. Thanks in advance.

Appendix

Below is the script I tried to reproduce the experiment on NQ-320K (with single A6000):

CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 python3 run.py \
                --task "DSI" \
                --model_name "google-t5/t5-base" \
                --run_name "NQ-320k-baseline-t5-base-DSI-QG" \
                --max_length 32 \
                --train_file temp/nq_corpus.tsv.320k.q10.docTquery \
                --valid_file data/nq_data/320k/nq_DSI_dev_data.json \
                --output_dir "models/NQ-320k-t5-base-DSI-QG(baseline)" \
                --learning_rate 0.0005 \
                --warmup_steps 100000 \
                --per_device_train_batch_size 128 \
                --per_device_eval_batch_size 128 \
                --evaluation_strategy steps \
                --eval_steps 1000 \
                --max_steps 1000000 \
                --save_strategy steps \
                --dataloader_num_workers 10 \
                --save_steps 1000 \
                --save_total_limit 2 \
                --load_best_model_at_end \
                --gradient_accumulation_steps 2 \
                --report_to wandb \
                --logging_steps 100 \
                --dataloader_drop_last False \
                --metric_for_best_model Hits@10 \
                --greater_is_better True \
                --remove_prompt True

And here are the score and loss curves from the dashboard: (screenshots)

ArvinZhuang commented 3 months ago

It seems you were generating 10 queries per document? Maybe try increasing the number to 50.
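
For anyone following along, a rough sketch of sampling multiple pseudo-queries per document with a T5 query-generation checkpoint; the model name and decoding settings below are illustrative assumptions, not the exact configuration used in this repo:

    # Rough sketch: sample 50 pseudo-queries for one document with top-k sampling.
    # The checkpoint is an assumed docT5query-style model, not necessarily the one used here.
    import torch
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    model_name = "castorini/doc2query-t5-base-msmarco"
    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)

    doc = "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris."
    inputs = tokenizer(doc, return_tensors="pt", truncation=True, max_length=512)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=32,            # maximum generated query length (illustrative)
            do_sample=True,           # sampling gives diverse queries
            top_k=10,
            num_return_sequences=50,  # 50 pseudo-queries per document
        )
    queries = tokenizer.batch_decode(outputs, skip_special_tokens=True)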

hi-i-m-GTooth commented 3 months ago

Thanks for the advice. Before I start another run: do you expect this change to increase performance by 20%, based on your gut feeling? Since my compute resources are limited, I want to run this script more carefully.

ArvinZhuang commented 3 months ago

In my experience, more generated queries is always better, but it will indeed take even longer to converge.

hi-i-m-GTooth commented 3 months ago

Alright ... This is life ... 😢

hi-i-m-GTooth commented 3 months ago

Hi, Dr. Zhuang.

I've finished training with the query generation number set to 50. Compared to Hits@10 = 58.24 with 10 generated queries, it improves to Hits@10 = 69.12. However, it still doesn't reach Hits@10 = 82.36. Is it normal or expected for Hits@10 to only reach 69.12 in this setting?

If it is normal, should I try:

  1. Increasing the query generation number to 100
  2. The CE reranker with m = 50 (but may I ask for the training args you used for the NQ dataset? As for the training data, I think I can generate it in DPR format myself; see the sketch after this message for what the filtering step could look like.)

Thanks for the passionate reply!
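
A rough sketch of what that cross-encoder filtering could look like, assuming a sentence-transformers CrossEncoder checkpoint (both the checkpoint name and m are illustrative, not the repo's exact configuration):

    # Rough sketch: score generated queries against their source doc and keep the top m.
    from sentence_transformers import CrossEncoder

    ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed checkpoint

    def filter_queries(doc_text, generated_queries, m=50):
        # Higher cross-encoder score = query judged more relevant to the doc.
        scores = ce.predict([(q, doc_text) for q in generated_queries])
        ranked = sorted(zip(generated_queries, scores), key=lambda x: x[1], reverse=True)
        return [q for q, _ in ranked[:m]]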

ArvinZhuang commented 3 months ago

How many steps have you trained for? Maybe go ahead with 100 queries.

hi-i-m-GTooth commented 3 months ago

I've trained for about 300k steps. (screenshot)

So Dr. Zhuang thinks I should try a query generation number of 100 without the CE reranker? Will the CE reranker have a significant influence in this situation?

ArvinZhuang commented 3 months ago

There is no need to use the CE reranker when using 100 queries.

hi-i-m-GTooth commented 3 months ago

OK, I'll try 100 queries first. Thanks for the confirmation! I have one more question: what --max_length did you use to generate queries on the NQ dataset?

Thanks again!

yuxiang-guo commented 1 month ago

@hi-i-m-GTooth Hi GTooth.

Thanks for showing your reproduced result.

I have tried running the code on another, smaller dataset than NQ320k, but after 80k steps Hits@1 only reaches 0.03, and I don't know why.

Your figure shows that after 50k training steps, Hits@1 already reaches 0.2. Running 50k steps means just 2 or 3 epochs, since there are about 228k unique documents in NQ320K, and if you generate 10 queries for each doc, there will be 11 * 228k training samples in total (the original doc plus 10 queries). Is my understanding correct?

Thanks for your confirmation in advance!

ArvinZhuang commented 1 month ago

Hi @yuxiang-guo, in my code I actually did not use the original doc text at all, so that will be 10 * 228k training samples.
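
As a rough sanity check on the epoch count, assuming each optimizer step consumes per_device_train_batch_size * gradient_accumulation_steps samples (values taken from the script posted above):

    # Back-of-the-envelope epoch estimate for the numbers discussed in this thread.
    num_docs = 228_000
    queries_per_doc = 10                        # the original doc text is not used as a sample
    train_samples = num_docs * queries_per_doc  # ~2.28M training samples

    effective_batch = 128 * 2                   # per_device_train_batch_size * gradient_accumulation_steps
    steps = 50_000
    epochs = steps * effective_batch / train_samples
    print(f"{epochs:.1f} epochs")               # ~5.6 epochs under these assumptions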

If you are getting very low scores, I suggest you have a look at the generated queries. Do they look correct?

hi-i-m-GTooth commented 1 month ago

Hi @yuxiang-guo ,

@ArvinZhuang already confirmed it XD.

But I want to add: in my experience, batch_size is also a critical factor. You could also try increasing batch_size or gradient_accumulation_steps.

yuxiang-guo commented 1 month ago

Hi @ArvinZhuang

Thanks for your reply!

I checked my generated queries. They look quite reasonable, but they differ from the queries in the test set. The docs in my dataset are semantically richer, so each doc can correspond to many diverse queries. If the generated queries for a doc are not the same as the true query in the test set, would the accuracy be significantly affected?

yuxiang-guo commented 1 month ago

Hi @hi-i-m-GTooth

Thanks for your reply and suggestions! I will try it.