Closed by xipq 12 months ago
Hi @xipq, thank you for your interest. The running time depends on hyperparameters such as batch size. Could you share your full hyperparameters and a screenshot of the output when you run the command (which also displays the time spent)? I'll look into it and see whether this is expected.
Hi, thanks for your reply. The command I've used was:
method="channel-metaicl"
task="hr_to_lr"
out_dir="checkpoints/${method}/${task}"
checkpoint="checkpoints/${method}/${task}/model-30000.pt"
seed=100,13,21,42,87
bs=16
CUDA_VISIBLE_DEVICES=0 python test.py \
--task $task --k 16 --split test --seed $seed \
--use_demonstrations \
--test_batch_size $bs \
--method channel \
--checkpoint $checkpoint \
--out_dir $out_dir
The test ended as:
08/04/2023 15:21:42 - INFO - __main__ - checkpoints/channel-metaicl/hr_to_lr/tweet_eval-stance_feminist-test-channel-k=16-s=87.pkl
08/04/2023 15:21:42 - INFO - __main__ - torch.Size([201, 1024])
08/04/2023 15:22:22 - INFO - __main__ - Accuracy=0.4399466933200067
08/04/2023 15:22:22 - INFO - __main__ - Macro-F1 of hr_to_lr over 26 target tasks: 44.7
Sorry, I didn't keep the full console logs from the evaluation, so here is the ls -lrt output for the generated .pkl and .txt files, with their timestamps:
-rw-r--r-- 1 5012 Aug 4 03:37 'quarel-test-channel-k=16-s=100.pkl'
-rw-r--r-- 1 3729 Aug 4 03:37 'quarel-test-channel-k=16-s=100.txt'
-rw-r--r-- 1 12241 Aug 4 03:41 'financial_phrasebank-test-channel-k=16-s=100.pkl'
-rw-r--r-- 1 3770 Aug 4 03:41 'financial_phrasebank-test-channel-k=16-s=100.txt'
-rw-r--r-- 1 18010 Aug 4 03:52 'openbookqa-test-channel-k=16-s=100.pkl'
-rw-r--r-- 1 9936 Aug 4 03:52 'openbookqa-test-channel-k=16-s=100.txt'
-rw-r--r-- 1 20028 Aug 4 04:00 'codah-test-channel-k=16-s=100.pkl'
-rw-r--r-- 1 17444 Aug 4 04:00 'codah-test-channel-k=16-s=100.txt'
-rw-r--r-- 1 66694 Aug 4 04:25 'qasc-test-channel-k=16-s=100.pkl'
-rw-r--r-- 1 10961 Aug 4 04:25 'qasc-test-channel-k=16-s=100.txt'
-rw-r--r-- 1 7352 Aug 4 04:28 'glue-mrpc-test-channel-k=16-s=100.pkl'
-rw-r--r-- 1 5228 Aug 4 04:28 'glue-mrpc-test-channel-k=16-s=100.txt'
-rw-r--r-- 1 55100 Aug 4 04:49 'dream-test-channel-k=16-s=100.pkl'
-rw-r--r-- 1 47166 Aug 4 04:49 'dream-test-channel-k=16-s=100.txt'
-rw-r--r-- 1 13375 Aug 4 04:54 'sick-test-channel-k=16-s=100.pkl'
-rw-r--r-- 1 6222 Aug 4 04:54 'sick-test-channel-k=16-s=100.txt'
-rw-r--r-- 1 54965 Aug 4 05:14 'commonsense_qa-test-channel-k=16-s=100.pkl'
-rw-r--r-- 1 13127 Aug 4 05:14 'commonsense_qa-test-channel-k=16-s=100.txt'
-rw-r--r-- 1 10990 Aug 4 05:18 'medical_questions_pairs-test-channel-k=16-s=100.pkl'
-rw-r--r-- 1 5645 Aug 4 05:18 'medical_questions_pairs-test-channel-k=16-s=100.txt'
-rw-r--r-- 1 6920 Aug 4 05:21 'quartz-with_knowledge-test-channel-k=16-s=100.pkl'
-rw-r--r-- 1 3894 Aug 4 05:21 'quartz-with_knowledge-test-channel-k=16-s=100.txt'
-rw-r--r-- 1 2843 Aug 4 05:22 'poem_sentiment-test-channel-k=16-s=100.pkl'
-rw-r--r-- 1 1015 Aug 4 05:22 'poem_sentiment-test-channel-k=16-s=100.txt'
-rw-r--r-- 1 6920 Aug 4 05:25 'quartz-no_knowledge-test-channel-k=16-s=100.pkl'
-rw-r--r-- 1 3896 Aug 4 05:25 'quartz-no_knowledge-test-channel-k=16-s=100.txt'
...
-rw-r--r-- 1 6920 Aug 4 14:46 'quartz-no_knowledge-test-channel-k=16-s=87.pkl'
-rw-r--r-- 1 3927 Aug 4 14:46 'quartz-no_knowledge-test-channel-k=16-s=87.txt'
-rw-r--r-- 1 1286 Aug 4 14:47 'glue-wnli-test-channel-k=16-s=87.pkl'
-rw-r--r-- 1 869 Aug 4 14:47 'glue-wnli-test-channel-k=16-s=87.txt'
-rw-r--r-- 1 11062 Aug 4 14:51 'climate_fever-test-channel-k=16-s=87.pkl'
-rw-r--r-- 1 3438 Aug 4 14:51 'climate_fever-test-channel-k=16-s=87.txt'
-rw-r--r-- 1 1574 Aug 4 14:51 'ethos-national_origin-test-channel-k=16-s=87.pkl'
-rw-r--r-- 1 503 Aug 4 14:51 'ethos-national_origin-test-channel-k=16-s=87.txt'
-rw-r--r-- 1 1574 Aug 4 14:52 'ethos-race-test-channel-k=16-s=87.pkl'
-rw-r--r-- 1 501 Aug 4 14:52 'ethos-race-test-channel-k=16-s=87.txt'
-rw-r--r-- 1 1574 Aug 4 14:52 'ethos-religion-test-channel-k=16-s=87.pkl'
-rw-r--r-- 1 502 Aug 4 14:52 'ethos-religion-test-channel-k=16-s=87.txt'
-rw-r--r-- 1 10756 Aug 4 14:57 'ai2_arc-test-channel-k=16-s=87.pkl'
-rw-r--r-- 1 9153 Aug 4 14:57 'ai2_arc-test-channel-k=16-s=87.txt'
-rw-r--r-- 1 38554 Aug 4 15:11 'hate_speech18-test-channel-k=16-s=87.pkl'
-rw-r--r-- 1 13505 Aug 4 15:11 'hate_speech18-test-channel-k=16-s=87.txt'
-rw-r--r-- 1 4994 Aug 4 15:13 'glue-rte-test-channel-k=16-s=87.pkl'
-rw-r--r-- 1 3623 Aug 4 15:13 'glue-rte-test-channel-k=16-s=87.txt'
-rw-r--r-- 1 1520 Aug 4 15:13 'superglue-cb-test-channel-k=16-s=87.pkl'
-rw-r--r-- 1 598 Aug 4 15:13 'superglue-cb-test-channel-k=16-s=87.txt'
-rw-r--r-- 1 3716 Aug 4 15:14 'superglue-copa-test-channel-k=16-s=87.txt'
-rw-r--r-- 1 1808 Aug 4 15:14 'superglue-copa-test-channel-k=16-s=87.pkl'
-rw-r--r-- 1 17992 Aug 4 15:21 'tweet_eval-hate-test-channel-k=16-s=87.pkl'
-rw-r--r-- 1 7119 Aug 4 15:21 'tweet_eval-hate-test-channel-k=16-s=87.txt'
-rw-r--r-- 1 340 Aug 4 15:21 'tweet_eval-stance_atheism-test-channel-k=16-s=87.txt'
-rw-r--r-- 1 1412 Aug 4 15:21 'tweet_eval-stance_atheism-test-channel-k=16-s=87.pkl'
-rw-r--r-- 1 1817 Aug 4 15:22 'tweet_eval-stance_feminist-test-channel-k=16-s=87.pkl'
-rw-r--r-- 1 435 Aug 4 15:22 'tweet_eval-stance_feminist-test-channel-k=16-s=87.txt'
Thanks in advance!
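A rough per-dataset running time can be read off a listing like the one above by taking the gap between consecutive .pkl timestamps. The following is a minimal sketch, not part of the repository: the function name elapsed_minutes and the regular expression are hypothetical, the year 2023 is taken from the log lines above, and it assumes English month abbreviations and that the files were written one after another by a single run. ls only shows minutes, so the gaps are accurate to about a minute, and the first file has no predecessor to compare against.

```python
import re
from datetime import datetime

# Matches the tail of an `ls -lrt` line: month, day, HH:MM, optionally quoted file name.
LINE_RE = re.compile(r"(\w{3}) +(\d+) +(\d{2}:\d{2}) +'?([^']+?)'?$")


def elapsed_minutes(listing, year=2023):
    """Return (file name, minutes since the previous .pkl file) in listing order."""
    rows = []
    for line in listing.strip().splitlines():
        match = LINE_RE.search(line.strip())
        if not match or not match.group(4).endswith(".pkl"):
            continue  # skip .txt files and anything that is not an ls entry
        month, day, hhmm, name = match.groups()
        # ls omits the year, so it is passed in as an assumption.
        stamp = datetime.strptime(f"{year} {month} {day} {hhmm}", "%Y %b %d %H:%M")
        rows.append((name, stamp))
    return [
        (name, (stamp - prev_stamp).total_seconds() / 60)
        for (_, prev_stamp), (name, stamp) in zip(rows, rows[1:])
    ]


# A few lines copied from the listing above.
sample = """
-rw-r--r-- 1 5012 Aug 4 03:37 'quarel-test-channel-k=16-s=100.pkl'
-rw-r--r-- 1 3729 Aug 4 03:37 'quarel-test-channel-k=16-s=100.txt'
-rw-r--r-- 1 12241 Aug 4 03:41 'financial_phrasebank-test-channel-k=16-s=100.pkl'
-rw-r--r-- 1 18010 Aug 4 03:52 'openbookqa-test-channel-k=16-s=100.pkl'
"""

for name, minutes in elapsed_minutes(sample):
    print(f"{minutes:5.0f} min  {name}")
```

Each gap is attributed to the file that ends it, so the first value is the time between quarel finishing and financial_phrasebank finishing.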
Hi, I'm sorry for the late reply; for some reason I only saw this now.
Hmm, based on the timestamps of the files created, the time being spent looks reasonable to me. The only odd thing is the big gap between quartz-no_knowledge-test-channel-k=16-s=100 and quartz-no_knowledge-test-channel-k=16-s=87, which might be due to external reasons? Some datasets are definitely larger and take more time, e.g., the dream dataset. In that case, it might be OK to just exclude those datasets (for preliminary experiments at least). Also, if you have multiple GPUs and would like to parallelize experiments for a speed-up, you can specify the dataset names as arguments and run them in parallel.
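As a sketch of the parallelization suggested above, the snippet below builds one command line per dataset and assigns GPUs round-robin. It only builds strings and does not launch anything. Several things here are assumptions rather than confirmed details of the repository: the helper build_commands is hypothetical, and the --dataset flag is assumed to be how test.py accepts a single dataset name (check python test.py --help before relying on it). All other flags and paths are copied from the command earlier in this thread.

```python
import shlex


def build_commands(datasets, gpus, seed="100,13,21,42,87", bs=16,
                   method="channel-metaicl", task="hr_to_lr"):
    """Return one shell command per dataset, cycling through the given GPU ids."""
    out_dir = f"checkpoints/{method}/{task}"
    checkpoint = f"{out_dir}/model-30000.pt"
    commands = []
    for i, dataset in enumerate(datasets):
        gpu = gpus[i % len(gpus)]  # round-robin GPU assignment
        argv = [
            "python", "test.py",
            "--dataset", dataset,  # ASSUMPTION: flag name not confirmed in this thread
            "--k", "16", "--split", "test", "--seed", seed,
            "--use_demonstrations",
            "--test_batch_size", str(bs),
            "--method", "channel",
            "--checkpoint", checkpoint,
            "--out_dir", out_dir,
        ]
        commands.append(f"CUDA_VISIBLE_DEVICES={gpu} " + shlex.join(argv))
    return commands


# Dataset names taken from the file listing in this thread; two GPUs assumed.
for command in build_commands(["dream", "qasc", "commonsense_qa"], gpus=[0, 1]):
    print(command)
```

Commands that share a GPU id would typically be run one after another (for example, one shell script per GPU), while the scripts for different GPUs run at the same time.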
Hi, I would like to know how long inference testing on a single task (e.g. metaicl or channel-metaicl on non_qa_to_qa) should take. I followed the provided command, but experiments on a single config lasted for hours on a V100. Is this abnormal? Thanks.