albertqjiang / INT

Official code for paper: INT: An Inequality Benchmark for Evaluating Generalization in Theorem Proving
MIT License

Correct execution of reproducing the experiments in the paper? #18

Open fzyzcjy opened 1 year ago

fzyzcjy commented 1 year ago

@albertqjiang Hi, thanks for releasing the code with the paper! What is the correct command to trigger training that reproduces the accuracy numbers in the figures (e.g. Figure 2, left)?

I have tried the command in the training section of the README, but the results look quite different from those reported in the paper. For example, training accuracy soon becomes very high (~94%), as can be seen in the following (collapsed) code block, contrary to roughly 75% in Figure 2, left.

```
... generate datasets
#Generated problems: 50
#Generated problems: 100
...
#Generated problems: 1000
prob 1 fail!
prob 2 success!
prob 3 success!
...
prob 42 fail!
...
prob 77 fail!
...
prob 85 fail!
prob 86 fail!
...
prob 95 fail!
...
prob 100 success!
...
```
(repeated `prob N success!` lines elided; of the first 100 problems, only probs 1, 42, 77, 85, 86, and 95 failed)
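As an aside, the generation success/fail counts in a log like this can be tallied with `grep`. A minimal sketch — `run.log` is a hypothetical filename standing in for saved output (e.g. captured via `tee`); the heredoc below just creates a tiny sample log for illustration:

```shell
# Create a small sample log (in practice, point at your own saved log file).
cat > run.log <<'EOF'
prob 1 fail!
prob 2 success!
prob 3 success!
EOF

# Count lines reporting success vs. failure.
successes=$(grep -c 'success!' run.log)
failures=$(grep -c 'fail!' run.log)
echo "success rate: $successes / $((successes + failures))"
```
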

I will try to play with the code a bit more, but wanted to open this issue first as quick feedback that the README (or the code?) may not be fully consistent with the paper :/

albertqjiang commented 1 year ago

Is this k=3 l=7?

fzyzcjy commented 1 year ago

Here is the command:

```
CUDA_VISIBLE_DEVICES=2 python -m int_environment.algos.main --combo_path data/benchmark/field --dump /data/jychen/misc/int_code_results/pt_models/ --online --train_sets k\=5_l\=5 --test_sets k\=5_l\=5 --epochs_per_online_dataset 10 --num_probs 1000 --lr 1e-4 --updates 1000000 --transform_gt --degree 0 --seed 0 --epoch_per_case_record 200
```

So I guess it is k=5, l=5.

Basically, it is the command from README.md, but with full argument names instead of the abbreviated ones.

I am using https://github.com/fzyzcjy/INT (code diff: https://github.com/albertqjiang/INT/pull/15/files), but IMHO there are no semantic changes, only some logging, etc.

albertqjiang commented 1 year ago

Ah I think there could be two issues: (1) I used a larger number of axiom combinations and orders in the paper than shown in the example (otherwise the example would be too slow).

(2) Figure 2 covers both equalities and inequalities, so one needs to generate the ordered_field axiom combinations and orders, not the field ones.

i.e., when generating the axiom combinations and orders, try:

```
python -m int_environment.data_generation.combos_and_orders --combo_path data/benchmark/ordered_field --max_k 5 --max_l 5 --trial 1000000
```

This should take ~20 minutes if memory serves.

fzyzcjy commented 1 year ago

Thank you for the information!

So I will try the following commands and report the results once they finish :)

```
python -m int_environment.data_generation.combos_and_orders --combo_path /data/jychen/misc/int_code_results/data_benchmark_k5_l5/ordered_field --max_k 5 --max_l 5 --trial 1000000
```

```
CUDA_VISIBLE_DEVICES=2 python -m int_environment.algos.main --combo_path /data/jychen/misc/int_code_results/data_benchmark_k5_l5/ordered_field --dump /data/jychen/misc/int_code_results/pt_models/ --online --train_sets k\=5_l\=5 --test_sets k\=5_l\=5 --epochs_per_online_dataset 10 --num_probs 1000 --lr 1e-4 --updates 1000000 --transform_gt --degree 0 --seed 0 --epoch_per_case_record 200 | tee /data/jychen/misc/int_code_results/logs/$(date +%s).log
```
fzyzcjy commented 1 year ago

In addition, I would appreciate knowing how to reproduce the "Transformer" results. IMHO the README does not mention any commands for training a transformer or seq2seq model (apart from data generation).

venom12138 commented 1 year ago

@fzyzcjy Hi, do you know how to reproduce the transformer result now? I'm also interested in it.

fzyzcjy commented 1 year ago

@venom12138 Hi, I did not manage to reproduce it either, and later moved on to other papers.