InternLM / InternLM-Math

Apache License 2.0
354 stars 20 forks source link

Model evaluation on minif2f fails? #22

Open Lagooon opened 1 month ago

Lagooon commented 1 month ago

When I evaluate InternLM2-Math-Plus-7b in minif2f through this code, it fails. The model only generates one line "Here is the predicted next tactic:" without any tactics. If I let the model continue generating until I get a tactic each time, I only get a pass rate 38.1% instead of 43.4%.

objecti0n commented 1 month ago

Do you modify any part of our code during evaluation? What is your LEAN & mathlib version?

objecti0n commented 1 month ago

There is a tokenizer issue in our code. We will update it as soon as possible.

wzj423 commented 1 month ago

Thanks for your reply. We have fixed an issue where the script used incorrect prompts for the model/tokenizer. https://github.com/InternLM/InternLM-Math/pull/23 Now, it can generate tactics properly. As for the pass rate issue, the search time is CPU-dependent, so you can adjust the limit for the number of iterations and running time. Our results are available in output.zip. Unzip the file, and you will see our outputs and the hyperparameters in detail.

Lagooon commented 1 month ago

The code now can generate tactic normally, but I get a pass rate about 36%. I don't think search time has that much of an impact since almost all searches that didn't get stuck completed 100 iterations. I also tried doubling the search time but the performance improvement was <1%. My Lean version is 4.8.0-nightly-2024-04-14. I found that the output.zip was not completely obtained using this code and scripts. For example, there are "state_after" in "sstrace" which will not be the output of this version of the code. Could you provide the version of the code that outputs the output.zip or provide the output file obtained using this code and scripts?

wzj423 commented 1 month ago

I have re-evaluated MiniF2F, and it scored 103 out of 244, which is essentially the same as previously reported. Model performance can vary slightly since inference and search are not fully deterministic. For instance, the problem 'algebra_sqineq_unitcircatbpabsamblt1' was solved in the re-evaluation but not in the initial evaluation. I have also attached my evaluation script, which is almost identical to the script in the repository. One of the few differences is that I noticed vllm had changed their default value for the length_penalty back in February (https://github.com/vllm-project/vllm/pull/2667), so I explicitly set it back to 0 in case you are using a newer version of vllm. As for the "state_after"s in the output.zip, they are just for better pretty printing and make no real difference to the search process. I hope this information is helpful to you. internLM2-plus-7b_minif2f_test.zip

Lagooon commented 1 month ago

I use this proofsearch script and meet an error: "TypeError: init() got an unexpected keyword argument 'additional_imports'". I use lean-dojo==1.1.2 as in requirements.txt. Could you provide the the lean and lean-dojo version?

wzj423 commented 1 month ago

Hi, additional_imports is a feature in newer version of lean-dojo, see ( https://github.com/lean-dojo/LeanDojo/commit/d04c4b4b70251a981b67c51ae843ee30cd7e4bf7 ). We are using lean-dojo 1.7.0

objecti0n commented 1 month ago

Please see https://github.com/InternLM/InternLM-Math/pull/28

Lagooon commented 1 month ago

With lean-dojo 1.7.0, I can't build dataset successfully. Are there any changes in the data files?

  File "/tmp/lean/tmpj08rl4gd/workspace/build_lean4_repo.py", line 200, in <module>
    main()
  File "/tmp/lean/tmpj08rl4gd/workspace/build_lean4_repo.py", line 194, in main
    run_cmd(cmd, capture_output=True)
  File "/tmp/lean/tmpj08rl4gd/workspace/build_lean4_repo.py", line 29, in run_cmd
    res = subprocess.run(cmd, shell=True, capture_output=capture_output, check=True)
  File "/home/lean2/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'lake env lean --threads 32 --run ExtractData.lean' returned non-zero exit status 1.
objecti0n commented 1 month ago

The scripts fetch MiniF2F from [https://github.com/rah4927/lean-dojo-mew], which uses leanprover/lean4:nightly-2023-08-19. To use newer versions of Lean4, slight modifications may be needed. We provide a version of MiniF2F that works with leanprover/lean4:v4.7.0 at [https://github.com/wzj423/lean-dojo-mew]. To use a custom source for the MiniF2F dataset, change the url and commit fields in the minif2f/data/minif2f.jsonl file.

Lagooon commented 1 month ago

ExtractData still does not work properly. There are some warning logs

warning: manifest out of date: git url of dependency 'mathlib' changed; use `lake update mathlib` to update it
warning: manifest out of date: git revision of dependency 'mathlib' changed; use `lake update mathlib` to update it
warning: manifest out of date: git url of dependency 'aesop' changed; use `lake update aesop` to update it
warning: manifest out of date: git revision of dependency 'aesop' changed; use `lake update aesop` to update it
......
2024-06-09 04:19:01.078 | WARNING  | __main__:check_files:132 - Missing /tmp/lean/tmpe_7nk7u_/workspace/lean-dojo-mew/.lake/packages/proofwidgets/.lake/build/ir/ProofWidgets/Demos/InteractiveSvg.ast.json
2024-06-09 04:19:01.078 | WARNING  | __main__:check_files:132 - Missing /tmp/lean/tmpe_7nk7u_/workspace/lean-dojo-mew/.lake/packages/proofwidgets/.lake/build/ir/ProofWidgets/Demos/Euclidean.ast.json
......
2024-06-09 04:19:01.078 | WARNING  | __main__:check_files:132 - Missing /tmp/lean/tmpe_7nk7u_/workspace/lean-dojo-mew/.lake/packages/proofwidgets/.lake/build/ir/ProofWidgets/Demos/Euclidean.ast.json
2024-06-09 04:19:01.079 | WARNING  | __main__:check_files:132 - Missing /tmp/lean/tmpe_7nk7u_/workspace/lean-dojo-mew/.lake/packages/proofwidgets/.lake/build/ir/ProofWidgets/Demos/Rubiks.dep_paths
2024-06-09 04:19:01.079 | WARNING  | __main__:check_files:132 - Missing /tmp/lean/tmpe_7nk7u_/workspace/lean-dojo-mew/.lake/packages/proofwidgets/.lake/build/ir/ProofWidgets/Demos/RbTree.dep_paths

And then all Dojo(theorems) will end with EOFError.

wzj423 commented 1 month ago

I have rerun the tracing process, and it runs smoothly with logs as follows:

./././MiniF2F/Validation.lean:1133:8: warning: declaration uses 'sorry'
./././MiniF2F/Validation.lean:1138:36: warning: unused variable `h₀` [linter.unusedVariables]
[1558/1560] Building MiniF2F.All
[1559/1560] Building MiniF2F
2024-06-09 10:11:25.493 | INFO     | __main__:main:193 - Tracing lean-dojo-mew
2024-06-09 10:11:25.497 | DEBUG    | __main__:main:198 - lake env lean --threads 30 --run ExtractData.lean noDeps
  0%|                                                                                                                                                   | 0/5 [00:05<?, ?it/s]
  0%|                                                                                                                                                   | 0/5 [00:10<?, ?it/s]
 60%|███████████████████████████████████████████████████████████████████████████████████▍                                                       | 3/5 [00:20<00:03,  1.67s/it]2024-06-09 10:11:55.302 | DEBUG    | lean_dojo.data_extraction.traced_data:from_traced_files:1520 - Lost files: [], 5 of 5 files succesfully traced.
2024-06-09 10:11:55.303 | DEBUG    | lean_dojo.data_extraction.traced_data:from_traced_files:1524 - Parsing 5 *.ast.json files in /tmp/tmpcgd93umm/lean-dojo-mew with 29 workers
2024-06-09 10:12:00,364 INFO worker.py:1715 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
                                                                                                                                                                             (pid=443414) 2024-06-09 10:12:03.774 | DEBUG    | lean_dojo.data_extraction.lean:<module>:42 - Using GitHub personal access token for authentication     | 0/5 [00:00<?, ?it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:10<00:00,  2.08s/it]
(pid=443432) 2024-06-09 10:12:04.061 | DEBUG    | lean_dojo.data_extraction.lean:<module>:42 - Using GitHub personal access token for authentication [repeated 28x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
2024-06-09 10:12:13.333 | DEBUG    | lean_dojo.data_extraction.traced_data:save_to_disk:1573 - Saving 5 traced XML files to /tmp/tmpcgd93umm/lean-dojo-mew with 29 workers
2024-06-09 10:12:19,196 INFO worker.py:1715 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
                                                                                                                                                                             (pid=448406) 2024-06-09 10:12:21.714 | DEBUG    | lean_dojo.data_extraction.lean:<module>:42 - Using GitHub personal access token for authentication     | 0/5 [00:00<?, ?it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:05<00:00,  1.15s/it]
(pid=448430) 2024-06-09 10:12:22.216 | DEBUG    | lean_dojo.data_extraction.lean:<module>:42 - Using GitHub personal access token for authentication [repeated 28x across cluster]

2024-06-09 10:21:00.108 | DEBUG    | lean_dojo.interaction.dojo:__enter__:196 - Copying repo in /cpfs01/user/XXXXXXXX/slow_home/miniF2f_autograde/.cache/lean_dojo/wzj423-lean-dojo-mew-1ef4e4cac9dd370b7be6d648ce135a06aa6fce5f/lean-dojo-mew to /tmp/tmpzj5ypt_i/lean-dojo-mew
2024-06-09 10:30:42.687 | DEBUG    | lean_dojo.interaction.dojo:__enter__:212 - Repo copyed!
2024-06-09 10:30:48.117 | DEBUG    | lean_dojo.interaction.dojo:_modify_file:390 - Modifying MiniF2F/Test.lean
2024-06-09 10:30:48.199 | DEBUG    | lean_dojo.interaction.dojo:__enter__:231 - Launching the proof using <class 'lean_dojo.container.NativeContainer'>
2024-06-09 10:30:53.057 | DEBUG    | lean_dojo.container:run:184 - lake build Lean4Repl
2024-06-09 10:31:29.556 | DEBUG    | lean_dojo.container:run_interactive:214 - lake env lean --threads=1 --memory=32768 MiniF2F/Test.lean
2024-06-09 10:31:30.326 | DEBUG    | lean_dojo.interaction.dojo:_read_next_line:557 - warning: manifest out of date: git url of dependency 'mathlib' changed; use `lake update mathlib` to update it
2024-06-09 10:31:30.326 | DEBUG    | lean_dojo.interaction.dojo:_read_next_line:557 - warning: manifest out of date: git revision of dependency 'mathlib' changed; use `lake update mathlib` to update it
2024-06-09 10:31:30.326 | DEBUG    | lean_dojo.interaction.dojo:_read_next_line:557 - warning: manifest out of date: git url of dependency 'aesop' changed; use `lake update aesop` to update it
2024-06-09 10:31:30.326 | DEBUG    | lean_dojo.interaction.dojo:_read_next_line:557 - warning: manifest out of date: git revision of dependency 'aesop' changed; use `lake update aesop` to update it
2024-06-09 10:31:33.511 | DEBUG    | lean_dojo.interaction.dojo:_read_next_line:557 - REPL> {"tacticState": "b h v : ℝ\nh₀ : 0 < b ∧ 0 < h ∧ 0 < v\nh₁ : v = 1 / 3 * (b * h)\nh₂ : b = 30\nh₃ : h = 13 / 2\n⊢ v = 65", "sid": 0, "error": null}
                                                                                                                                                                             2024-06-09 10:31:38.600 | DEBUG    | lean_dojo.interaction.dojo:_submit_request:514 - Request: {"sid": 0, "cmd": "rw [h₁, h₂, h₃]"}                    | 0/100 [00:00<?, ?it/s]
2024-06-09 10:31:38.605 | DEBUG    | lean_dojo.interaction.dojo:_read_next_line:557 - REPL> {"tacticState": "b h v : ℝ\nh₀ : 0 < b ∧ 0 < h ∧ 0 < v\nh₁ : v = 1 / 3 * (b * h)\nh₂ : b = 30\nh₃ : h = 13 / 2\n⊢ 1 / 3 * (30 * (13 / 2)) = 65", "sid": 1, "error": null}
2024-06-09 10:31:38.605 | DEBUG    | lean_dojo.interaction.dojo:_submit_request:514 - Request: {"sid": 0, "cmd": "field_simp [h₀.1, h₀.2.1, h₀.2.2] at h₁ ⊢"}
......
2024-06-09 10:31:41.063 | DEBUG    | lean_dojo.interaction.dojo:_submit_request:514 - Request: {"sid": 0, "cmd": "linarith"}
2024-06-09 10:31:41.119 | DEBUG    | lean_dojo.interaction.dojo:_read_next_line:557 - REPL> {"tacticState": null, "sid": null, "error": "linarith failed to find a contradiction\ncase h1.h\nb h v : ℝ\nh₀ : 0 < b ∧ 0 < h ∧ 0 < v\nh₁ : v = 1 / 3 * (b * h)\nh₂ : b = 30\nh₃ : h = 13 / 2\na✝ : v < 65\n⊢ False\nfailed"}

I suspect that this is a lean-dojo issue, which is beyond the scope of this issue. However, I personally have some suggestions: lean-dojo's ExtractData.lean is known to occasionally cause some OOM issues, see [ https://github.com/lean-dojo/LeanDojo/issues/64 ]. However, one doesn't need to trace any of the dependencies (you are tracing ProofWidgets in your log, which is not needed) to use the MiniF2F dataset. One HACK is to explicitly tell lean-dojo NOT to trace the dependencies, which means changing

# Copy and `cd` into the repo.
            traced_repo_path = get_traced_repo_path(self.repo)

to

# Copy and `cd` into the repo.
            traced_repo_path = get_traced_repo_path(self.repo, build_deps = False)

at [https://github.com/lean-dojo/LeanDojo/blob/4d61bdd2f2b6d5ec85050798b3efc2e1234b6960/src/lean_dojo/interaction/dojo.py#L175 ], which might relieve the memory burden. (It is weird that the author of lean-dojo does not expose this API.)

Lagooon commented 1 month ago

It still don't work. There are the logs.

2024-06-09 18:45:11.876 | DEBUG    | lean_dojo.interaction.dojo:_read_next_line:525 - MiniF2F/Test.lean:1:0: error: object file './.lake/packages/mathlib/.lake/build/lib/Mathlib/Tactic.olean' of module Mathlib.Tactic does not exist
2024-06-09 18:45:11.877 | DEBUG    | lean_dojo.interaction.dojo:_read_next_line:525 - MiniF2F/Test.lean:13:5: error: unknown namespace 'BigOperators'
2024-06-09 18:45:11.877 | DEBUG    | lean_dojo.interaction.dojo:_read_next_line:525 - MiniF2F/Test.lean:15:5: error: unknown namespace 'Nat'
2024-06-09 18:45:11.877 | DEBUG    | lean_dojo.interaction.dojo:_read_next_line:525 - MiniF2F/Test.lean:17:5: error: unknown namespace 'Real'
2024-06-09 18:45:11.877 | DEBUG    | lean_dojo.interaction.dojo:_read_next_line:525 - MiniF2F/Test.lean:19:5: error: unknown namespace 'Rat'
2024-06-09 18:45:11.877 | DEBUG    | lean_dojo.interaction.dojo:_read_next_line:525 - MiniF2F/Test.lean:21:46: error: expected token                                                                                                            2024-06-09 18:45:11.877 | DEBUG    | lean_dojo.interaction.dojo:_read_next_line:525 - MiniF2F/Test.lean:26:49: error: unexpected token '*'; expected ':=', 'where' or '|'                                                                       2024-06-09 18:45:11.877 | DEBUG    | lean_dojo.interaction.dojo:_read_next_line:525 - MiniF2F/Test.lean:30:43: error: expected token
......

These errors will also appear when build_deps = True. Also, this doesn't look like an OOM error. I've monitored the memory consumption, it was <32GB and I have 256GB.

objecti0n commented 3 weeks ago

We cannot reproduce your problem for now. Could you provide more possible details.

Lagooon commented 3 weeks ago

I use this code and just modify DATA="data/minif2f-lean4.7.0.jsonl" and VERBOSE="1" in eval_internLM2-plus_7b.sh. I reproduced this error in various platforms. This is the full logs: internLM2-plus-7b_shard0.txt

objecti0n commented 1 week ago

I use this code and just modify DATA="data/minif2f-lean4.7.0.jsonl" and VERBOSE="1" in eval_internLM2-plus_7b.sh. I reproduced this error in various platforms. This is the full logs: internLM2-plus-7b_shard0.txt

We are working on reproduce and fix this problem, will reply to you as soon as possible.

Lagooon commented 1 week ago

I use this code and just modify DATA="data/minif2f-lean4.7.0.jsonl" and VERBOSE="1" in eval_internLM2-plus_7b.sh. I reproduced this error in various platforms. This is the full logs: internLM2-plus-7b_shard0.txt

We are working on reproduce and fix this problem, will reply to you as soon as possible.

It can work properly with repo https://github.com/yangky11/miniF2F-lean4. Maybe this info will help.

objecti0n commented 1 week ago

I use this code and just modify DATA="data/minif2f-lean4.7.0.jsonl" and VERBOSE="1" in eval_internLM2-plus_7b.sh. I reproduced this error in various platforms. This is the full logs: internLM2-plus-7b_shard0.txt

We are working on reproduce and fix this problem, will reply to you as soon as possible.

It can work properly with repo https://github.com/yangky11/miniF2F-lean4. Maybe this info will help.

What is your reproduced performance by using this repo?