NVlabs / DoRA

[ICML2024 (Oral)] Official PyTorch implementation of DoRA: Weight-Decomposed Low-Rank Adaptation
https://nbasyl.github.io/DoRA-project-page/

Cannot reproduce the results of Llama7B dora_r32. #14

Open xiaoshingshing2 opened 3 months ago

xiaoshingshing2 commented 3 months ago

First of all, evaluating with the official checkpoint works fine: my BoolQ result is 69.63, while the official result is 69.7.

However, when I try to reproduce the results by fine-tuning myself, I run into two problems.

The first problem concerns LLaMA-7B dora_r32 without dora_simple. I change three settings in llama_7B_Dora.sh: micro_batch_size from 16 to 4, learning_rate from 2e-4 to 1e-4, and I add `--dora_simple False` so that dora_simple is not used. I then run `sh llama_7B_Dora.sh 32 64 ./finetuned_result/dora_r32 0`, and the results are:

| BoolQ | PIQA | SIQA | HellaSwag | WinoGrande | ARC-e | ARC-c | OBQA | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 69.3 | 78.9 | 78.3 | 54.3 | 80.0 | 82.6 | 66.1 | 81.0 | 73.8 |

These are worse than the official results.

The second problem is that when I remove `--dora_simple False` (i.e., train with dora_simple enabled, which is only supposed to speed up training), the results are even worse (see the sketch after the table for what I understand dora_simple to do):

| BoolQ | PIQA | SIQA | HellaSwag | WinoGrande | ARC-e | ARC-c | OBQA | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 32.9 | 75.5 | 71.8 | 9.9 | 41.3 | 81.9 | 66.3 | 75.8 | 56.9 |
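
For context, here is my understanding of what `--dora_simple` controls, as a rough PyTorch sketch. This is not the repo's code, and the class and parameter names are made up for illustration; my assumption is that dora_simple simply detaches the per-row weight norm from the autograd graph so it is treated as a constant during backpropagation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DoRALinearSketch(nn.Module):
    """Hypothetical illustration of a DoRA linear layer (not the repo's code).

    W' = m * (W0 + B A) / ||W0 + B A||, with the norm taken per output row.
    With dora_simple=True the norm is detached from the autograd graph,
    which is cheaper but gives slightly different gradients than the
    full decomposition.
    """

    def __init__(self, base: nn.Linear, r: int = 32, dora_simple: bool = True):
        super().__init__()
        out_f, in_f = base.weight.shape
        # Frozen pretrained weight W0 (the LoRA scaling alpha/r is omitted for brevity).
        self.weight = nn.Parameter(base.weight.detach().clone(), requires_grad=False)
        self.lora_A = nn.Parameter(torch.randn(r, in_f) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_f, r))
        # Magnitude vector m, initialised to the norms of W0 so W' == W0 at step 0.
        self.magnitude = nn.Parameter(self.weight.norm(p=2, dim=1, keepdim=True))
        self.dora_simple = dora_simple

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        merged = self.weight + self.lora_B @ self.lora_A   # W0 + BA
        norm = merged.norm(p=2, dim=1, keepdim=True)        # ||W0 + BA||
        if self.dora_simple:
            norm = norm.detach()                             # treat the norm as a constant
        return F.linear(x, self.magnitude * merged / norm)
```

If that reading is right, the flag should mainly trade a little gradient fidelity for memory and speed, so I did not expect it to cause such a large drop.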
xiaoshingshing2 commented 3 months ago

Here are the training log and adapter config from the run with `--dora_simple False`: trainer_state.json, adapter_config.json

nbasyl commented 3 months ago

Did you install all the packages following requirements.txt?

xiaoshingshing2 commented 3 months ago

Hi, I did not install bitsandbytes, and my PyTorch version is 2.1.0. The transformers package was installed with `pip install transformers==4.36.0`. The other packages are the same as in requirements.txt.

Could that hurt performance?

The packages I use are listed below:

| Package | Version |
| --- | --- |
| accelerate | 0.25.0 |
| aiofiles | 23.2.1 |
| aiohttp | 3.9.5 |
| aiosignal | 1.3.1 |
| altair | 5.3.0 |
| annotated-types | 0.7.0 |
| anyio | 4.4.0 |
| appdirs | 1.4.4 |
| asttokens | 2.4.1 |
| async-timeout | 4.0.3 |
| attrs | 23.2.0 |
| black | 23.12.0 |
| certifi | 2024.6.2 |
| charset-normalizer | 3.3.2 |
| click | 8.1.7 |
| cmake | 3.29.6 |
| contourpy | 1.2.1 |
| cycler | 0.12.1 |
| datasets | 2.15.0 |
| decorator | 5.1.1 |
| dill | 0.3.7 |
| dnspython | 2.6.1 |
| email_validator | 2.2.0 |
| exceptiongroup | 1.2.1 |
| executing | 2.0.1 |
| fastapi | 0.111.0 |
| fastapi-cli | 0.0.4 |
| ffmpy | 0.3.2 |
| filelock | 3.15.4 |
| fire | 0.5.0 |
| fonttools | 4.53.0 |
| frozenlist | 1.4.1 |
| fsspec | 2023.10.0 |
| gradio | 4.9.0 |
| gradio_client | 0.7.2 |
| h11 | 0.14.0 |
| httpcore | 1.0.5 |
| httptools | 0.6.1 |
| httpx | 0.27.0 |
| huggingface-hub | 0.23.4 |
| idna | 3.7 |
| importlib_resources | 6.4.0 |
| ipython | 8.25.0 |
| jedi | 0.19.1 |
| Jinja2 | 3.1.4 |
| jsonschema | 4.22.0 |
| jsonschema-specifications | 2023.12.1 |
| kiwisolver | 1.4.5 |
| lit | 18.1.8 |
| markdown-it-py | 3.0.0 |
| MarkupSafe | 2.1.5 |
| matplotlib | 3.9.0 |
| matplotlib-inline | 0.1.7 |
| mdurl | 0.1.2 |
| mpmath | 1.3.0 |
| multidict | 6.0.5 |
| multiprocess | 0.70.15 |
| mypy-extensions | 1.0.0 |
| networkx | 3.3 |
| numpy | 1.26.4 |
| orjson | 3.10.5 |
| packaging | 24.1 |
| pandas | 2.2.2 |
| parso | 0.8.4 |
| pathspec | 0.12.1 |
| pexpect | 4.9.0 |
| pillow | 10.3.0 |
| pip | 24.0 |
| platformdirs | 4.2.2 |
| prompt_toolkit | 3.0.47 |
| protobuf | 5.27.2 |
| psutil | 6.0.0 |
| ptyprocess | 0.7.0 |
| pure-eval | 0.2.2 |
| pyarrow | 16.1.0 |
| pyarrow-hotfix | 0.6 |
| pydantic | 2.7.4 |
| pydantic_core | 2.18.4 |
| pydub | 0.25.1 |
| Pygments | 2.18.0 |
| pyparsing | 3.1.2 |
| python-dateutil | 2.9.0.post0 |
| python-dotenv | 1.0.1 |
| python-multipart | 0.0.9 |
| pytorch-triton-rocm | 2.1.0 |
| pytz | 2024.1 |
| PyYAML | 6.0.1 |
| referencing | 0.35.1 |
| regex | 2024.5.15 |
| requests | 2.32.3 |
| rich | 13.7.1 |
| rpds-py | 0.18.1 |
| safetensors | 0.4.3 |
| scipy | 1.11.4 |
| semantic-version | 2.10.0 |
| sentencepiece | 0.1.99 |
| setuptools | 69.5.1 |
| shellingham | 1.5.4 |
| six | 1.16.0 |
| sniffio | 1.3.1 |
| stack-data | 0.6.3 |
| starlette | 0.37.2 |
| sympy | 1.12.1 |
| termcolor | 2.4.0 |
| tokenize-rt | 5.2.0 |
| tokenizers | 0.15.2 |
| tomli | 2.0.1 |
| tomlkit | 0.12.0 |
| toolz | 0.12.1 |
| torch | 2.1.0+rocm5.6 |
| tqdm | 4.66.4 |
| traitlets | 5.14.3 |
| transformers | 4.36.0 |
| typer | 0.12.3 |
| typing_extensions | 4.12.2 |
| tzdata | 2024.1 |
| ujson | 5.10.0 |
| urllib3 | 2.2.2 |
| uvicorn | 0.30.1 |
| uvloop | 0.19.0 |
| watchfiles | 0.22.0 |
| wcwidth | 0.2.13 |
| websockets | 11.0.3 |
| wheel | 0.43.0 |
| xxhash | 3.4.1 |
| yarl | 1.9.4 |
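
For completeness, here is the small snippet I use to dump the versions that seem most likely to matter. The package names are just my guess at the relevant ones (e.g. peft refers to whichever peft the repo expects); `importlib.metadata` is in the standard library for Python >= 3.8:

```python
# Print the installed versions of packages most likely to affect training.
from importlib.metadata import PackageNotFoundError, version

for pkg in ["torch", "transformers", "tokenizers", "peft", "accelerate", "datasets", "bitsandbytes"]:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```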

I am now trying to use exactly the same packages as in requirements.txt, and will update my results when the fine-tuning and evaluation finish.

xiaoshingshing2 commented 3 months ago

I now use exactly the packages from requirements.txt, and the results are:

With dora_simple:

| BoolQ | PIQA | SIQA | HellaSwag | WinoGrande | ARC-e | ARC-c | OBQA | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 69.1 | 82.8 | 78.8 | 86.2 | 81.0 | 82.1 | 66.1 | 79.2 | 78.2 |

Without dora_simple:

| BoolQ | PIQA | SIQA | HellaSwag | WinoGrande | ARC-e | ARC-c | OBQA | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 68.7 | 83.3 | 79.4 | 85.5 | 81.3 | 80.8 | 66.0 | 78.8 | 78.0 |

This is a 0.4% average-accuracy gap from the results reported in the README.

xiaoshingshing2 commented 2 months ago

New update:

Using exactly the packages in requirements.txt, the results for r = 8 and r = 16 still show a large gap from the results reported in the README, while the results for r = 4 and r = 64 are better than reported and the result for r = 32 is roughly equal.

Average acc:

| r | original | reproduced |
| --- | --- | --- |
| 4 | 61.9 | 65.2 |
| 8 | 77.9 | 72.5 |
| 16 | 77.5 | 62.7 |
| 32 | 78.4 | 78.2 |
| 64 | 76.8 | 77.9 |

Is this a normal result?

zhanqiqi77 commented 2 months ago

@xiaoshingshing2 I have encountered a similar issue. Did you manage to resolve it? Could you provide your package versions?

xiaoshingshing2 commented 2 months ago

> @xiaoshingshing2 I have encountered a similar issue. Did you manage to resolve it? Could you provide your package versions?

As in my latest update above, I used exactly the packages and versions from requirements.txt, and I still see the problem.