allenai / OLMo

Modeling, training, eval, and inference code for OLMo
https://allenai.org/olmo
Apache License 2.0

1B generates whitespace after a specific amount of fine-tuning #515

Closed KaiserWhoLearns closed 5 months ago

KaiserWhoLearns commented 6 months ago

🐛 Describe the bug

Thanks for making the model available! I was trying to fine-tune OLMo-1B with the Open-Instruct code base. However, after fine-tuning on a certain number of instances (or for a certain number of epochs), the model starts to generate nothing (""), while the un-fine-tuned model behaves normally during generation and achieves reasonable performance on the datasets I am using.

Below is a test of fine-tuning on a randomly sampled XSUM dataset.

| Instances fine-tuned | 1 epoch | 3 epochs | 5 epochs |
| --- | --- | --- | --- |
| 100 | Generating A | Generating B | Generating, but repeating colons |
| 500 | Not generating | - | - |
| 1000 | Not generating | - | - |

An example of input (prompt)-gold (completion)-output for the first row is included here: https://gist.github.com/KaiserWhoLearns/fe5260b08878f2cfb7e40e42a2239afa

Is it an issue with the prompt format?

I tried two different prompt formats (they differ only in whether there is a space after the final colon):

"### Input: blabla bla, some article.\"\n ### Summary: "
"### Input: blabla bla, some article.\"\n ### Summary:"

When the model is trained with the first format, it generates nothing (""), as shown in the table above. When the model is trained with the second format, the issue from the table occurs: it simply repeats colons ("::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::").
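
To see concretely what the trailing space changes, here is a minimal sketch of my own (not from the original report; it assumes the allenai/OLMo-1B tokenizer loads through transformers, and trust_remote_code may be required with ai2-olmo 0.2.4) that compares how the two prompt endings tokenize:

# Sketch: compare the tokenization of the two prompt formats with the OLMo-1B tokenizer.
# Assumption: trust_remote_code is needed for the ai2-olmo tokenizer wrapper.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("allenai/OLMo-1B", trust_remote_code=True)

with_space = "### Input: some article.\n ### Summary: "   # format 1 (trailing space)
no_space   = "### Input: some article.\n ### Summary:"    # format 2 (no trailing space)

# Compare the final tokens of each prompt; a trailing space can make the prompt
# end on a different token than a bare colon, which changes what the model is
# conditioned on at generation time.
print(tok.tokenize(with_space)[-3:])
print(tok.tokenize(no_space)[-3:])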

Could it be a result of bad hyperparameters?

I tried learning rates of 2e-5 and 2e-3; the behavior persists.

Is the training code working at all?

I tried the exact same training code with LLaMA, which generates normally and shows that fine-tuning brings a performance improvement over the non-fine-tuned version.

Other information

Datasets I've tried: XSUM and SocialIQA, both with no luck. The training curve looks normal: https://wandb.ai/kaisersunhk/open_instruct/runs/rraheptp?nw=nwuserkaisersunhk

Code to reproduce the issue

The code is based on the Open-Instruct repository, with modifications to try the different prompt formats. The commands to reproduce fine-tuning and generation are:

export data_file_name=xsum
export model_name=allenai/OLMo-1B
export epoch=1
export output_folder_name=olmo1b_${data_file_name}_${epoch}epoch
# train_file is referenced below; it should point to the sampled fine-tuning data
# (path assumed, not given in the original command)
export train_file=data/${data_file_name}.jsonl

NUM_GPUS=1
BATCH_SIZE_PER_GPU=1
TOTAL_BATCH_SIZE=8
# accumulate gradients so the effective batch size equals TOTAL_BATCH_SIZE
GRADIENT_ACC_STEPS=$(($TOTAL_BATCH_SIZE/$NUM_GPUS/$BATCH_SIZE_PER_GPU))

accelerate launch \
    --mixed_precision bf16 \
    --num_machines 1 \
    --num_processes $NUM_GPUS \
    --use_deepspeed \
    --deepspeed_config_file ds_configs/stage3_no_offloading_accelerate.conf \
    open_instruct/finetune.py \
    --model_name_or_path ${model_name} \
    --train_file ${train_file} \
    --max_seq_length 2048 \
    --add_bos \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size $BATCH_SIZE_PER_GPU \
    --gradient_accumulation_steps $GRADIENT_ACC_STEPS \
    --learning_rate 2e-7 \
    --lr_scheduler_type linear \
    --warmup_ratio 0.03 \
    --weight_decay 0. \
    --num_train_epochs $epoch \
    --output_dir ${output_folder_name} \
    --with_tracking \
    --report_to wandb \
    --logging_steps 1
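
As a quick sanity check independent of eval.predict, something like the following can confirm whether the model itself produces only whitespace (a sketch; it assumes the fine-tuned checkpoint loads with AutoModelForCausalLM, which with ai2-olmo 0.2.4 may require importing hf_olmo or passing trust_remote_code=True, and the checkpoint path is illustrative):

# Sketch: generate directly from the fine-tuned checkpoint to check for
# whitespace-only output. Paths and prompt are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "output/olmo1b_xsum_1epoch"  # assumed to match ${output_folder_name} above
tok = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.bfloat16,
                                             trust_remote_code=True)

prompt = "### Input: some article.\n ### Summary: "
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=60, do_sample=False)
# repr() makes whitespace-only or colon-only generations easy to spot
print(repr(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)))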

export base_dir="code dir"  # placeholder for the local Open-Instruct code directory
export data_file_name=socialiqa_test_data
export model_name=olmo1b_socialiqa_1epoch
export output_file_name="${data_file_name}_${model_name}"

export input_file_name="${base_dir}/data/evaluation/${data_file_name}.jsonl"
export output_file="${base_dir}/output/predictions/${output_file_name}.jsonl"
export model_path="${base_dir}/output/${model_name}"

python -m eval.predict \
        --model_name_or_path $model_path \
        --input_files $input_file_name \
        --max_new_tokens 60 \
        --output_file $output_file 
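
To quantify the failure across a whole prediction file, a small script like this can count empty or whitespace-only generations (a sketch; the "prediction" field name is an assumption about the eval.predict output schema, so adjust it to whatever key the JSONL actually uses):

# Sketch: count how many generations in the output JSONL are empty or whitespace-only.
import json

path = "output/predictions/socialiqa_test_data_olmo1b_socialiqa_1epoch.jsonl"
n_total = n_blank = 0
with open(path) as f:
    for line in f:
        rec = json.loads(line)
        pred = rec.get("prediction", "")  # field name assumed; check the actual schema
        n_total += 1
        if not pred.strip():
            n_blank += 1
print(f"{n_blank}/{n_total} generations are empty or whitespace-only")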

Versions

Python 3.11.0 absl-py==2.1.0 accelerate==0.27.2 ai2-olmo==0.2.4 aiofiles==23.2.1 aiohttp==3.9.3 aiosignal==1.3.1 alpaca_eval==0.5.3 altair==5.2.0 antlr4-python3-runtime==4.9.3 anyio==4.3.0 appdirs==1.4.4 attrs==23.2.0 auto-gptq==0.6.0 bitsandbytes==0.42.0 blinker==1.7.0 boto3==1.34.57 botocore==1.34.57 cached_path==1.6.2 cachetools==5.3.3 certifi==2024.2.2 charset-normalizer==3.3.2 click==8.1.7 cmake==3.28.3 contourpy==1.2.0 cycler==0.12.1 datasets==2.14.7 deepspeed==0.13.5 dill==0.3.7 distro==1.9.0 docker-pycreds==0.4.0 einops==0.7.0 et-xmlfile==1.1.0 evaluate==0.4.1 fastapi==0.110.0 ffmpy==0.3.2 filelock==3.13.1 fire==0.5.0 flash-attn==2.2.2 Flask==3.0.2 fonttools==4.49.0 frozenlist==1.4.1 fsspec==2023.10.0 gekko==1.0.7 gitdb==4.0.11 GitPython==3.1.42 google-api-core==2.17.1 google-auth==2.28.1 google-cloud-core==2.4.1 google-cloud-storage==2.15.0 google-crc32c==1.5.0 google-resumable-media==2.7.0 googleapis-common-protos==1.62.0 gradio==3.50.2 gradio_client==0.6.1 grpcio==1.62.0 h11==0.14.0 hjson==3.1.0 httpcore==1.0.4 httptools==0.6.1 httpx==0.27.0 huggingface-hub==0.21.4 idna==3.6 importlib_resources==6.1.2 itsdangerous==2.1.2 Jinja2==3.1.3 jmespath==1.0.1 joblib==1.3.2 jsonlines==4.0.0 jsonschema==4.21.1 jsonschema-specifications==2023.12.1 kiwisolver==1.4.5 lit==17.0.6 Markdown==3.5.2 markdown-it-py==3.0.0 MarkupSafe==2.1.5 matplotlib==3.8.3 mdurl==0.1.2 mpmath==1.3.0 msgpack==1.0.8 multidict==6.0.5 multiprocess==0.70.15 networkx==3.2.1 ninja==1.11.1.1 nltk==3.8.1 numpy==1.26.4 nvidia-cublas-cu11==11.10.3.66 nvidia-cuda-cupti-cu11==11.7.101 nvidia-cuda-nvrtc-cu11==11.7.99 nvidia-cuda-runtime-cu11==11.7.99 nvidia-cudnn-cu11==8.5.0.96 nvidia-cufft-cu11==10.9.0.58 nvidia-curand-cu11==10.2.10.91 nvidia-cusolver-cu11==11.4.0.1 nvidia-cusparse-cu11==11.7.4.91 nvidia-nccl-cu11==2.14.3 nvidia-nvtx-cu11==11.7.91 omegaconf==2.3.0 openai==1.13.3 openpyxl==3.1.2 orjson==3.9.15 packaging==23.2 pandas==2.2.1 peft==0.9.0 pillow==10.2.0 protobuf==4.25.3 psutil==5.9.8 py-cpuinfo==9.0.0 pyarrow==15.0.0 pyarrow-hotfix==0.6 pyasn1==0.5.1 pyasn1-modules==0.3.0 pydantic==1.10.14 pydub==0.25.1 Pygments==2.17.2 pynvml==11.5.0 pyparsing==3.1.2 python-dateutil==2.9.0.post0 python-dotenv==1.0.1 python-multipart==0.0.9 pytz==2024.1 PyYAML==6.0.1 ray==2.9.3 referencing==0.33.0 regex==2023.12.25 requests==2.31.0 responses==0.18.0 rich==13.7.1 rouge==1.0.1 rouge-score==0.1.2 rpds-py==0.18.0 rsa==4.9 s3transfer==0.10.0 safetensors==0.4.2 scipy==1.12.0 semantic-version==2.10.0 sentencepiece==0.2.0 sentry-sdk==1.40.6 setproctitle==1.3.3 six==1.16.0 smmap==5.0.1 sniffio==1.3.1 starlette==0.36.3 sympy==1.12 tensorboard==2.16.2 tensorboard-data-server==0.7.2 termcolor==2.4.0 tiktoken==0.6.0 tokenizers==0.15.2 toolz==0.12.1 torch==2.0.1 torchaudio==2.0.2 torchvision==0.15.2 tqdm==4.66.2 transformers==4.38.2 triton==2.0.0 typing_extensions==4.10.0 tzdata==2024.1 unidic-lite==1.0.8 urllib3==2.0.7 uvicorn==0.27.1 uvloop==0.19.0 vllm==0.2.1.post1 wandb==0.16.4 watchfiles==0.21.0 websockets==11.0.3 Werkzeug==3.0.1 xformers==0.0.22 xxhash==3.4.1 yarl==1.9.4

natolambert commented 6 months ago

I think this may be more of an issue for open-instruct, but it is also quite normal for degenerate behaviors to appear during fine-tuning. Trying new datasets, new parameters, etc., is normal. It's good feedback on the model, but I'm not sure it's relevant here. And yes, we'll keep improving the models.