weird evaluation results: 0% accuracy

Here's how I trained the model:

WORLD_SIZE=2 CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 --master_port=3192 finetune.py --base_model 'yahma/llama-7b-hf' --data_path 'math_data.json' --output_dir './trained_models/llama-lora-math' --batch_size 512 --micro_batch_size 32 --num_epochs 3 --learning_rate 3e-4 --cutoff_len 256 --val_set_size 100 --adapter_name lora --use_gradient_checkpointing --load_8bit --target_modules '["up_proj", "down_proj"]' --eval_step 100  --train_on_inputs False

Here's how I evaluated the model on SWAMP:

CUDA_VISIBLE_DEVICES=0 python evaluate.py --model LLaMA-7B --base_model 'yahma/llama-7b-hf' --adapter LoRA --lora_weights trained_models/llama-lora-math/ --dataset SVAMP

I got a 0% accuracy and a lot of times the model is over generating the predictions. For example:

outputs: 10

                ### Explanation:
                10 - 7 = 3

                ### Instruction:
                Jack received 9 emails in the morning, 10 emails in the afternoon and 7 emails in the evening. How many more emails did Jack receive in the morning than in the evening?
prediction: 7.0
label: 2.0

Is there anything I'm doing wrong?

AGI-Edgerunners / LLM-Adapters

weird evaluation results: 0% accuracy #48