clinicalml / co-llm


Issue with Running deferral_generate.sh Script #5

Open · hechengbo-H opened 1 month ago

hechengbo-H commented 1 month ago

When I run the bash scripts/evaluation/gsm8k/deferral_generate.sh script, I keep seeing the message "Checking again..." repeatedly. I am not sure what this means or how to resolve it. Could you please help me understand what is causing this and how to fix it?

[screenshot]

hechengbo-H commented 1 month ago

When I run deferral_generate.sh, a file seems to be missing. [screenshot]

lolipopshock commented 1 month ago

> When I run the bash scripts/evaluation/gsm8k/deferral_generate.sh script, I keep seeing the message "Checking again..." repeatedly. I am not sure what this means or how to resolve it.

This means the eval code is waiting for the vllm server to start.
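
If it helps, you can verify by hand whether the server is actually reachable. A minimal sketch, assuming the /health route and the ports used elsewhere in this thread (adjust to your setup):

# Poll the server's health endpoint until it responds; the "Checking again..."
# messages should stop once this succeeds.
until curl -sf http://localhost:9600/health; do
  echo "Server on port 9600 is not up yet, retrying..."
  sleep 5
done
echo "Server on port 9600 is up."

If the loop never exits, check the vllm server's log for startup errors (for example, out-of-memory or a wrong model path).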

> When I run deferral_generate.sh, a file seems to be missing.

Oh yes, there is an intermediate evaluation step you need to run (after running deferral generation at the different deferral thresholds):

python collm/eval.py eval_folder \
    --task_name gsm8k \
    --orig_data_path static_eval-completion/gsm-0shot/validation \
    --pred_data_folder checkpoints/generate/gsm8k-finetuned/gsm-0shot/gsm8k-def-finetuned-Llama-2-7b+Llama-2-70b/_deferral_search

As for static_eval-completion/gsm-0shot/validation, you might want to run formatting.py to get the folder.

python collm/dataset/formatting.py --dataset-name gsm --output-path static_eval-completion --output-name gsm-0shot

(You might want to modify the following line to make it 0-shot.) https://github.com/clinicalml/co-llm/blob/d83ab8cf19ceb49c7814383a897089e51cdeebfe/collm/dataset/formatting.py#L1329
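
A hedged sketch of that edit, assuming the setting on that line is an n_shot=1 argument (as the follow-up below suggests); please check the actual file before changing it:

sed -n '1329p' collm/dataset/formatting.py                     # inspect the referenced line first
sed -i '1329s/n_shot=1/n_shot=0/' collm/dataset/formatting.py  # assumed change: 1-shot -> 0-shot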

I'll add this to the main README in a bit. Hope that helps!

hechengbo-H commented 1 month ago

I followed your instructions, but the issue remains unresolved.

First, I ran the following command (after changing n_shot=1 -> n_shot=0):

python collm/dataset/formatting.py --dataset-name gsm --output-path static_eval-completion --output-name gsm-0shot

The result was: [screenshot]

Next, I executed:

python collm/eval.py eval_folder --task_name gsm8k --orig_data_path static_eval-completion/gsm-0shot/validation --pred_data_folder /space/hecb/co-llm-main/checkpoints/generate/gsm8k-finetuned/gsm-0shot/gsm8k-def-finetuned-Llama-2-7b+EleutherAI@llemma_7b/_deferral_search

The result was: [screenshot]

Finally, I ran:

bash scripts/evaluation/gsm8k/deferral_generate.sh

The result was: [screenshot]

Can you help me identify what might be going wrong?

lolipopshock commented 1 month ago

For the 2nd step,

> python collm/eval.py eval_folder --task_name gsm8k --orig_data_path static_eval-completion/gsm-0shot/validation --pred_data_folder /space/hecb/co-llm-main/checkpoints/generate/gsm8k-finetuned/gsm-0shot/gsm8k-def-finetuned-Llama-2-7b+EleutherAI@llemma_7b/_deferral_search

You need to run deferral search (see https://github.com/clinicalml/co-llm/blob/d83ab8cf19ceb49c7814383a897089e51cdeebfe/scripts/evaluation/generic/deferral_generate.sh#L54) so it will create the individual folders in the _deferral_search folder, e.g.,

_deferral_search
|- _defer-0.00
|- _defer-0.10
...

Then running the 2nd-step code will produce the eval_with_deferral_threshold.csv file (in fact, your current output after step 2 in the screenshot is an empty list [], meaning that the target folder does not contain the desired files).
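
As a quick sanity check before re-running step 2 (the path below is the one from the earlier command in this thread; swap in your own), confirm the per-threshold folders actually exist:

ls checkpoints/generate/gsm8k-finetuned/gsm-0shot/gsm8k-def-finetuned-Llama-2-7b+Llama-2-70b/_deferral_search
# expected output, roughly: _defer-0.00  _defer-0.10  ...
# if nothing is listed, run the deferral search step above before collm/eval.py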

hechengbo-H commented 1 month ago

Hi, I've encountered a new issue. When I run the command:

python collm/generate.py deferral_threshold_search \
    --dataset "/space/hecb/co-llm-main/static_eval-completion/gsm-0shot/validation" \
    --save_path "/space/hecb/co-llm-main/checkpoints/generate/gsm8k-finetuned/gsm-0shot/gsm8k-def-finetuned-Llama-2-7b+EleutherAI@llemma_7b/" \
    --num_proc 8 --max_tokens 512 --base_model_port 9601 --ref_model_port 9600 --batch_gen_port 8003 \
    --n_samples_for_search 15 --tokens_per_call 1 \
    --tokenizer_name_ref "/space/hecb/co-llm-main/weights/EleutherAI@llemma_7b" \
    --tokenizer_name_base "/space/hecb/co-llm-main/checkpoints/deferral_finetuned/gsm8k-completion_EleutherAI@llemma_7b/Llama-2-7b-hf~64bz" \
    --debug

I get the following error: [screenshot]

Then I modified the code as follows: [screenshot]

After the modification, the issue became this: [screenshot]

Do you know how to solve this problem? Also, I'm curious about the purpose of accessing this URL.

hechengbo-H commented 1 month ago

After solving the network problem, I encountered a new one. [screenshot]

lolipopshock commented 1 month ago

I think in this case, you might be running co-llm with a model (tokenizer) other than a Llama-based one? I am a bit unsure why you have index 50118 there.
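
One way to narrow this down is to check the vocab size of the tokenizer you are actually serving; a small sketch (the path is the llemma weights path from this thread, so swap in whichever model or checkpoint your servers point at). A Llama-family tokenizer should report roughly 32000, while a value around 50k would suggest an OPT/GPT-2-style tokenizer, which would be consistent with seeing an index like 50118:

# Print the tokenizer's vocab size for the model being served
python -c "from transformers import AutoTokenizer; t = AutoTokenizer.from_pretrained('/space/hecb/co-llm-main/weights/EleutherAI@llemma_7b'); print(len(t))"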

hechengbo-H commented 1 month ago

Hello,

I followed the steps in your readme.md file, but encountered some issues. Here's a detailed breakdown of what I did:

Data Processing

Step 1: Format the dataset

python scripts/train/gsm8k/create_train_data.py

Step 2: Create the training data for deferral training

bash scripts/train/gsm8k/process_dataset.sh

Step 3: Create the initialization dataset

bash scripts/train/gsm8k/create_init_dataset.sh

Then, I ran the following script:

bash scripts/train/gsm8k/deferral_finetune_with_hf_trainer.sh

I modified the files as shown in the screenshot. [screenshot]

Inference Phase

During the inference phase, as per our previous communication, I intended to run the program step-by-step. I created a script /space/hecb/co-llm-main/check/generate.sh to facilitate the generation of the following directory structure:

_deferral_search
|- _defer-0.00
|- _defer-0.10
...

The content of the script is as shown below:

#!/bin/bash
#SBATCH --partition=a6000
#SBATCH --output=/space/hecb/co-llm-main/check/slurm-%j.out
#SBATCH -n 1
#SBATCH --gpus-per-task 8
#SBATCH --cpus-per-task 12
#SBATCH --mem 50g

# Activate the environment
source activate collm-Inference

# Set up the Node.js path
export NVM_DIR="$HOME/.nvm"
source $NVM_DIR/nvm.sh

# Make sure the correct Node.js version is used
nvm use 14

# Start the forward.js server
echo "=== Starting forward.js on port 8003 ==="
node /space/hecb/co-llm-main/forward.js > /space/hecb/co-llm-main/check/forward.log 2>&1 &
sleep 10s

# Check whether the forward.js server started successfully
echo "=== Checking forward.js Health ==="
for i in {1..10}; do
  curl http://localhost:8003/health && break
  echo "Waiting for forward.js on port 8003 to start..."
  sleep 5s
done

if [ $? -ne 0 ]; then
    echo "forward.js server on port 8003 failed to start."
    cat /space/hecb/co-llm-main/check/forward.log
    exit 1
fi

# Start the first API server (api_server_simple.py)
echo "=== Starting API Server Simple on port 9600 ==="
CUDA_VISIBLE_DEVICES=1 \
    python /space/hecb/co-llm-main/collm/inference/api_server_simple.py --host 0.0.0.0 --port 9600 \
    --tensor-parallel-size 1 \
    --model /space/hecb/co-llm-main/weights/EleutherAI@llemma_7b \
    --tokenizer /space/hecb/co-llm-main/weights/EleutherAI@llemma_7b > /space/hecb/co-llm-main/check/api_server_simple.log 2>&1 &
sleep 10s

# Start the second API server (api_server_deferral.py)
echo "=== Starting API Server Deferral on port 9601 ==="
CUDA_VISIBLE_DEVICES=2 \
    python /space/hecb/co-llm-main/collm/inference/api_server_deferral.py --host 0.0.0.0 --port 9601 \
    --tensor-parallel-size 1 \
    --model /space/hecb/co-llm-main/checkpoints/deferral_finetuned/gsm8k-completion_EleutherAI@llemma_7b/Llama-2-7b-hf~64bz \
    --tokenizer /space/hecb/co-llm-main/checkpoints/deferral_finetuned/gsm8k-completion_EleutherAI@llemma_7b/Llama-2-7b-hf~64bz > /space/hecb/co-llm-main/check/api_server_deferral.log 2>&1 &
sleep 10s

# Check whether the API servers started successfully
echo "=== Checking API Server Simple Health ==="
for i in {1..10}; do
  curl http://localhost:9600/health && break
  echo "Waiting for API server on port 9600 to start..."
  sleep 5s
done

if [ $? -ne 0 ]; then
    echo "API server on port 9600 failed to start."
    cat /space/hecb/co-llm-main/check/api_server_simple.log
    exit 1
fi

echo "=== Checking API Server Deferral Health ==="
for i in {1..10}; do
  curl http://localhost:9601/health && break
  echo "Waiting for API server on port 9601 to start..."
  sleep 5s
done

if [ $? -ne 0 ]; then
    echo "API server on port 9601 failed to start."
    cat /space/hecb/co-llm-main/check/api_server_deferral.log
    exit 1
fi

# Run the generate.py script
echo "=== Running generate.py script ==="
CUDA_VISIBLE_DEVICES=3,4,5,6,7 srun python /space/hecb/co-llm-main/collm/generate.py deferral_threshold_search \
  --dataset "/space/hecb/co-llm-main/static_eval-completion/gsm-0shot/validation" \
  --save_path "/space/hecb/co-llm-main/checkpoints/generate/gsm8k-finetuned/gsm-0shot/gsm8k-def-finetuned-Llama-2-7b+EleutherAI@llemma_7b/" \
  --num_proc 8 --max_tokens 512 --base_model_port 9600 --ref_model_port 9601 --batch_gen_port 8003 --n_samples_for_search 15 --tokens_per_call 1 \
  --tokenizer_name_ref "/space/hecb/co-llm-main/weights/EleutherAI@llemma_7b" \
  --tokenizer_name_base "/space/hecb/co-llm-main/checkpoints/deferral_finetuned/gsm8k-completion_EleutherAI@llemma_7b/Llama-2-7b-hf~64bz" \
  --debug

Finally, I ran the following command:

sbatch /space/hecb/co-llm-main/check/generate.sh

Please let me know if there are any mistakes or if I missed any steps.

Thank you!

lolipopshock commented 1 month ago

OK, I think I found the issue:

In your code:

[screenshot]

There are spaces after the \ on some lines, as indicated in the screenshot above. In that case, bash ignores all of the subsequent arguments. The vllm server then falls back to its default and loads the Meta OPT model, which has a vocab size of 50265. That's why the logprobs assignment got messed up and caused the bug.
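
For reference, a minimal illustration of why the trailing space matters, plus a quick way to scan a script for the pattern (the path is the launcher script from this thread):

# With "\" as the very last character, the shell joins the lines into one command:
echo one \
     two        # prints "one two"

# With a space after the backslash, the backslash escapes the space instead of the
# newline, so the next line is parsed as a *separate* command and the remaining
# arguments are silently dropped.

# Scan a script for a backslash followed by trailing whitespace:
grep -nE '\\[[:blank:]]+$' /space/hecb/co-llm-main/check/generate.sh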

I'd suggest checking that in all your commands, or, better yet, using our code as-is to reproduce the results. I also saw that you trained the models with 64bz, which we found gives slightly worse performance in some cases. Finally, the generate.py script itself does not need any GPUs, so you can remove the CUDA_VISIBLE_DEVICES setting at the end.

hechengbo-H commented 1 month ago

Thank you very much for your reply! I think I finally made it through the README! Of course, I still have a few questions. Hahaha

  1. As you mentioned, the models trained with 64bz do not perform well: they produce many instances of "Deferral Threshold: 1.0". Is this normal? [screenshot]

  2. I have some plots, but I can't understand what they mean. Could you help me interpret them? [screenshot] [screenshot]

  3. Are there any plans to open source related work on other datasets?

  4. How was your "demo.mp4" created? It seems like an interesting and straightforward method, but I haven't been able to replicate it myself.

I'm going to go through the code, try to understand it, and work on it a little bit. I really appreciate your response, as it's rare for people to reply so many times.

lolipopshock commented 3 weeks ago

Hey @hechengbo-H ! Sorry for the delayed response -- re your questions:

> As you mentioned, the models trained with 64bz do not perform well: they produce many instances of "Deferral Threshold: 1.0". Is this normal?

I am not sure how you generated this output, but if you use our code, it will produce a csv file that reports the model's accuracy at different deferral thresholds, and it will use the optimal threshold for the test dataset.
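
Once that file exists, you can inspect it directly; a small sketch, assuming it is plain comma-separated text and that you run this from the folder it is written to:

column -s, -t < eval_with_deferral_threshold.csv   # pretty-print the per-threshold table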

> I have some plots, but I can't understand what they mean. Could you help me interpret them?

I think the most important one is the 2nd figure. It is effectively the calibration curve that maps the deferral threshold to the deferral frequency. In your case, there do indeed seem to be some issues with the training, such that the deferral threshold is mostly 0.

> Are there any plans to open source related work on other datasets?

Yes! Please stay tuned and we will release it in the next few weeks.

> How was your "demo.mp4" created? It seems like an interesting and straightforward method, but I haven't been able to replicate it myself.

I used Figma. If you are interested, you might want to check out a class I taught earlier on this topic: https://better-visual.github.io

> I'm going to go through the code, try to understand it, and work on it a little bit. I really appreciate your response, as it's rare for people to reply so many times.

Thank you very much! Yeah, we are going to update some parts of the code per your feedback. Feel free to ask more questions!