FLARE Benchmark in Google Colab: VLLM dependency and testing other models

paveles commented 6 months ago

Dear PIXIU team,

Thank you for your contribution to the open-source community, and congratulations on being accepted to NeurIPS. I am trying to follow the proposed steps to run the FLARE benchmark for the model on a Google Colab T4 instance. Here are the steps:

!git clone https://github.com/chancefocus/PIXIU.git --recursive
!pip install -r PIXIU/requirements.txt
!pip install -e ./PIXIU/src/financial-evaluation[multilingual]
!sh /content/PIXIU/scripts/run_evaluation.sh

where run_evaluation.sh is:

pixiu_path='/content/PIXIU'
export PYTHONPATH="$pixiu_path/src:$pixiu_path/src/financial-evaluation:$pixiu_path/src/metrics/BARTScore"
echo $PYTHONPATH
export CUDA_VISIBLE_DEVICES="0"

python ./PIXIU/src/eval.py \
    --model hf-causal-llama \
    --tasks flare_edtsum,flare_ectsum \
    --model_args use_accelerate=True,pretrained=chancefocus/finma-7b-full,tokenizer=chancefocus/finma-7b-full,use_fast=False,max_gen_toks=1024,dtype=float16 \
    --no_cache \
    --batch_size 4 \
    --model_prompt 'finma_prompt' \
    --num_fewshot 0 \
    --write_out

The output is:

/content/PIXIU/src:/content/PIXIU/src/financial-evaluation:/content/PIXIU/src/metrics/BARTScore
2024-01-21 09:50:38.484367: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-21 09:50:38.484426: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-21 09:50:38.485793: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-21 09:50:39.709652: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Traceback (most recent call last):
  File "/content/./PIXIU/src/eval.py", line 8, in <module>
    import evaluator
  File "/content/PIXIU/src/evaluator.py", line 8, in <module>
    import lm_eval.models
  File "/content/PIXIU/src/financial-evaluation/lm_eval/models/__init__.py", line 4, in <module>
    from . import huggingface
  File "/content/PIXIU/src/financial-evaluation/lm_eval/models/huggingface.py", line 12, in <module>
    from vllm import LLM, SamplingParams
ModuleNotFoundError: No module named 'vllm'
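
The traceback shows that `lm_eval/models/huggingface.py` imports `vllm` at module load time, so the missing package aborts the run before any model is loaded. A minimal pre-flight check could look like the sketch below. It uses only the standard library; the helper name `missing_modules` is hypothetical and not part of PIXIU.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Both module names are mentioned in this thread; extend the list as needed.
    required = ["vllm", "bert_score"]
    missing = missing_modules(required)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Running this in a notebook cell before `run_evaluation.sh` surfaces missing dependencies without waiting for the evaluation script to fail.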

There are several associated questions:

1. The package versions in PIXIU/requirements.txt are not pinned, which will very likely lead to version incompatibilities over time. Moreover, "vllm" is not listed there. Is it possible to pin the versions there? That would improve reproducibility and robustness to future changes.

2. I am trying to evaluate a small TinyLlama model that does not require a large GPU instance. Even after installing vllm (which also changes the versions of some packages), the evaluation fails with an error:

!pip install vllm
!sh /content/PIXIU/scripts/run_evaluation.sh

with run_evaluation.sh:


pixiu_path='/content/PIXIU'
export PYTHONPATH="$pixiu_path/src:$pixiu_path/src/financial-evaluation:$pixiu_path/src/metrics/BARTScore"
echo $PYTHONPATH
export CUDA_VISIBLE_DEVICES="0"

python ./PIXIU/src/eval.py \
    --model hf-causal \
    --tasks flare_australian \
    --model_args pretrained=PY007/TinyLlama-1.1B-Chat-v0.1,dtype="float32" \
    --no_cache

results in:

/content/PIXIU/src:/content/PIXIU/src/financial-evaluation:/content/PIXIU/src/metrics/BARTScore
2024-01-21 10:10:46.733456: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-21 10:10:46.733512: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-21 10:10:46.735055: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-21 10:10:48.082461: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[dynet] random seed: 1234
[dynet] allocating memory: 32MB
[dynet] memory allocation done.
Selected Tasks: ['flare_australian']
Using device 'cuda'
config.json: 100% 652/652 [00:00<00:00, 2.81MB/s]
model.safetensors: 100% 4.40G/4.40G [00:34<00:00, 126MB/s]
generation_config.json: 100% 63.0/63.0 [00:00<00:00, 316kB/s]
tokenizer_config.json: 100% 762/762 [00:00<00:00, 3.84MB/s]
tokenizer.model: 100% 500k/500k [00:00<00:00, 402MB/s]
tokenizer.json: 100% 1.84M/1.84M [00:00<00:00, 3.73MB/s]
added_tokens.json: 100% 21.0/21.0 [00:00<00:00, 87.8kB/s]
special_tokens_map.json: 100% 438/438 [00:00<00:00, 1.78MB/s]
Downloading readme: 100% 641/641 [00:00<00:00, 4.14MB/s]
Downloading data: 100% 65.3k/65.3k [00:02<00:00, 31.2kB/s]
Downloading data: 100% 25.2k/25.2k [00:01<00:00, 14.0kB/s]
Downloading data: 100% 16.8k/16.8k [00:01<00:00, 10.3kB/s]
Generating train split: 100% 482/482 [00:00<00:00, 3767.86 examples/s]
Generating test split: 100% 139/139 [00:00<00:00, 57843.86 examples/s]
Generating valid split: 100% 69/69 [00:00<00:00, 32771.71 examples/s]
Task: flare_australian; number of docs: 139
Task: flare_australian; document 0; context prompt (starting on next line):
Assess the creditworthiness of a customer using the following table attributes for financial status. Respond with either 'good' or 'bad'. And all the table attribute names including 8 categorical attributes and 6 numerical attributes and values have been changed to meaningless symbols to protect confidentiality of the data. For instance, 'The client has attributes: A1: 0, A2: 21.67, A3: 11.5, A4: 1, A5: 5, A6: 3, A7: 0, A8: 1, A9: 1, A10: 11, A11: 1, A12: 2, A13: 0, A14: 1.', should be classified as 'good'.
Text: The client has attributes: A1: 1.0, A2: 18.67, A3: 5.0, A4: 2.0, A5: 11.0, A6: 4.0, A7: 0.375, A8: 1.0, A9: 1.0, A10: 2.0, A11: 0.0, A12: 2.0, A13: 0.0, A14: 39.0.

(end of prompt on previous line)
Requests: Req_greedy_until("Assess the creditworthiness of a customer using the following table attributes for financial status. Respond with either 'good' or 'bad'. And all the table attribute names including 8 categorical attributes and 6 numerical attributes and values have been changed to meaningless symbols to protect confidentiality of the data. For instance, 'The client has attributes: A1: 0, A2: 21.67, A3: 11.5, A4: 1, A5: 5, A6: 3, A7: 0, A8: 1, A9: 1, A10: 11, A11: 1, A12: 2, A13: 0, A14: 1.', should be classified as 'good'. \n Text: The client has attributes: A1: 1.0, A2: 18.67, A3: 5.0, A4: 2.0, A5: 11.0, A6: 4.0, A7: 0.375, A8: 1.0, A9: 1.0, A10: 2.0, A11: 0.0, A12: 2.0, A13: 0.0, A14: 39.0. \n", {'until': None})[None]

Running greedy_until requests
Maximum 0 turns
Running 0th turn
0% 0/139 [00:00<?, ?it/s]Both max_new_tokens (=32) and max_length(=575) seem to have been set. max_new_tokens will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
0% 0/139 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/content/./PIXIU/src/eval.py", line 97, in <module>
    main()
  File "/content/./PIXIU/src/eval.py", line 62, in main
    results = evaluator.simple_evaluate(
  File "/content/PIXIU/src/financial-evaluation/lm_eval/utils.py", line 243, in _wrapper
    return fn(*args, **kwargs)
  File "/content/PIXIU/src/evaluator.py", line 102, in simple_evaluate
    results = evaluate(
  File "/content/PIXIU/src/financial-evaluation/lm_eval/utils.py", line 243, in _wrapper
    return fn(*args, **kwargs)
  File "/content/PIXIU/src/evaluator.py", line 327, in evaluate
    resps = getattr(lm, reqtype)([req.args for req in reqs])
  File "/content/PIXIU/src/financial-evaluation/lm_eval/base.py", line 459, in greedy_until
    for term in until:
TypeError: 'NoneType' object is not iterable


Could you please help with debugging? Providing a replicable example of evaluation of some other simple model would be helpful.

3. Finally, I wonder why the FLARE financial evaluation is implemented as a modified fork of https://github.com/EleutherAI/lm-evaluation-harness. It would be helpful to know what motivated the fork. Would it be possible to integrate the FLARE tests into the original evaluation framework, so that all tests live in one framework?
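
On the `TypeError` in the second question: the logged request carries `{'until': None}`, and the traceback ends at `for term in until:` in `lm_eval/base.py`, so the loop receives `None` instead of a list of stop sequences. The sketch below shows one defensive way to normalize such a value before iterating. Both helpers are hypothetical illustrations, not functions from PIXIU or lm-evaluation-harness, and whether the proper fix belongs in the task definition or in `greedy_until` is left open here.

```python
def normalize_until(until):
    """Coerce a stop-sequence setting into a list of strings.

    None       -> []        (no stop sequences)
    "abc"      -> ["abc"]   (a single stop sequence)
    list/tuple -> list copy
    """
    if until is None:
        return []
    if isinstance(until, str):
        return [until]
    return list(until)


def truncate_at_stops(text, until):
    """Cut generated text at the earliest stop sequence, if one is present."""
    for term in normalize_until(until):
        if term:  # str.split("") raises ValueError, so skip empty terms
            text = text.split(term)[0]
    return text
```

With this normalization, a request whose `until` is `None` simply leaves the generated text untruncated instead of raising.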

Thank you in advance for your help!

jiminHuang commented 5 months ago

@ASCRX Could you please take a look at this issue?

ASCRX commented 5 months ago

Hello paveles:

Yes, we do use a specific version of vllm, namely 0.2.7. vLLM supports most current models. Try the following steps in the Colab environment:

!pip install bert_score
!pip install vllm==0.2.7

Please make sure you have downloaded the BART checkpoint, and check that all required arguments are correctly specified.
@jiminHuang can help with this problem.
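
The pin suggested above (`vllm==0.2.7`) can be checked from inside the notebook before launching an evaluation. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper `check_pins` and its `lookup` parameter are hypothetical, and only the `vllm` pin comes from this thread.

```python
from importlib import metadata


def check_pins(pins, lookup=metadata.version):
    """Compare installed distribution versions against expected pins.

    Returns {name: (expected, found)} for every mismatch; `found` is None
    when the distribution is not installed.
    """
    problems = {}
    for name, expected in pins.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    # Only the vllm pin is taken from this thread.
    for name, (expected, found) in check_pins({"vllm": "0.2.7"}).items():
        print(f"{name}: expected {expected}, found {found}")
```

The same dictionary of pins could be extended once the remaining versions in requirements.txt are fixed.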

jiminHuang commented 1 month ago

Please check our latest notebook https://colab.research.google.com/drive/1ogcCmhMc5lPhUamCk6512H3PJwPEaBZN?usp=sharing. All issues should be addressed.

The-FinAI / PIXIU, issue #46