18907305772 / FuseAI

FuseAI Project
https://huggingface.co/FuseAI

KeyError: 'per_step_logits' when running token_alignment.py #6

Closed ZLKong closed 6 months ago

ZLKong commented 6 months ago

Hi,

I encountered an error when running the data alignment step.

My script is:

# llama_2_7b <-> open_llama_7b_v2
export CUDA_VISIBLE_DEVICES="0"
python -m src.utils.vocab_mapping \
  --base_model_name_or_path /data/mengxin/FuseLLM/Llama-2-7b-hf \
  --blending_model_name_or_path /data/mengxin/FuseLLM/open_llama_7b_v2 \
  --dataset_dir /data/mengxin/FuseLLM/minipile/split/ \
  --vocab_mapping_save_dir /data/mengxin/FuseLLM/vocab_mapping/llama_2_7b_open_llama_7b_v2.json \
  --cache_dir ./cache/ \
  --model_max_length 2048 \
  --vocab_mapping_type "default" \
  --num_process 1

# llama_2_7b <-> mpt_7b
export CUDA_VISIBLE_DEVICES="1"
python -m src.utils.vocab_mapping \
  --base_model_name_or_path /data/mengxin/FuseLLM/Llama-2-7b-hf \
  --blending_model_name_or_path /data/mengxin/FuseLLM/mpt-7b \
  --dataset_dir /data/mengxin/FuseLLM/minipile/split/ \
  --vocab_mapping_save_dir /data/mengxin/FuseLLM/vocab_mapping/llama_2_7b_mpt_7b.json \
  --cache_dir ./cache/ \
  --model_max_length 2048 \
  --vocab_mapping_type "default" \
  --num_process 1

# Align representations from different LLMs.

# llama_2_7b <-> open_llama_7b_v2
export CUDA_VISIBLE_DEVICES="0"
python -m src.utils.token_alignment \
  --base_model_name_or_path /data/mengxin/FuseLLM/Llama-2-7b-hf \
  --blending_model_name_or_path /data/mengxin/FuseLLM/open_llama_7b_v2 \
  --base_dataset_dir /data/mengxin/FuseLLM/lm7b_rep/0_10000 \
  --blending_dataset_dir /data/mengxin/FuseLLM/openlm7b_rep/0_10000 \
  --dataset_save_dir /data/mengxin/FuseLLM/aligned_dataset_rep/llama_opemlm_0_10000 \
  --cache_dir ./cache/ \
  --model_max_length 2048 \
  --preprocessing_num_workers 80 \
  --batch_size 100 \
  --blending_model_index 0 \
  --vocab_align_type "soft" \
  --vocab_mapping_save_dir /data/mengxin/FuseLLM/vocab_mapping/llama_2_7b_open_llama_7b_v2.json \
  --metric_level "sequence"

03/01/2024 08:34:54 - INFO - main - Data processing args: Namespace(base_model_name_or_path='/data/mengxin/FuseLLM/Llama-2-7b-hf', blending_model_name_or_path='/data/mengxin/FuseLLM/open_llama_7b_v2', base_dataset_dir='/data/mengxin/FuseLLM/lm7b_rep/0_10000', blending_dataset_dir='/data/mengxin/FuseLLM/openlm7b_rep/0_10000', dataset_save_dir='/data/mengxin/FuseLLM/aligned_dataset_rep/llama_opemlm_0_10000', cache_dir='./cache/', model_max_length=2048, preprocessing_num_workers=80, batch_size=100, blending_model_index=0, vocab_align_type='soft', vocab_mapping_save_dir='/data/mengxin/FuseLLM/vocab_mapping/llama_2_7b_open_llama_7b_v2.json', metric_level='sequence')
03/01/2024 08:34:54 - INFO - src.utils.others - Loading tokenizer.
Using pad_token, but it is not set yet.
03/01/2024 08:34:54 - INFO - src.utils.others - bos_token: <s>, 1 eos_token: </s>, 2 unk_token: <unk>, 0 pad_token: <unk>, 0
03/01/2024 08:34:54 - INFO - src.utils.others - Loading tokenizer.
You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
Using pad_token, but it is not set yet.
03/01/2024 08:34:54 - INFO - src.utils.others - bos_token: <s>, 1 eos_token: </s>, 2 unk_token: <unk>, 0 pad_token: <unk>, 0
num_proc must be <= 1. Reducing num_proc to 1 for dataset of size 1.
03/01/2024 08:34:54 - WARNING - datasets.arrow_dataset - num_proc must be <= 1. Reducing num_proc to 1 for dataset of size 1.
Align blending model's logits with base model's logits.:   0%|          | 0/1 [00:00<?, ? examples/s]
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/fusellmpy39/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/user/anaconda3/envs/fusellmpy39/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/mengxin/FuseLLM/src/utils/token_alignment.py", line 162, in <module>
    base_model_blending_model_logits_datasets[k] = base_model_logits_datasets[k].map(
  File "/home/user/anaconda3/envs/fusellmpy39/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 593, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/user/anaconda3/envs/fusellmpy39/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 558, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/user/anaconda3/envs/fusellmpy39/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3105, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/home/user/anaconda3/envs/fusellmpy39/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3482, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/home/user/anaconda3/envs/fusellmpy39/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3361, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/data/mengxin/FuseLLM/src/utils/token_alignment.py", line 120, in align_blending_model_logits_with_base_model_logits
    feature_1["per_step_logits"] = feature_1["per_step_logits"][:len(feature_1['input_ids'])]
KeyError: 'per_step_logits'

Is the save directory from the previous script correct? (e.g. '--vocab_mapping_save_dir /data/mengxin/FuseLLM/vocab_mapping/llama_2_7b_mpt_7b.json')

18907305772 commented 6 months ago

Could you please provide the keys in "/data/mengxin/FuseLLM/lm7b_rep/0_10000"? After the "get representations for each LLM" scripts run successfully, "per_step_logits" will be present in the processed dataset.
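
A minimal way to check, assuming the directory was written with datasets.save_to_disk (which matches the dataset_info.json/state.json layout posted below):

from datasets import load_from_disk

# Load the saved representation dataset and list its columns;
# 'per_step_logits' and 'per_step_indices' should appear if the
# representation script finished successfully.
dataset = load_from_disk("/data/mengxin/FuseLLM/lm7b_rep/0_10000")
print(dataset)
print(dataset.column_names)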

ZLKong commented 6 months ago

Hi,

Thank you very much for your quick response!

I am testing out the scripts, so I split the dataset into very small sets for quick testing.

Regarding /data/mengxin/FuseLLM/lm7b_rep/0_10000: my script for generating the representations is:

export CUDA_VISIBLE_DEVICES="2"
python -m src.utils.forward_for_logits \
   --model_name_or_path /data/mengxin/FuseLLM/Llama-2-7b-hf \
   --dataset /data/mengxin/FuseLLM/minipile/split/ \
   --dataset_save_dir /data/mengxin/FuseLLM/lm7b_rep/0_10000 \
   --dataset_split_num 10000 \
   --dataset_index 0 \
   --cache_dir ./cache/ \
   --model_max_length 2048 \
   --training_mode full \
   --load_in_half bf16 \
   --batch_size 6 \
   --preprocessing_num_workers 80 \
   --top_k_logits 10 \
   --save_per_token_metric 2>&1 > rep_0_10000_lm7b.log 2>&1 & 

The dataset (lm7b) looks like this:

dataset_info.json

{
  "citation": "",
  "description": "",
  "features": {
    "text": {
      "dtype": "string",
      "_type": "Value"
    },
    "input_ids": {
      "feature": {
        "dtype": "int32",
        "_type": "Value"
      },
      "_type": "Sequence"
    },
    "attention_mask": {
      "feature": {
        "dtype": "int8",
        "_type": "Value"
      },
      "_type": "Sequence"
    },
    "labels": {
      "feature": {
        "dtype": "int64",
        "_type": "Value"
      },
      "_type": "Sequence"
    },
    "per_step_metric_ce": {
      "feature": {
        "dtype": "float16",
        "_type": "Value"
      },
      "_type": "Sequence"
    },
    "per_step_logits": {
      "feature": {
        "feature": {
          "dtype": "float16",
          "_type": "Value"
        },
        "_type": "Sequence"
      },
      "_type": "Sequence"
    },
    "per_step_indices": {
      "feature": {
        "feature": {
          "dtype": "int64",
          "_type": "Value"
        },
        "_type": "Sequence"
      },
      "_type": "Sequence"
    },
    "metric_ce": {
      "dtype": "float16",
      "_type": "Value"
    }
  },
  "homepage": "",
  "license": ""
}

state.json

{
  "_data_files": [
    {
      "filename": "data-00000-of-00001.arrow"
    }
  ],
  "_fingerprint": "2302e5a489a55fe6",
  "_format_columns": null,
  "_format_kwargs": {},
  "_format_type": null,
  "_output_all_columns": false,
  "_split": null
}

ZLKong commented 6 months ago

I found the cause: in token_alignment.py, at https://github.com/18907305772/FuseLLM/blob/1941b94cf062a752cf2ea407e0dded5034dce1c5/FuseLLM/src/utils/token_alignment.py#L105

I had changed load_from_disk() into load_dataset().

The reason I changed this is that another error occurred in the previous script:

FileNotFoundError: Directory /data/mengxin/FuseLLM/aligned_dataset_rep/llama_opemlm_0_10000 is neither a Dataset directory nor a DatasetDict directory.

So I followed the solution in https://github.com/huggingface/datasets/issues/6111

But it turns out this change causes the current "per_step_logits" issue.

I do not know the difference between load_from_disk() and load_dataset().
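
Roughly, the two calls expect different inputs; a minimal sketch (the raw file name below is illustrative, not from the project):

from datasets import load_dataset, load_from_disk

# load_from_disk reads a directory previously written by Dataset.save_to_disk /
# DatasetDict.save_to_disk (it expects the state.json / dataset_info.json and
# .arrow files shown above), so it is the right call for the representation
# directories produced earlier in this pipeline.
reps = load_from_disk("/data/mengxin/FuseLLM/lm7b_rep/0_10000")

# load_dataset builds a dataset from a Hub repository name, a loading script,
# or raw data files (json/csv/parquet/...); it does not read save_to_disk
# directories, which is why swapping the calls changes the available columns.
raw = load_dataset("json", data_files="example_raw_data.json")  # illustrative file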