Closed ZLKong closed 9 months ago
Could you please list the keys in "/data/mengxin/FuseLLM/lm7b_rep/0_10000"? Once the scripts under "get representations for each LLM" have run successfully, the processed dataset will contain "per_step_logits".
Hi,
Thank you very much for your quick response!
I am testing the scripts, so I split the dataset into very small subsets for quick testing.
Regarding /data/mengxin/FuseLLM/lm7b_rep/0_10000:
My script for generating the representations is:
export CUDA_VISIBLE_DEVICES="2"
python -m src.utils.forward_for_logits \
--model_name_or_path /data/mengxin/FuseLLM/Llama-2-7b-hf \
--dataset /data/mengxin/FuseLLM/minipile/split/ \
--dataset_save_dir /data/mengxin/FuseLLM/lm7b_rep/0_10000 \
--dataset_split_num 10000 \
--dataset_index 0 \
--cache_dir ./cache/ \
--model_max_length 2048 \
--training_mode full \
--load_in_half bf16 \
--batch_size 6 \
--preprocessing_num_workers 80 \
--top_k_logits 10 \
--save_per_token_metric > rep_0_10000_lm7b.log 2>&1 &
The dataset looks like:
dataset_info.json
{
"citation": "",
"description": "",
"features": {
"text": {
"dtype": "string",
"_type": "Value"
},
"input_ids": {
"feature": {
"dtype": "int32",
"_type": "Value"
},
"_type": "Sequence"
},
"attention_mask": {
"feature": {
"dtype": "int8",
"_type": "Value"
},
"_type": "Sequence"
},
"labels": {
"feature": {
"dtype": "int64",
"_type": "Value"
},
"_type": "Sequence"
},
"per_step_metric_ce": {
"feature": {
"dtype": "float16",
"_type": "Value"
},
"_type": "Sequence"
},
"per_step_logits": {
"feature": {
"feature": {
"dtype": "float16",
"_type": "Value"
},
"_type": "Sequence"
},
"_type": "Sequence"
},
"per_step_indices": {
"feature": {
"feature": {
"dtype": "int64",
"_type": "Value"
},
"_type": "Sequence"
},
"_type": "Sequence"
},
"metric_ce": {
"dtype": "float16",
"_type": "Value"
}
},
"homepage": "",
"license": ""
}
state.json
{
"_data_files": [
{
"filename": "data-00000-of-00001.arrow"
}
],
"_fingerprint": "2302e5a489a55fe6",
"_format_columns": null,
"_format_kwargs": {},
"_format_type": null,
"_output_all_columns": false,
"_split": null
}
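A quick way to answer the question about keys without loading the Arrow data is to read the column names straight from dataset_info.json. The following is only a minimal sketch using the Python standard library; the helper names saved_columns and missing_columns and the EXPECTED_COLUMNS set are made up for illustration (the set simply lists the per-token columns that appear in the dataset_info.json above) and are not part of FuseLLM.

```python
import json
from pathlib import Path

# Hypothetical list: the per-token columns visible in the dataset_info.json above.
EXPECTED_COLUMNS = {
    "per_step_logits",
    "per_step_indices",
    "per_step_metric_ce",
    "metric_ce",
}


def saved_columns(dataset_dir):
    """Return the sorted column names recorded in <dataset_dir>/dataset_info.json."""
    info_path = Path(dataset_dir) / "dataset_info.json"
    with info_path.open(encoding="utf-8") as f:
        info = json.load(f)
    # The "features" object maps each column name to its type description.
    return sorted(info.get("features", {}))


def missing_columns(dataset_dir, expected=EXPECTED_COLUMNS):
    """Return the expected columns that are absent from the saved dataset."""
    return sorted(set(expected) - set(saved_columns(dataset_dir)))
```

For the dataset_info.json shown above, missing_columns("/data/mengxin/FuseLLM/lm7b_rep/0_10000") would return an empty list, since all four columns are present.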
I found the cause of the error. In token_alignment.py (https://github.com/18907305772/FuseLLM/blob/1941b94cf062a752cf2ea407e0dded5034dce1c5/FuseLLM/src/utils/token_alignment.py#L105) I had changed load_from_disk() to load_dataset().
I made that change because the previous script failed with another error:
FileNotFoundError: Directory /data/mengxin/FuseLLM/aligned_dataset_rep/llama_opemlm_0_10000 is neither a Dataset directory nor a DatasetDict directory.
So I followed the solution in https://github.com/huggingface/datasets/issues/6111, but it turns out that this change causes the current "per_step_logits" issue.
I do not know the difference between load_from_disk() and load_dataset().
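On the difference: load_from_disk() is the counterpart of save_to_disk() and expects a directory written by it, whereas load_dataset() builds a dataset from the Hub or from raw data files. The FileNotFoundError above is what load_from_disk() reports when the directory does not look like a saved dataset. The sketch below is a standard-library diagnostic under two assumptions: that a saved Dataset directory is recognized by dataset_info.json plus state.json (the two files shown earlier in this thread), and that a saved DatasetDict directory is recognized by a dataset_dict.json file. The function name classify_saved_dir is hypothetical.

```python
from pathlib import Path


def classify_saved_dir(path):
    """Guess what kind of save_to_disk() output a directory holds.

    Returns "missing", "DatasetDict", "Dataset", or "unrecognized".
    The marker file names are assumptions about the on-disk layout.
    """
    p = Path(path)
    if not p.is_dir():
        # The directory was never created, e.g. an earlier step failed.
        return "missing"
    if (p / "dataset_dict.json").is_file():
        return "DatasetDict"
    if (p / "dataset_info.json").is_file() and (p / "state.json").is_file():
        return "Dataset"
    # The directory exists but lacks the marker files, e.g. a partial write.
    return "unrecognized"
```

If this returns "missing" or "unrecognized" for the aligned dataset directory, the more likely fix is to check whether the step that writes that directory completed, rather than to swap the loading function.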
Hi,
I encountered an error when running the data alignment.
My script is
Is the save path from my previous script correct?
(e.g. ' --vocab_mapping_save_dir /data/mengxin/FuseLLM/vocab_mapping/llama_2_7b_mpt_7b.json \ ')