process doubt for dataset

misitetong commented 1 month ago

I notice the process of dataset in unllama_token_clf.py.

max_length=64
tokenized_inputs = tokenizer(examples["tokens"], is_split_into_words=True, padding='longest', max_length=max_length, truncation=True)

but in conll2003, the max length after tokenize without truncation greater than 64, which is 228, get from the code as follows:

tokenizer(examples["tokens"], is_split_into_words=True, padding='do_not_pad',truncation=False)

Thus, your dataset maybe only the part of the whole dataset, which affects the final F1 score.

If there is anything I haven't noticed, please let me know.

SeanLee97 commented 1 month ago

hi @misitetong, the max length indeed affects the F1 score positively or negatively. From your comment, it tends to affect the performance negatively. But the fact is that it doesn't hurt the performance based on my experiment increasing max_length to 256 (>228).

You can have a try using the following code from BiLLM:

WANDB_MODE=disabled BiLLM_START_INDEX=0 CUDA_VISIBLE_DEVICES=0 python billm_ner.py \
--model_name_or_path mistralai/Mistral-7B-v0.1 \
--dataset_name_or_path conll2003 \
--batch_size 2 \
--max_length 256 \
--push_to_hub 1 \
--hub_model_id WhereIsAI/billm-mistral-7b-conll03-ner-maxlen-256

In the second epoch, its performance performs better than max_length=64 and is close to the best results of max_length=64, as shown in the following Figure.

The model is still running, you can check the results a few hours later (maybe 3 hours) in huggingface. The training log will be automatically updated to huggingface when the training is done.

For comparison with max_length=64, you can check this huggingface repo, where F1 is the result of the test set.

misitetong commented 1 month ago

Thank you for your replay. I have got the basically consistent results.

4AI / LS-LLaMA

process doubt for dataset #21