HKUST-KnowComp / Knowledge-Constrained-Decoding

Official Code for EMNLP2023 Main Conference paper: "KCTS: Knowledge-Constrained Tree Search Decoding with Token-Level Hallucination Detection"

Problem with the process of training GPT2XL #2

Open cheesemarmot opened 4 months ago

cheesemarmot commented 4 months ago

Hello, I really admire the dedication and effort you put into your work, but I have a few questions about the process of using the scripts to train GPT2XL and guide the decoding of GPT3.5.

First, I would like to know where checkpoint-best/pytorch_model.bin comes from. Is it produced by your training, or is it just the original model? When I ran the fine-tune RIPA script on gpt2, loading it did not work at all and reported missing weights.

Second, would you be so kind as to provide the versions of the packages in requirements.txt? That would help a lot.

I appreciate your time and assistance in clarifying these points.

qishenghu commented 4 months ago

Though I haven't run through the whole process, my attempt at this suggests we probably need 'transformers==4.33.2'.

qishenghu commented 4 months ago

I encountered the same issue as @cheesemarmot mentioned above. It seems to run after replacing 'load_checkpoint' with 'load_peft_checkpoint' and then loading the LoRA checkpoint.
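For reference, a minimal sketch of what that workaround looks like with the peft API; the base model class, checkpoint directory, and variable names here are placeholders rather than the repo's actual load_peft_checkpoint arguments:

```python
# Sketch of loading the LoRA adapter on top of the base LM instead of a
# full pytorch_model.bin state dict. Paths and the base model class are
# assumptions; the repo's load_peft_checkpoint helper may differ.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_name = "gpt2-xl"               # assumed base model
adapter_dir = "outputs/checkpoint-best"   # placeholder: dir with adapter_config.json + adapter weights

tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(base_model_name)

# Only the small LoRA adapter weights are loaded here, so there is no
# "missing weights" error for the full model.
model = PeftModel.from_pretrained(base_model, adapter_dir)
model.eval()
```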

But another issue I found is that the authors' dataset-loading code filters the dataset. When I run inference on the cnn_dailymail dataset, the test set is filtered from 11490 samples down to only 1780:

dataset = dataset.filter(lambda x: x['doc_len'] <= tokenizer.model_max_length - 25)
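For concreteness, a sketch of how the before/after counts can be checked; the doc_len computation below is a guess at how the repo derives that field, so the exact numbers may differ slightly:

```python
# Reproducing the filtering effect on the cnn_dailymail test set.
# The doc_len computation below is an assumption about the repo's
# preprocessing; only the filter expression itself is from the code.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")  # model_max_length == 1024

dataset = load_dataset("cnn_dailymail", "3.0.0", split="test")
print(len(dataset))  # 11490

# Approximate doc_len by tokenizing the source article.
dataset = dataset.map(lambda x: {"doc_len": len(tokenizer(x["article"])["input_ids"])})

# The filter from the repo: keep only documents that fit in the context
# window with ~25 tokens reserved.
dataset = dataset.filter(lambda x: x["doc_len"] <= tokenizer.model_max_length - 25)
print(len(dataset))  # about 1780, as reported above
```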

syncdoth commented 4 months ago

Hi all,

The checkpoint comes from training the LoRA weights for the discriminator. As for not being able to load the weights, I'll double-check and follow up in this thread.

As for filtering the test dataset, thank you for pointing this out. This may have affected the test set selection, although the same test set has been used for all methods in the main results (since the base model/tokenizer is the same), so it should not have affected the fairness of the evaluation. We will release a fix and share the new evaluation results along with it.

qishenghu commented 4 months ago

Thanks. As you said, the filtering operation does affect the test set selection but indeed does not affect the fairness of the evaluation, since you ran all the experiments on the same test set.

Also, as I ran through the code, it took quite a long time to finish inference on both the summarization and dialogue test sets. Can I confirm with you that the inference process on a single GPU usually takes more than 1 day to complete? I am not sure whether I am running the code correctly. Thanks!!

syncdoth commented 4 months ago

Yes, unfortunately, it may take quite some time to generate (with 50 MCTS steps, as in the default parameters). In particular, summarization takes longer because the context is longer. On a single RTX 3080, it may take about 2 days.
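For a rough sense of why it takes days, here is a back-of-the-envelope count of forward passes; only the 50 MCTS simulations per token comes from the default parameters, and the other numbers are illustrative assumptions:

```python
# Back-of-the-envelope decoding cost. Only mcts_steps = 50 is the default
# parameter; generation length, dataset size, and per-pass latency are
# hypothetical values chosen for illustration.
mcts_steps = 50          # simulations per generated token (default)
max_new_tokens = 64      # assumed summary length
num_examples = 1780      # e.g., the filtered cnn_dailymail test set
latency_per_pass = 0.03  # seconds per LM + discriminator pass (made up)

total_passes = mcts_steps * max_new_tokens * num_examples
hours = total_passes * latency_per_pass / 3600
print(f"{total_passes:,} forward passes, roughly {hours:.0f} hours")
```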

syncdoth commented 3 months ago

I updated the environment a while ago, but I believe I haven't updated the important packages. I have included the deep-learning-related packages and their versions below.

Package                            Version     Editable project location
---------------------------------- ----------- -------------------------------------------------
accelerate                         0.19.0
bert-score                         0.3.13
bitsandbytes                       0.38.1
BLEURT                             0.0.2
datasets                           2.9.0
deepspeed                          0.8.3
editdistance                       0.6.2
einops                             0.6.1
evaluate                           0.4.0
fire                               0.5.0
gensim                             4.3.0
huggingface-hub                    0.14.1
lightning-utilities                0.8.0
loralib                            0.1.1
ml-collections                     0.1.1
ml-dtypes                          0.2.0
networkx                           3.1
nltk                               3.7
numpy                              1.23.5
openai                             0.27.6
opt-einsum                         3.3.0
pandas                             1.5.2
peft                               0.3.0.dev0  # commit: 632997d1fb776c3cf05d8c2537ac9a98a7ce9435
pytorch-lightning                  1.9.0
rouge                              1.0.1
rouge-score                        0.1.2
sacrebleu                          2.3.1
safetensors                        0.4.0
scikit-learn                       1.2.0
scipy                              1.9.3
sentencepiece                      0.1.98
spacy                              3.5.2
spacy-legacy                       3.0.12
spacy-loggers                      1.0.4
tiktoken                           0.4.0
tokenizers                         0.13.3
torch                              1.13.1
torch-fidelity                     0.3.0
torchinfo                          1.7.2
torchmetrics                       0.11.4
torchtext                          0.14.1
torchtyping                        0.1.4
torchvision                        0.14.1
transformer-smaller-training-vocab 0.2.3
transformers                       4.30.0.dev0  # commit: cf11493dce0a1d22446efe0d6c4ade02fd928e50
wandb                              0.15.3