DAMO-NLP-SG / LLM-R2

Having trouble getting run_CLTrain.sh to execute #4

Open Elfinwang opened 2 weeks ago

Elfinwang commented 2 weeks ago

I’m having trouble getting the run_CLTrain.sh script to execute.

  1. Where can I get the file referenced by 'file_name="data/data_simcse/${train_file}_for_simcse.csv"'?
  2. I would appreciate some guidance on recommended parameters for training, such as the number of epochs to use, etc.
  3. I downloaded 'nli_for_simcse.csv' from 'https://huggingface.co/datasets/princeton-nlp/datasets-for-simcse/resolve/main/nli_for_simcse.csv' and encountered an error at the following line of code (src/train.py, line 319): examples[sent0_cname][idx] = conv_dict(ast.literal_eval(examples[sent0_cname][idx].replace('-inf', '-2e308'))) The error occurs when ast.literal_eval tries to parse the string.

I would appreciate your help!!!

LZ12DH commented 5 days ago

Hi,

Thanks for the feedback!

The '${train_file}_for_simcse.csv' file is generated by running 'src/prepare_CL_dataset.py'. Sorry, my training data is over 2GB, so I could not upload it to the repo. You can run 'prepare_CL_dataset.py' on some query triplets to generate the files.

As for the error you encountered, the reason is that we use a tree-based query encoding, which differs from the plain text used in SimCSE. If you have trouble running the above-mentioned code, you can also contact me by email at G220002@e.ntu.edu.sg, and I can share a small set of training data with you so you can check whether the error still occurs.
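
To illustrate the mismatch, here is a minimal sketch of the replace-then-literal_eval pattern quoted from src/train.py. The dictionary layout in 'tree_like' is a made-up assumption for illustration only and is not the actual encoding produced by 'prepare_CL_dataset.py'; 'parse_encoded_field' and 'is_parseable' are hypothetical helper names. It only shows why a stringified Python literal parses while a plain NLI sentence does not.

```python
import ast


def parse_encoded_field(raw: str):
    """Parse a stringified Python literal.

    '-inf' is not a valid Python literal, so it is rewritten to '-2e308',
    a float literal that overflows to negative infinity.
    """
    return ast.literal_eval(raw.replace('-inf', '-2e308'))


def is_parseable(raw: str) -> bool:
    """Return True if the field can be parsed as a Python literal."""
    try:
        parse_encoded_field(raw)
        return True
    except (SyntaxError, ValueError):
        return False


# Hypothetical tree-style field (assumed shape, not the repo's real format).
tree_like = "{'node': 'Join', 'cost': -inf, 'children': []}"
parsed = parse_encoded_field(tree_like)

# A plain-text sentence, like the rows in nli_for_simcse.csv.
plain = "A man is playing a guitar."

print(parsed['cost'] == float('-inf'))
print(is_parseable(tree_like), is_parseable(plain))
```

A quick check like 'is_parseable' on the first few rows of a CSV should show whether the file holds encoded queries or plain sentences before starting training.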

Hope this reply clarifies your doubts!