SALT-NLP / demonstrated-feedback


script for few-shot prompting baseline and details on head-to-head comparison for Table 1 with GPT-4 evaluation #5

Open wise-east opened 3 months ago

wise-east commented 3 months ago

Hey Omar. First of all, great work! I really enjoyed reading this paper and it's exciting that we can get such strong personalization results with few demonstration samples!

  1. I'm trying to replicate some of the results and see where I can build on your work, starting with Table 1. Would you be able to add the scripts for the few-shot prompting baseline so I can set it up as similarly as possible?

  2. Also, I want to understand how the head-to-head GPT-4 evaluation results were computed. For example, Mistral DITTO's result is a 62.50 win rate for a1 in CMCC. To get this result, DITTO's outputs were compared pair-wise against all the other models (SFT, SPIN, few-shot, and zero-shot) for three samples each. The test set is 3 samples, so my understanding is that the denominator for the comparisons should be 36 (3 × 3 × 4), but that won't give me 62.50 with any integer numerator (36 × 0.625 = 22.5). What am I missing?

  3. Another question is about the input for the GPT-eval prompt. According to Appendix D, the GPT-eval prompt is:

    
    System: You are an impartial evaluator.

    You are an impartial evaluator. Below is a sample of a human author’s writing and two options.

    HUMAN AUTHOR’S WRITING:

    {demo}

    OUTPUT A:

    {text_a}

    OUTPUT B:

    {text_b}

    Task

    Which option was written by the human author based on similarity to the HUMAN AUTHOR’S WRITING above?
    Respond only with a JSON of the following format: { "answer": "<The option most similar to the HUMAN AUTHOR’S WRITING; either A or B>" }

    ALWAYS REMAIN IMPARTIAL WHEN EVALUATING OUTPUTS.

Is it correct to assume that text_a and text_b were generated for the same task as the demo? In other words, given a task & demonstration pair ($x_i$, $y_i$), does this prompt only contain the demonstration $y_i^E$ and the outputs $y_i^a$ and $y_i^b$ generated by models $a$ and $b$ for $x_i$, but not $x_i$ itself?
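For context, this is roughly how I'm planning to fill the template on my end (a minimal sketch; build_eval_prompt and the variable names are mine, not from your repo):

# Sketch: fill the Appendix D template with the demonstration (y_i^E) and the two
# model outputs generated for the same task x_i; x_i itself is never inserted.
def build_eval_prompt(template: str, demo: str, text_a: str, text_b: str) -> str:
    # str.replace avoids having to escape the literal JSON braces in the template,
    # which would trip up str.format.
    return (template
            .replace("{demo}", demo)
            .replace("{text_a}", text_a)
            .replace("{text_b}", text_b))
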
oshaikh13 commented 3 months ago

Thanks for the kind words!

For 1, yes! It's been on my TODO; let me put that up ASAP.

For 2, we actually swapped the order of the two outputs in the prompt and then averaged, so that might be where the discrepancy is coming from (i.e., a missing factor of 2). This detail is buried in Appendix D.
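
To spell it out with your numbers (a rough sketch, not the actual eval code; the 45 wins below is just an illustrative value):

# Every pair is judged twice, once in each A/B order, and the results are averaged,
# so the denominator is 2x the one-direction count.
one_direction = 3 * 3 * 4    # the 36 comparisons you counted
total = one_direction * 2    # 72 judgments after swapping the order
print(45 / total * 100)      # e.g. 45 wins -> 62.5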

And for 3, that is right!

Hope that helps! Lmk if you have any other questions!

wise-east commented 3 months ago

Thanks Omar for the clarifications!

If you don't mind me being lazy with this question: will you also be sharing your scripts for preparing the data & running SPIN? Or is it pretty straightforward to use https://github.com/uclaml/SPIN with your custom data (the smaller CCAT and CMCC)?

wise-east commented 3 months ago

Hey Omar, another question that I have is whether the data in https://github.com/SALT-NLP/demonstrated-feedback/tree/main/benchmarks/custom is the full dataset from the user case study mentioned in the paper. It seems to be a subset?

oshaikh13 commented 3 months ago

So it should be pretty straightforward to use the SPIN code (that's what we did). I can try to clean up and push the data conversion code when I get the time!

Re: the custom data: that's just a random example I wrote. We can't release the user study data, unfortunately, because I didn't ask for permission in the IRB, but... I think there's something we can do about that. Lemme email you.

wise-east commented 3 months ago

Thanks!

Sorry to keep bugging you with questions, but I have a couple more:

  1. I'm not able to locate the code that does the preprocessing for Section 4.1, i.e., randomly selecting the authors and sampling 7 examples from each author for the training set, plus 5 more for eval and test (2 for eval and 3 for test, based on the paper). The eval set doesn't seem to be used, though. Is that correct?

I see that there's a parameter for limiting the number of training instances in https://github.com/SALT-NLP/demonstrated-feedback/blob/a40224f625d86f46dc3d9849a7325931cc9238ce/scripts/run_ditto.py#L138 but I don't think the example script uses it and it's set to None by default.

Given the per-author variance in the Table 1 results, I think I'd need to know the author indices to replicate the results more accurately. Did you have a separate script that chose the 10 random author indices? Could you share that script or the indices themselves?

  2. Do the author IDs in Table 1 correspond across CMCC and CCAT? I.e., is a1 for CMCC the same as a1 for CCAT, a2 for CMCC the same as a2 for CCAT, and so on? My understanding is that these datasets are independent, so the authors are actually different.

oshaikh13 commented 3 months ago
  1. We used the eval set to pick hyperparameters for test (Appendix C). There are notebooks that do the sampling under the benchmarks folder (proc_[dataset]).

I think the pkl files already have the right train subset sampled, so you won't have to use the num_train_instances key? I think that's a vestige from before a refactor. There should be train/val/test splits under the benchmarks folder.

  2. Yup, that's right!

No worries at all!

wise-east commented 3 months ago

Thanks for the clarification!

Upon examination, I think what you said is true except for CCAT's training set, which shows 40 samples for each author. There are also some discrepancies between the counts reported in the paper and what's in the files: some test sets only have 2 samples, and some CMCC authors have 8 training samples instead of 7.

Checked with:

import pickle

# Check the number of authors and samples per author in each processed split.
for split in ["train", "val", "test"]:
    for n in ["cmcc", "ccat50"]:
        print(n, split)
        with open(f"./{n}/processed/{n}_{split}.pkl", "rb") as pickle_file:
            data = pickle.load(pickle_file)

        print(len(data))           # number of authors in this split
        for k in data:
            print(len(data[k]))    # number of samples for author k

About the author IDs: the processed files contain more than 10 authors each (CMCC has 21 and CCAT has 50), so I was wondering which authors correspond to a1 through a10 in Table 1. From what I see in the code, there's no random shuffling applied to the loaded pkl files, so using train_author_key = 0 would use train_data[0], and so on. Does $a_{1...10}$ correspond to train_author_keys 0 through 9?

oshaikh13 commented 3 months ago

So the processing notebooks have already shuffled these! Yes, 0 corresponds to a_1 and so on.
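
Concretely, something like this should line up (a quick sketch, assuming you load the processed pkls the way you did in your snippet above):

import pickle

# a_1 ... a_10 in Table 1 map to author keys 0 ... 9 in the already-shuffled
# processed splits; no extra shuffling happens at load time.
with open("./cmcc/processed/cmcc_train.pkl", "rb") as f:
    train_data = pickle.load(f)

for table_idx in range(1, 11):      # a_1 ... a_10
    author_key = table_idx - 1      # train_author_key 0 ... 9
    print(f"a_{table_idx} -> key {author_key}, {len(train_data[author_key])} train samples")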

Also, you're right that the test set has only 2 samples for some CMCC authors; this should be documented in the appendix!

To keep things consistent, I'd pass 7 as the value to --num_training_instances for everything.