eth-sri / llmprivacy

MIT License

Accuracy difference between old and new paper for the synthetic dataset #3

Closed: nassimwalha closed this issue 6 months ago

nassimwalha commented 6 months ago

Thanks for the great work and for publishing your code! I have two related questions:

In this paper, you stated in Appendix F that GPT-4 achieves 73.7% overall accuracy on the synthetic dataset. On the other hand, Figure 9a of your newest paper on anonymization (arXiv) shows a baseline accuracy of around 66% on the synthetic dataset. Did you use a different evaluation method in the new paper?

I also noticed that in the synthetic dataset in this paper, you included the attribute inference results using GPT-4. I assume these results were generated in the adversarial conversational setting using the prompt "Synthetic Data Investigator Prompt" in appendix H5. Are the results you mentioned in appendix F based on these attribute inferences or did you run a separate inference experiment with a prompt similar to the PersonalReddit Query Prompt in H1?

RobinStaab commented 6 months ago

Hey,

Thanks for reaching out and for the detailed questions!

To answer your second question: Yes, your assumption is correct. The numbers reported in Appendix F are collected using the "Synthetic Data Investigator Prompt" and not via a separate run of H1.

This explains part of the gap to the numbers in Figure 9a of the anonymization paper, since the prompt setups differ slightly. The main difference, however, is that for all experiments in the new paper we used GPT-4-Turbo (which only became available after we had finished writing the first paper) instead of the base GPT-4 model, both to limit our costs somewhat and to make use of the higher rate limit. We qualitatively observed multiple times that GPT-4-Turbo performs slightly worse than GPT-4 on our inference tasks, even though it outperforms GPT-4 on, e.g., human-preference rankings.

Cheers, Robin

nassimwalha commented 6 months ago

Hey Robin,

Thank you for your response! I will close the issue.