AI-secure / aug-pe

[ICML 2024] Differentially Private Synthetic Data via Foundation Model APIs 2: Text
https://alphapav.github.io/augpe-dpapitext/
Apache License 2.0

Questions about the unstable rating accuracy of the Yelp dataset #4

Open szzhh opened 2 weeks ago

szzhh commented 2 weeks ago

[screenshot: PixPin_2024-09-12_13-39-09]

I used RANDOM_API combined with GPT-3.5 to generate a set of data, then used RoBERTa-base to run downstream training three times and make predictions on the test set. For the business category, the results are stable, but for the rating category, the results vary greatly from run to run, as shown in the screenshot. Did you encounter this problem during your experiments?

AlphaPav commented 1 week ago

Thanks for the question! We indeed observed variance in downstream classification accuracy, but not as large as what your figure shows. For each run, we use validation data to select the best epoch for the downstream classifier, and we report the average accuracy over 3 runs.
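For reference, here is a minimal sketch of that protocol in Python, with placeholder random numbers standing in for actual RoBERTa-base training and evaluation (the `train_and_eval_epoch` helper is hypothetical, not a function from this repo):

```python
import random
import statistics

# Hypothetical stand-in for one training epoch plus evaluation; in practice
# this trains RoBERTa-base on the synthetic text and scores real val/test data.
def train_and_eval_epoch(seed, epoch):
    random.seed(seed * 100 + epoch)
    val_acc = 0.60 + 0.10 * random.random()          # placeholder accuracies
    test_acc = val_acc + random.uniform(-0.02, 0.02)
    return val_acc, test_acc

def run_downstream(seed, num_epochs=10):
    """One run: keep the test accuracy at the epoch with best val accuracy."""
    best_val, best_test = -1.0, -1.0
    for epoch in range(num_epochs):
        val_acc, test_acc = train_and_eval_epoch(seed, epoch)
        if val_acc > best_val:
            best_val, best_test = val_acc, test_acc
    return best_test

# Report the mean (and spread) over 3 independent runs, as described above.
accs = [run_downstream(seed) for seed in range(3)]
print(f"test accuracy: {statistics.mean(accs):.4f} (std {statistics.stdev(accs):.4f})")
```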

szzhh commented 1 week ago

Thank you for your reply. I still have some questions about the paper 'Differentially Private Synthetic Data via Foundation Model APIs 2: Text'.

  1. The hyperparameter settings for the Yelp dataset in the figure below are K=3 and L=1. Does this mean that the Yelp results are effectively those of the original PE method? [screenshot]
  2. For a classification dataset such as Yelp, when using RANDOM_API, is the number of generated texts for each category random, rather than following the ratio of the categories formed by combining the two labels?
  3. When voting, is the category ratio taken into account, or is the selection made directly by top-k? I am interested in the specific voting process for classification datasets, which does not seem to be described in the paper.

Thank you for your patience in reading my questions. I am very much looking forward to your reply!

AlphaPav commented 6 days ago

Thanks for the great questions!

  1. For GPT-3.5 on Yelp, we find that K=3, L=1 works better because the quality of GPT-3.5-generated text is high; this setting can be seen as integrating the original PE method with our designed APIs. For open-source models, we use L>1.
  2. We call RANDOM_API to generate data for each category separately. The sample ratio for each category follows the original data distribution.
  3. We perform voting for each category separately, selecting the top-k synthetic samples within each category. A sketch of both steps is included below.
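
As a concrete illustration of points 2 and 3, here is a minimal sketch in Python. The label counts and embeddings are made up, the distance metric is plain Euclidean rather than the embedding-space similarity used in practice, and the DP noise step is only indicated by a comment, so treat it as an outline rather than the actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# (2) Per-category generation budgets following the original label distribution.
# Hypothetical counts for (business category, rating) labels; in practice the
# histogram comes from the private dataset.
label_counts = {("Restaurants", 5): 400, ("Restaurants", 1): 100,
                ("Shopping", 5): 300, ("Shopping", 3): 200}
total = sum(label_counts.values())
num_synthetic = 500
per_label_budget = {lbl: round(num_synthetic * c / total)
                    for lbl, c in label_counts.items()}
# RANDOM_API would then be called once per label with that label's budget.

# (3) Per-category voting: each private sample votes for its nearest synthetic
# sample, and the k most-voted synthetic samples are kept for that label.
def top_k_by_votes(private_emb, synthetic_emb, k):
    votes = np.zeros(len(synthetic_emb), dtype=int)
    for p in private_emb:
        nearest = np.argmin(np.linalg.norm(synthetic_emb - p, axis=1))
        votes[nearest] += 1
    # In the DP setting, Gaussian noise would be added to `votes` here.
    return np.argsort(votes)[::-1][:k]

# Toy embeddings for one label's private and synthetic samples.
private_emb = rng.normal(size=(50, 8))
synthetic_emb = rng.normal(size=(20, 8))
print("per-label budgets:", per_label_budget)
print("top-5 synthetic indices for this label:",
      top_k_by_votes(private_emb, synthetic_emb, k=5))
```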

Feel free to ask if you have more questions!