alimohammadbeigi / Model-Attribution-in-Machine-Generated-Disinformation

Model Attribution in Machine-Generated Disinformation: A Domain Generalization Approach with Supervised Contrastive Learning

Why is the CoAID paraphrase dataset for ChatGPT different from the CoAID paraphrase data for the other models? Seems like an error #1

Open aa-dank opened 2 months ago

aa-dank commented 2 months ago

Specifically, this file: Model-Attribution-in-Machine-Generated-Disinformation/data/filtered_llm/gpt-3.5-turbo/coaid/synthetic-gpt-3.5-turbo_coaid_paraphrase_generation_filtered.csv

The columns of this file are: 'generation_approach', 'label', 'news_id', 'news_text', 'synthetic misinformation', 'theme'

whereas the CoAID paraphrase files for the other models have these columns: 'generation_approach', 'human', 'label', 'news_id', 'prompt', 'synthetic misinformation', 'theme sentence or passage'
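For reference, here is a minimal sketch of how I compared the column sets with pandas, run from the repository root. The gpt-3.5-turbo path is the one quoted above; the llama2_70b path is the one I quote in my second comment below.

```python
import pandas as pd

# Paths as they appear in this repository (run from the repo root).
gpt_path = "data/filtered_llm/gpt-3.5-turbo/coaid/synthetic-gpt-3.5-turbo_coaid_paraphrase_generation_filtered.csv"
llama_path = "data/filtered_llm/llama2_70b/coaid/synthetic-llama2_70b_coaid_paraphrase_generation_filtered.csv"

gpt_cols = set(pd.read_csv(gpt_path).columns)
llama_cols = set(pd.read_csv(llama_path).columns)

# Columns that appear in one file but not the other.
print("only in the gpt-3.5-turbo file:", sorted(gpt_cols - llama_cols))
print("only in the llama2_70b file:   ", sorted(llama_cols - gpt_cols))
```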

aa-dank commented 2 months ago

Also, the Llama 2 data is missing twenty rows of prompts and generations that do exist in the equivalent Vicuna CoAID paraphrase data. Was that intentional? The file in question: Model-Attribution-in-Machine-Generated-Disinformation/data/filtered_llm/llama2_70b/coaid/synthetic-llama2_70b_coaid_paraphrase_generation_filtered.csv
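A rough sketch of the row-count comparison I did is below. The Vicuna path is a guess based on the directory layout of the other models, so substitute the actual directory/file name from the repo; it also assumes both files share the 'prompt' column listed above.

```python
import pandas as pd

llama_path = "data/filtered_llm/llama2_70b/coaid/synthetic-llama2_70b_coaid_paraphrase_generation_filtered.csv"
# Hypothetical Vicuna path -- guessed from the layout of the other model
# directories; replace with the actual path in the repository.
vicuna_path = "data/filtered_llm/vicuna/coaid/synthetic-vicuna_coaid_paraphrase_generation_filtered.csv"

llama_df = pd.read_csv(llama_path)
vicuna_df = pd.read_csv(vicuna_path)

print("llama2_70b rows:", len(llama_df))
print("vicuna rows:    ", len(vicuna_df))

# Prompts present in the Vicuna file but absent from the Llama 2 file
# (assumes both files have the 'prompt' column listed above).
missing_prompts = set(vicuna_df["prompt"]) - set(llama_df["prompt"])
print(f"{len(missing_prompts)} prompts missing from the Llama 2 file")
```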