AmritaBh / ConDA-gen-text-detection

Code for the paper: ConDA: Contrastive Domain Adaptation for AI-generated Text Detection
MIT License
30 stars, 0 forks

Assistance Request for ConDA-gen-text-detection Code Implementation #1

Open 15399215469 opened 7 months ago

15399215469 commented 7 months ago

Background

I am a college student working on running the ConDA-gen-text-detection code from Amrita Bhattacharjee's GitHub repository. I have run into some issues along the way and would appreciate guidance.

Process

I downloaded the TuringBench.zip file from TuringBench and processed all of its CSV files into real_dataset.jsonl and fake_dataset.jsonl. After preprocessing, I generated the corresponding scrambled text and ran the multi_domain_runner.py script, which produced a .pt model file.

Problem

When I evaluated the model with the evaluation.py script, I did not obtain the F1 score reported in Amrita Bhattacharjee's paper. I am using roberta-base as the encoder.

Request

I would greatly appreciate any insight into the correct procedure for reproducing the results reported in the paper. I may have overlooked specific parameters or steps.

Relevant Information

Encoder used: roberta-base
Data source: TuringBench.zip
Processed files: real_dataset.jsonl and fake_dataset.jsonl

Contact Information

Name: Ruifan Zhao
University: Mongolian University
Email: zhaomr314@gmail.com

Thank you for your time and assistance. I look forward to your response.

AmritaBh commented 7 months ago

Hi Ruifan,

The ConDA model is supposed to be trained with data from one source generator and one target generator, not all the data from TuringBench at once (as described in the paper). After pre-processing the files from the two generators, you need to perform some text transformation (such as synonym replacement, as described in the paper). An example notebook for this is also provided in the repo. The files that you get after this step will be used for training. Please make sure to edit/change paths and directories and other parameters as necessary when you run the scripts.
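For readers following along, the two steps above (restricting the data to one generator per domain, then producing a transformed copy of each text) can be sketched roughly as follows. This is only an illustration and not the code from the repo or its notebook: the field names `text` and `label`, the generator names `ctrl` and `gpt2_xl`, the helper functions, and the tiny `SYNONYMS` table are all assumptions made for the example. A real run would use the repo's own preprocessing and a proper lexical resource for synonyms.

```python
import json
import random

# Hypothetical toy synonym table (assumption); a real pipeline would use
# a lexical resource rather than a hand-written dictionary.
SYNONYMS = {
    "big": ["large", "huge"],
    "quick": ["fast", "rapid"],
    "said": ["stated", "remarked"],
}


def select_generator(records, generator, human_label="human"):
    """Keep human-written texts plus texts from exactly one generator."""
    return [r for r in records if r["label"] in (human_label, generator)]


def synonym_replace(text, rng, prob=0.5):
    """Replace each word found in SYNONYMS with probability `prob`."""
    out = []
    for word in text.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < prob:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)


def build_pairs(records, generator, seed=0):
    """Pair every selected text with a transformed copy of itself."""
    rng = random.Random(seed)
    return [
        {
            "original": r["text"],
            "transformed": synonym_replace(r["text"], rng),
            "label": r["label"],
        }
        for r in select_generator(records, generator)
    ]


def to_jsonl(rows):
    """Serialize rows as JSON Lines (one JSON object per line)."""
    return "\n".join(json.dumps(row) for row in rows)


# Toy records standing in for preprocessed TuringBench rows (assumed schema).
records = [
    {"text": "The quick fox said hello", "label": "human"},
    {"text": "A big storm is coming", "label": "gpt2_xl"},
    {"text": "The market said nothing", "label": "ctrl"},
]

source_pairs = build_pairs(records, "ctrl")     # source-domain generator
target_pairs = build_pairs(records, "gpt2_xl")  # target-domain generator
print(to_jsonl(source_pairs))
```

The point of the sketch is that each training file contains only one generator's text alongside the human-written text, so the source and target files are built separately instead of pooling every TuringBench generator into a single real/fake pair of files.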

15399215469 commented 6 months ago

Dear Amrita Bhattacharjee,

Since I have not modified your code, I suspect that the problem lies in how I processed the dataset. I will write up my processing steps in a PDF and send it to you, and I hope you can help. If possible, could you please share a compressed archive of your dataset? Thank you very much!

On Mon, Dec 18, 2023 at 06:54, Amrita Bhattacharjee @.***> wrote:

Hi Ruifan,

The ConDA model is supposed to be trained with data from one source generator and one target generator, not all the data from TuringBench at once (as described in the paper). After pre-processing the files from the two generators, you need to perform some text transformation (such as synonym replacement, as described in the paper). An example notebook for this is also provided in the repo. The files that you get after this step will be used for training. Please make sure to edit/change paths and directories and other parameters as necessary when you run the scripts.
