baoguangsheng / fast-detect-gpt

Code base for "Fast-DetectGPT: Efficient Zero-Shot Detection of Machine-Generated Text via Conditional Probability Curvature".
MIT License
157 stars · 24 forks

Excellent work, and what's the min GPU requirement? #13

Open gitgoready opened 1 month ago

gitgoready commented 1 month ago

Thank you! Is it suitable for other LLMs?

baoguangsheng commented 1 month ago

The GPU requirement depends on the size of the LLMs you are using. In practice, a GPU with memory of about 1.5× the model size generally works.
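As a rough illustration of that rule of thumb (a sketch assuming fp16/bf16 weights at 2 bytes per parameter; the 1.5× factor leaves headroom for activations and intermediate buffers):

```python
def min_gpu_memory_gb(num_params_billion, bytes_per_param=2, overhead=1.5):
    """Rough minimum GPU memory (GB) needed to run a model for inference.

    bytes_per_param=2 assumes fp16/bf16 weights; overhead=1.5 follows the
    1.5x model-size rule of thumb, leaving headroom for activations.
    """
    model_size_gb = num_params_billion * bytes_per_param
    return model_size_gb * overhead

# e.g. a 6B-parameter model in fp16 needs roughly:
# min_gpu_memory_gb(6) -> 18.0 GB
```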

Fast-DetectGPT can be applied to any autoregressive LLM, such as the GPT series models. We have not tried it on non-autoregressive LLMs, so for those the answer is unknown.
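For reference, the criterion itself can be sketched for the simplified case where the sampling and scoring models coincide. This is a plain-Python illustration of conditional probability curvature; the per-position log-prob distributions and token ids are hypothetical inputs (the repo works with torch tensors instead):

```python
import math

def conditional_probability_curvature(log_prob_dists, token_ids):
    """Sketch of the Fast-DetectGPT criterion for the simplified case where
    the sampling model equals the scoring model.

    log_prob_dists: per-position log-probability distributions over the
    vocabulary (list of lists); token_ids: the observed text's token ids.
    """
    log_likelihood = 0.0  # log probability of the observed tokens
    mean_sum = 0.0        # expected token log-prob under the model's own distribution
    var_sum = 0.0         # variance of the token log-prob, summed over positions
    for dist, tok in zip(log_prob_dists, token_ids):
        log_likelihood += dist[tok]
        probs = [math.exp(lp) for lp in dist]
        mean = sum(p * lp for p, lp in zip(probs, dist))
        var = sum(p * lp * lp for p, lp in zip(probs, dist)) - mean * mean
        mean_sum += mean
        var_sum += var
    # Machine-generated text tends to yield a higher criterion than human text.
    return (log_likelihood - mean_sum) / math.sqrt(var_sum)
```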

gitgoready commented 1 month ago

Thanks! Can we use glm3-6B, qwen-7B, etc. as the scoring_model or reference_model?

gitgoready commented 1 month ago

chatglm3 errors; asking for help, thanks a lot:

    Traceback (most recent call last):
      File "scripts/local_infer.py", line 90, in <module>
        run(args)
      File "scripts/local_infer.py", line 64, in run
        tokenized = scoring_tokenizer(text, return_tensors="pt", padding=True, return_token_type_ids=False).to(args.device)
      File "/home/user/anaconda3/envs/detectgpt3.8/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2538, in __call__
        encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
      File "/home/user/anaconda3/envs/detectgpt3.8/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2644, in _call_one
        return self.encode_plus(
      File "/home/user/anaconda3/envs/detectgpt3.8/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2717, in encode_plus
        return self._encode_plus(
      File "/home/user/anaconda3/envs/detectgpt3.8/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 652, in _encode_plus
        return self.prepare_for_model(
      File "/home/user/anaconda3/envs/detectgpt3.8/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 3196, in prepare_for_model
        encoded_inputs = self.pad(
      File "/home/user/anaconda3/envs/detectgpt3.8/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 3001, in pad
        encoded_inputs = self._pad(
      File "/home/user/.cache/huggingface/modules/transformers_modules/models--zhipu--chatglm3-6b/tokenization_chatglm.py", line 254, in _pad
        assert self.padding_side == "left"
    AssertionError

baoguangsheng commented 1 month ago

It seems that chatglm3 requires padding_side to be "left", which can be configured at line 59 in model.py. Good luck!

gitgoready commented 1 month ago

For chatglm3 I changed the default dataset to pubmed, but it seems not to work:

    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument('--reference_model_name', type=str, default="chatglm3-6b")  # gpt-neo-2.7B; '': 'zhipu/chatglm3-6b'; use gpt-j-6B for more accurate detection
        parser.add_argument('--scoring_model_name', type=str, default="chatglm3-6b")  # gpt-neo-2.7B
        parser.add_argument('--dataset', type=str, default="pubmed")  # xsum
        parser.add_argument('--ref_path', type=str, default="./local_infer_ref")
        parser.add_argument('--device', type=str, default="cuda")
        parser.add_argument('--cache_dir', type=str, default="../../")
        args = parser.parse_args()

        run(args)

    Please enter your text: (Press Enter twice to start processing)
    学术写作是科学研究和学术交流的重要环节。无论是撰写研究论文、学术报告,还是进行学术演讲,良好的写作技巧都能够提高信息传递的效率和准确性。然而,许多学者和研究人员在学术写作过程中仍然面临一系列挑战。

    Fast-DetectGPT criterion is -7.0781, suggesting that the text has a probability of 0% to be fake.

    Please enter your text: (Press Enter twice to start processing)
    首先,学术写作需要清晰、简洁、准确地表达研究成果和观点。如何组织论文结构、选择合适的词汇、避免冗长和模糊的表达,都是需要认真思考的问题。 其次,引用和参考文献的规范使用也是学术写作的关键。正确引用他人的研究成果,不仅是对知识产权的尊重,更是保证学术诚信的基石。 最后,学术写作需要不断的练习和改进。每一篇论文都是一个机会,让我们更好地理解自己的研究,提高表达能力,进一步推动学术进步。 因此,本文旨在探讨学术写作的关键技巧,帮助研究人员更好地应对学术写作中的挑战。

    Fast-DetectGPT criterion is -9.7109, suggesting that the text has a probability of 0% to be fake.

    Please enter your text: (Press Enter twice to start processing)

gitgoready commented 1 month ago

What does "fake" in the output above mean: false information, or hallucination? Or is anything returned by a large model considered fake?

baoguangsheng commented 1 month ago

Here, "fake" means machine-generated. From your modified demo's output, you have successfully obtained the Fast-DetectGPT criterion, but its magnitude looks rather large. Different models produce different distributions of criterion values, so estimating the fake probability requires, as a reference, the distribution of this model's results on some standard datasets. You can replace the gpt-neo-2.7B results under the local_infer_ref folder with your own model's test results to obtain a valid probability estimate.
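To make the reference-based probability concrete, here is a minimal sketch (not necessarily the repo's exact estimator) that fits a normal distribution to each set of reference criterion values and returns the posterior probability of "fake" under equal priors. The function name and reference values are illustrative:

```python
import math

def estimate_fake_probability(criterion, real_crits, fake_crits):
    """Map a criterion value to P(machine-generated) via reference results.

    real_crits: criterion values measured on human-written reference text;
    fake_crits: criterion values measured on machine-generated reference text.
    """
    def gauss_pdf(x, vals):
        # Fit a normal distribution to the reference values, evaluate at x.
        mu = sum(vals) / len(vals)
        var = sum((v - mu) ** 2 for v in vals) / len(vals)
        return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

    p_real = gauss_pdf(criterion, real_crits)  # human-written reference
    p_fake = gauss_pdf(criterion, fake_crits)  # machine-generated reference
    return p_fake / (p_real + p_fake)
```

With reference distributions from a different scoring model, the same criterion value maps to a very different probability, which is why the gpt-neo-2.7B reference results must be replaced when switching to another model.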

gitgoready commented 1 month ago

Thank you very much for your reply!

1. To generate glm3's results on standard datasets, do I modify datasets="xsum squad writing" and source_models="bloom-7b1 opt-13b llama-13b llama2-13b" in main.sh or main_ext.sh, and then generate the JSON files you mentioned? There seems to be no pubmed among these datasets.

2. Does the model.py code below mean that glm3 can only use the pubmed dataset, since all other datasets set padding_side to "right", which conflicts with glm3? Can other kinds of data still be supported?

def load_tokenizer(model_name, for_dataset, cache_dir):
    model_fullname = get_model_fullname(model_name)
    optional_tok_kwargs = {}
    if "facebook/opt-" in model_fullname:
        print("Using non-fast tokenizer for OPT")
        optional_tok_kwargs['fast'] = False
    if for_dataset in ['pubmed']:
        optional_tok_kwargs['padding_side'] = 'left'
    else:
        optional_tok_kwargs['padding_side'] = 'right'
    base_tokenizer = from_pretrained(AutoTokenizer, model_fullname, optional_tok_kwargs, cache_dir=cache_dir)
    if base_tokenizer.pad_token_id is None:
        base_tokenizer.pad_token_id = base_tokenizer.eos_token_id
        if '13b' in model_fullname:
            base_tokenizer.pad_token_id = 0
    return base_tokenizer

baoguangsheng commented 1 month ago

Our demo's reference results are generated with gpt3to4.sh; the main difference lies in the dataset, since we used text generated by gpt-3.5 and gpt-4. It is best to build your own test dataset according to your specific needs (the content domains, languages, source models, and so on that you want to cover), so that the estimated probabilities are more accurate for your use case.

The configuration code in model.py can be changed freely to fit your needs; you don't have to stick to the previous settings, as long as your model's configuration requirements are met. The previous settings only target the models we tested and are not necessarily general (including the special configuration for pubmed).

Also, I suggest you use the glm3 base model (the model before SFT/RLHF). Chat-style models may need a suitable prompt to yield reliable predictions of the response distribution, which may be why your criterion magnitude is large. Likewise, the sampling and scoring models we used in our experiments are all base models.

gitgoready commented 1 month ago

Thanks for the reply; I have now tested with chatglm3-6b-base. But with the chatglm base model I still hit the same problem: load_tokenizer's padding_side must be "left" when loading the dataset, while currently every dataset except pubmed uses "right". How can this be solved?

    Traceback (most recent call last):
      File "scripts/local_infer.py", line 94, in <module>
        run(args)
      File "scripts/local_infer.py", line 66, in run
        tokenized = scoring_tokenizer(text, return_tensors="pt", padding=True, return_token_type_ids=False).to(args.device)
      File "/home/user/anaconda3/envs/detectgpt3.8/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2883, in __call__
        encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
      File "/home/user/anaconda3/envs/detectgpt3.8/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2989, in _call_one
        return self.encode_plus(
      File "/home/user/anaconda3/envs/detectgpt3.8/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 3062, in encode_plus
        return self._encode_plus(
      File "/home/user/anaconda3/envs/detectgpt3.8/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 722, in _encode_plus
        return self.prepare_for_model(
      File "/home/user/anaconda3/envs/detectgpt3.8/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 3541, in prepare_for_model
        encoded_inputs = self.pad(
      File "/home/user/anaconda3/envs/detectgpt3.8/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 3346, in pad
        encoded_inputs = self._pad(
      File "/home/user/.cache/huggingface/modules/transformers_modules/chatglm3-6b-base/tokenization_chatglm.py", line 254, in _pad
        assert self.padding_side == "left"
    AssertionError

def load_tokenizer(model_name, for_dataset, cache_dir):
    model_fullname = get_model_fullname(model_name)
    optional_tok_kwargs = {}
    if "facebook/opt-" in model_fullname:
        print("Using non-fast tokenizer for OPT")
        optional_tok_kwargs['fast'] = False
    if for_dataset in ['pubmed']:
        optional_tok_kwargs['padding_side'] = 'left'
    else:
        optional_tok_kwargs['padding_side'] = 'right'
    base_tokenizer = from_pretrained(AutoTokenizer, model_fullname, optional_tok_kwargs, cache_dir=cache_dir)
    if base_tokenizer.pad_token_id is None:
        base_tokenizer.pad_token_id = base_tokenizer.eos_token_id
        if '13b' in model_fullname:
            base_tokenizer.pad_token_id = 0
    return base_tokenizer

baoguangsheng commented 1 month ago

As in my previous answer, you can change padding_side uniformly to "left" to suit your model; it will not affect the detection results.
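As a minimal sketch of that change, here is a hypothetical variant of the optional_tok_kwargs logic in model.py's load_tokenizer that forces left padding for chatglm models regardless of dataset (the helper name is illustrative):

```python
def tokenizer_kwargs_for(model_fullname, for_dataset):
    """Build tokenizer kwargs, forcing left padding for chatglm models.

    chatglm's custom tokenizer asserts padding_side == "left", so set it
    unconditionally for those models; other models keep the original
    per-dataset behavior.
    """
    kwargs = {}
    if "facebook/opt-" in model_fullname:
        kwargs['fast'] = False  # non-fast tokenizer for OPT, as in the repo
    if "chatglm" in model_fullname or for_dataset in ['pubmed']:
        kwargs['padding_side'] = 'left'
    else:
        kwargs['padding_side'] = 'right'
    return kwargs
```

Since Fast-DetectGPT scores the same tokens either way, the padding side only affects how batches are aligned, not the detection result.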

The configuration code in model.py can be changed freely to fit your needs; you don't have to stick to the previous settings, as long as your model's configuration requirements are met.