THU-BPM / MarkLLM

MarkLLM: An Open-Source Toolkit for LLM Watermarking (EMNLP 2024 Demo)
https://aclanthology.org/2024.emnlp-demo.7/
Apache License 2.0

Unable to Run SIR watermark algorithm #21

Closed. Xieyangxinyu closed this issue 3 weeks ago.

Xieyangxinyu commented 2 months ago

Hi, would you mind sharing the compositional-bert-large-uncased model needed for this watermark algorithm? I found this model card, but it seems like I would have to train it from scratch?

Also, when I set the SIR config file to the following:

{
    "algorithm_name": "SIR",
    "delta": 1.0,
    "chunk_length": 10,
    "scale_dimension": 300,
    "z_threshold": 0.2,
    "transform_model_input_dim": 1024,
    "transform_model_name": "watermark/sir/model/transform_model_cbert.pth",
    "embedding_model_path": "perceptiveshawty/compositional-bert-large-uncased",
    "mapping_name": "watermark/sir/mapping/300_mapping_50272.json"
}

I get the following error message:

  File "/MarkLLM/watermark/sir/sir.py", line 221, in generate_watermarked_text
    encoded_watermarked_text = generate_with_watermark(**encoded_prompt)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/conda/2024-04-29/mconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/conda/2024-04-29/lib/python3.11/site-packages/transformers/generation/utils.py", line 2024, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/conda/2024-04-29/lib/python3.11/site-packages/transformers/generation/utils.py", line 2992, in _sample
    next_token_scores = logits_processor(input_ids, next_token_logits)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/conda/2024-04-29/lib/python3.11/site-packages/transformers/generation/logits_process.py", line 98, in __call__
    scores = processor(input_ids, scores)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/MarkLLM/watermark/sir/sir.py", line 189, in __call__
    scores = self._bias_logits(scores=scores, batched_bias=batched_bias)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/MarkLLM/watermark/sir/sir.py", line 175, in _bias_logits
    scores = scores + batched_bias * self.config.delta
             ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
RuntimeError: The size of tensor a (152064) must match the size of tensor b (50272) at non-singleton dimension 1

Thanks!

panly2003 commented 2 months ago

Thank you for your questions!

  1. There’s no need to train the compositional-bert-large-uncased model, as it is solely used for embedding purposes.
  2. Based on the error message, you are generating with an LLM whose vocabulary size is 152,064, but the vocab_size parameter in your transformers_config is still set for a 50,272-token vocabulary. You should also change mapping_name in the configuration file to point to a mapping that matches the new vocabulary size; a minimal sketch of both fixes follows this list.

Hope this helps! If you have any further questions, feel free to ask. 😊