HazyResearch / domino


generate_candidate_descriptions() is not working #15

Open tanutarou opened 11 months ago

tanutarou commented 11 months ago

When I run examples/01_intro.ipynb on Google Colab, I get an error in the following cell.

Could you tell me how to fix this problem? Thank you!

from domino import generate_candidate_descriptions
phrase_templates = [
    "a photo of [MASK].",
    "a photo of {} [MASK].",
    "a photo of [MASK] {}.",
    "a photo of [MASK] {} [MASK].",
]

text_df = generate_candidate_descriptions(
    templates=phrase_templates,
    num_candidates=10_000
)
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
(…)cased/resolve/main/tokenizer_config.json: 100%
28.0/28.0 [00:00<00:00, 696B/s]
(…)bert-base-uncased/resolve/main/vocab.txt: 100%
232k/232k [00:00<00:00, 5.61MB/s]
(…)base-uncased/resolve/main/tokenizer.json: 100%
466k/466k [00:00<00:00, 10.5MB/s]
(…)rt-base-uncased/resolve/main/config.json: 100%
570/570 [00:00<00:00, 20.6kB/s]
model.safetensors: 100%
440M/440M [00:03<00:00, 147MB/s]
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 625/625 [01:59<00:00,  5.22it/s]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
[<ipython-input-14-260a50d5a169>](https://localhost:8080/#) in <cell line: 9>()
      7 ]
      8 
----> 9 text_df = generate_candidate_descriptions(
     10     templates=phrase_templates,
     11     num_candidates=10_000

16 frames
[/usr/local/lib/python3.10/dist-packages/meerkat/block/manager.py](https://localhost:8080/#) in add_column(self, col, name)
    257         type."""
    258         if len(self) > 0 and len(col) != self.nrows:
--> 259             raise ValueError(
    260                 f"Cannot add column '{name}' with length {len(col)} to `BlockManager` "
    261                 f" with length {self.nrows} columns."

ValueError: Cannot add column 'pkey' with length 10000 to `BlockManager`  with length 100000 columns.
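
For context, the check that raises here is visible in the quoted meerkat/block/manager.py frame: a column can only be added to a BlockManager whose existing columns have the same number of rows. In this run, 'pkey' has 10,000 entries (matching num_candidates) while the columns already in the manager hold 100,000 rows. A minimal, self-contained sketch of that invariant follows; the BlockManagerSketch class and the column names/lengths are illustrative, not meerkat's real internals.

class BlockManagerSketch:
    def __init__(self):
        self._columns = {}

    def __len__(self):
        # number of columns currently stored
        return len(self._columns)

    @property
    def nrows(self):
        # number of rows in the columns already stored
        return len(next(iter(self._columns.values()))) if self._columns else 0

    def add_column(self, col, name):
        # same length check as the quoted meerkat code
        if len(self) > 0 and len(col) != self.nrows:
            raise ValueError(
                f"Cannot add column '{name}' with length {len(col)} to `BlockManager` "
                f" with length {self.nrows} columns."
            )
        self._columns[name] = col

mgr = BlockManagerSketch()
mgr.add_column(list(range(100_000)), "output_phrase")  # illustrative: earlier columns hold 100,000 rows
mgr.add_column(list(range(10_000)), "pkey")             # raises the ValueError shown above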
meng8407 commented 9 months ago

I have the same problem. How can I solve it?

Jjx003 commented 8 months ago

+1

marti99b1 commented 7 months ago

I have the same problem!

Supltz commented 4 months ago

I rewrote the function _forward_mlm in generate.py as below, and that resolved the bug for me. I'm not sure whether this is the correct approach or whether it affects the results, so I'm still waiting for an official fix.

def _forward_mlm(words):
    output_phrases = []
    output_probs = []
    for word in words:
        input_phrases = [template.format(word) for template in templates]
        inputs = tokenizer(input_phrases, return_tensors="pt", padding=True).to(device)
        input_ids = inputs["input_ids"]
        outputs = model(**inputs)
        probs = torch.softmax(outputs.logits, dim=-1).detach()
        top_k_out = probs.topk(k=k, dim=-1)  # top-k token ids and probabilities at every position

        word_probs = []
        word_phrases = []
        for sent_idx in range(probs.shape[0]):
            # positions whose token id equals PAD_TOKEN_ID, i.e. the slots this rewrite fills in
            mask_mask = input_ids[sent_idx] == PAD_TOKEN_ID
            mask_range = torch.arange(mask_mask.sum())
            token_ids = top_k_out.indices[sent_idx, mask_mask]
            token_probs = top_k_out.values[sent_idx, mask_mask]

            best_local_idx = token_probs.mean(dim=1).argmax()
            output_ids = torch.clone(input_ids[sent_idx])
            output_ids[mask_mask] = token_ids[mask_range, best_local_idx]
            word_phrases.append(tokenizer.decode(output_ids, skip_special_tokens=True))
            # mean probability of the tokens actually written into output_ids
            mean_probability = token_probs[mask_range, best_local_idx].mean().item()
            word_probs.append(mean_probability)

        # keep the phrase with the highest average probability across its filled slots
        best_idx = np.argmax(word_probs)
        output_phrases.append(word_phrases[best_idx])
        output_probs.append(word_probs[best_idx])

    return {"prob": output_probs, "output_phrase": output_phrases}
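
For anyone applying this patch: the snippet above appears to be a drop-in replacement for a helper nested inside generate_candidate_descriptions, so it relies on names from the enclosing scope (templates, tokenizer, model, device, k, PAD_TOKEN_ID) and on the torch and numpy imports already in generate.py. After editing the file, a quick sanity check is simply to rerun the original notebook cell with a smaller num_candidates, e.g.:

from domino import generate_candidate_descriptions

phrase_templates = [
    "a photo of [MASK].",
    "a photo of {} [MASK].",
    "a photo of [MASK] {}.",
    "a photo of [MASK] {} [MASK].",
]

text_df = generate_candidate_descriptions(
    templates=phrase_templates,
    num_candidates=1_000,  # smaller run just to confirm the ValueError is gone
)
print(len(text_df))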
scao0208 commented 3 weeks ago

(quoting the _forward_mlm rewrite from Supltz's comment above)

It works for me. Thanks a lot!