cisnlp / simalign

Obtain Word Alignments using Pretrained Language Models (e.g., mBERT)
MIT License

ValueError: Wrong shape for input_ids (shape torch.Size([18])) or attention_mask (shape torch.Size([18])) #10

Closed youssefavx closed 1 year ago

youssefavx commented 3 years ago

After running the example code provided I get this error:

>>> import simalign
>>> 
>>> source_sentence = "Sir Nils Olav III. was knighted by the norwegian king ."
>>> target_sentence = "Nils Olav der Dritte wurde vom norwegischen König zum Ritter geschlagen ."
>>> model = simalign.SentenceAligner()
2020-09-13 18:02:40,806 - simalign.simalign - INFO - Initialized the EmbeddingLoader with model: bert-base-multilingual-cased
I0913 18:02:40.806071 4394976704 simalign.py:47] Initialized the EmbeddingLoader with model: bert-base-multilingual-cased
>>> result = model.get_word_aligns(source_sentence.split(), target_sentence.split())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/simalign/simalign.py", line 181, in get_word_aligns
    vectors = self.embed_loader.get_embed_list(list(bpe_lists))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/simalign/simalign.py", line 65, in get_embed_list
    outputs = [self.emb_model(in_ids.to(self.device)) for in_ids in inputs]
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/simalign/simalign.py", line 65, in <listcomp>
    outputs = [self.emb_model(in_ids.to(self.device)) for in_ids in inputs]
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/transformers/modeling_bert.py", line 806, in forward
    extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/transformers/modeling_utils.py", line 248, in get_extended_attention_mask
    input_shape, attention_mask.shape
ValueError: Wrong shape for input_ids (shape torch.Size([18])) or attention_mask (shape torch.Size([18]))

I wonder if this is due to my recent update of transformers. If so, that's going to be difficult for me to work around, because the newest version of transformers has a fill-mask feature that was not available in previous versions, and I'm going to need it in conjunction with simalign's invaluable functionality.

Hopefully this is unrelated, but I did cancel the download and then restart it (it seemed to restart from a fresh file, though I could be wrong).

youssefavx commented 3 years ago

Is it possible to use custom models with simalign? (I'm mostly interested in alignment from English to English, not other languages.)

youssefavx commented 3 years ago

Maybe this is the issue?

https://github.com/huggingface/transformers/issues/20#issuecomment-438603774

pdufter commented 3 years ago

Hi @youssefavx, thanks for pointing this issue out. Custom models should work with simalign (just pass the path to the model when instantiating SentenceAligner). As for your error message: which version of transformers are you using?
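
For reference, a minimal sketch of that usage, assuming the model argument of SentenceAligner accepts a Hugging Face model name or a local path (roberta-base is only an illustrative choice, not a tested recommendation):

from simalign import SentenceAligner

# Sketch: pass a model name or local path instead of the default mBERT.
aligner = SentenceAligner(model="roberta-base")

src = "The quick brown fox jumps .".split()
tgt = "The fast brown fox leaps .".split()
print(aligner.get_word_aligns(src, tgt))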

youssefavx commented 3 years ago

Hey @pdufter Awesome! I'll experiment with custom models (I assume I could just use the model name like with the transformers library, or do I have to find an actual path to those?).

I'm running version 3.1.0.

pdufter commented 3 years ago

At the moment we have only tested with transformers==2.3.0.

youssefavx commented 3 years ago

Unfortunately I can’t really downgrade because there’s new functionality in the new transformers that is essential.

Do you know if there’s a way to run both versions of a package at the same time in the same application?

If not, then I guess I’ll try to debug this one and report back.

pdufter commented 3 years ago

I do not know whether you can run two versions at the same time. But we plan to make simalign usable with newer transformers versions and to add new features soon anyway, if that helps. In the meantime, if you find the issue, any pull request is obviously highly appreciated.

youssefavx commented 3 years ago

@pdufter Will do if I solve it!

youssefavx commented 3 years ago

Okay, I think I fixed this (or found the problem), but my fix breaks simalign for earlier versions of transformers. I really don't think compatibility with earlier versions is impossible; the break is more due to my own ignorance.

I should note that I have:

  1. Zero experience with Pytorch
  2. Very little experience with transformers

Perhaps you could add an if statement to the code like "if the version is earlier" (a much better check would obviously be one that detects whether the tensor is nested, since we don't know what Hugging Face will do at any point when they change their packages; it's probably also less tedious to set up the latter).
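
For example, a version-agnostic check could look at each tensor's dimensionality rather than the library version. A rough sketch against the inputs list from the function below (not tested):

# Sketch: add the batch dimension only when it is missing, so both
# old (2-D) and new (1-D) tokenizer outputs are handled the same way.
inputs = [in_ids if in_ids.dim() == 2 else in_ids.unsqueeze(0) for in_ids in inputs]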

So here's the problem:

In this function:

def get_embed_list(self, sent_pair):
    if self.emb_model is not None:
        sent_ids = [self.tokenizer.convert_tokens_to_ids(x) for x in sent_pair]

        # Guilty variable!! 20 years in prison for uuu variable!
        inputs = [self.tokenizer.prepare_for_model(sent, return_token_type_ids=True, return_tensors='pt')['input_ids'] for sent in sent_ids]
        # ^ This right here

        outputs = [self.emb_model(in_ids.to(self.device)) for in_ids in inputs]
        # use vectors from layer 8
        vectors = [x[2][self.layer].cpu().detach().numpy()[0][1:-1] for x in outputs]

        return vectors
    else:
        return None

When I print the "inputs" variable (after updating transformers to 3.1.0):

inputs= [tensor([  101, 12852, 33288, 46495, 10652,   119, 10134, 96820, 27521, 10336,
        10155, 10105, 31515, 16997, 11630, 20636,   119,   102]), tensor([  101, 33288, 46495, 10118, 11612, 81898, 10283, 11036, 31515, 16997,
        11611, 17260, 10580, 32017, 95023,   119,   102])]

The tensor you get is different, which I assume is why we get this error: ValueError: Wrong shape for input_ids (shape torch.Size([18])) or attention_mask (shape torch.Size([18]))

Whereas when I downgrade transformers, and I print inputs again:

inputs= [tensor([[  101, 12852, 33288, 46495, 10652,   119, 10134, 96820, 27521, 10336,
         10155, 10105, 31515, 16997, 11630, 20636,   119,   102]]), tensor([[  101, 33288, 46495, 10118, 11612, 81898, 10283, 11036, 31515, 16997,
         11611, 17260, 10580, 32017, 95023,   119,   102]])]
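
In other words, the only difference is a missing leading batch dimension. A small standalone illustration (toy ids, not real tokenizer output):

import torch

flat = torch.tensor([101, 12852, 102])        # new behaviour: torch.Size([3])
batched = torch.tensor([[101, 12852, 102]])   # old behaviour: torch.Size([1, 3])
print(flat.shape, batched.shape)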

So all I had to do was wrap it in another array (add a dimension?). Keep in mind I have no clue whatsoever how to do this appropriately, nor do I have any clue what I'm doing.

I searched online and came across this solution.

So here's the edit I made:

for in_ids in inputs:
    in_ids.resize_(1,len(in_ids))

In this function:

def get_embed_list(self, sent_pair):
    if self.emb_model is not None:
        sent_ids = [self.tokenizer.convert_tokens_to_ids(x) for x in sent_pair]

        inputs = [self.tokenizer.prepare_for_model(sent, return_token_type_ids=True, return_tensors='pt')['input_ids'] for sent in sent_ids]

        for in_ids in inputs:
            in_ids.resize_(1, len(in_ids))

        outputs = [self.emb_model(in_ids.to(self.device)) for in_ids in inputs]

        # use vectors from layer 8
        vectors = [x[2][self.layer].cpu().detach().numpy()[0][1:-1] for x in outputs]

        return vectors
    else:
        return None

So you may have better ideas as to what the implications of this edit are and how to better implement it.
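
For reference, since the number of elements does not change here, resize_(1, len(in_ids)) effectively just adds a batch dimension, which unsqueeze(0) would also do without modifying the tensor in place. A small standalone illustration (toy values, not simalign code):

import torch

# With the element count unchanged, resize_(1, n) is equivalent to adding
# a batch dimension; unsqueeze(0) does the same out of place.
t = torch.arange(5)
a = t.clone()
a.resize_(1, len(t))        # in-place: shape becomes [1, 5]
b = t.unsqueeze(0)          # out-of-place: shape [1, 5]
print(torch.equal(a, b))    # True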

youssefavx commented 3 years ago

And testing to make sure that the in_ids are the same before and after the resize:

in_ids before resize tensor([  101, 12852, 33288, 46495, 10652,   119, 10134, 96820, 27521, 10336,
        10155, 10105, 31515, 16997, 11630, 20636,   119,   102])
in_ids after resize tensor([[  101, 12852, 33288, 46495, 10652,   119, 10134, 96820, 27521, 10336,
         10155, 10105, 31515, 16997, 11630, 20636,   119,   102]])
masoudjs commented 3 years ago

@youssefavx Thank you for putting in the time to fix this. I think we should add the attention mask as a new input as well. I am updating the model to Transformers 3. I will finish it today or tomorrow.
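
A rough sketch of what that could look like, with assumed names (in_ids as in the function above, emb_model standing in for self.emb_model); since the sentences are not padded, the mask is simply all ones:

import torch

# Sketch: pass an explicit attention mask along with the batched input ids.
input_ids = in_ids.unsqueeze(0)                 # shape [1, seq_len]
attention_mask = torch.ones_like(input_ids)     # 1 for every real token (no padding)
outputs = emb_model(input_ids, attention_mask=attention_mask)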

youssefavx commented 3 years ago

@masoudjs Thank you for making such a useful and essential tool

Lukecn1 commented 3 years ago

I had the same issue, but it was resolved by wrapping my data in a torch Dataloader. I am not sure as to why that solved the problem, but solve it, it did.

ZhuoerFeng commented 3 years ago

> I had the same issue, but it was resolved by wrapping my data in a torch Dataloader. I am not sure as to why that solved the problem, but solve it, it did.

Modules in torch accept inputs in the form [batch_size, ...], so performing .unsqueeze(0) on the input/attention-mask tensors would help. That batching is exactly what torch.utils.data.DataLoader does for you.
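
As a small illustration of both points (toy ids, not real tokenizer output):

import torch
from torch.utils.data import DataLoader

# Both approaches add the leading batch dimension the model's forward pass expects.
ids = torch.tensor([101, 12852, 102])      # 1-D, shape [3]

# Option 1: add the batch dimension explicitly.
batched = ids.unsqueeze(0)                 # shape [1, 3]

# Option 2: let a DataLoader collate single samples into batches of size 1.
loader = DataLoader([ids], batch_size=1)
for batch in loader:
    print(batch.shape)                     # torch.Size([1, 3])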