maxtrem closed this issue 3 years ago
Hi @maxtrem,
In order to handle long inputs, we extract potentially overlapping spans from the Doc, and pass those into the transformer. So let's say you have a doc of 20 tokens, a window of 15, and a stride of 10. This will cut the doc into a slice of 15 and a slice of 10. If the transformer has a width of 768, we'll get a tensor of shape (2, 15, 768) (after padding). The alignment points into a table that refers to a 2D reshaped version of that tensor, with 2 × 15 = 30 rows, so you'll see indices up to 29.
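The arithmetic in that example can be checked with a small sketch. The helper `make_spans` is hypothetical and written only for illustration; it is not the span getter that spacy-transformers actually uses, and it treats each token as one wordpiece, as the explanation above does.

```python
def make_spans(n_tokens, window, stride):
    """Return (start, end) token offsets of overlapping windows.

    Hypothetical helper for illustration only, not spaCy's implementation.
    """
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break  # the doc is fully covered
        start += stride
    return spans


spans = make_spans(n_tokens=20, window=15, stride=10)
lengths = [end - start for start, end in spans]

# Shorter slices are padded to the longest one, so every row has equal length.
padded_len = max(lengths)
width = 768  # transformer width from the example above
tensor_shape = (len(spans), padded_len, width)

# Reshaping (2, 15, 768) to 2D gives 30 rows, so flat indices run from 0 to 29.
n_rows = len(spans) * padded_len

print(spans)         # [(0, 15), (10, 20)]
print(lengths)       # [15, 10]
print(tensor_shape)  # (2, 15, 768)
print(n_rows)        # 30
```

Tokens 10 to 14 appear in both slices, which is why an alignment can point at more rows than the doc has tokens.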
So what you'll need to do is either flatten the nested doc._.trf_data.strings list, so that you can point into it with the index, or map the index back to two dimensions like this:
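# Assumes every row of strings has the same (padded) length.
strings = doc._.trf_data.strings
seq_size = len(strings[-1])
batch = index // seq_size  # which span (row) the index falls in
item = index % seq_size    # position within that span
string = strings[batch][item]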
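Both options can be tried without loading a pipeline. The sketch below uses a made-up nested list in place of the real wordpiece strings, with rows already padded to equal length (an assumption both approaches rely on); the name `lookup_2d` is hypothetical.

```python
# Toy stand-in for the nested list of wordpiece strings: two rows padded to
# the same length, mirroring a tensor of shape (2, 4, width). Values are made up.
strings = [
    ["<s>", "Hello", "world", "</s>"],
    ["<s>", "again", "</s>", "<pad>"],
]
seq_size = len(strings[0])


def lookup_2d(index):
    """Map a flat alignment index back to (row, position) and look it up."""
    batch, item = divmod(index, seq_size)
    return strings[batch][item]


# Alternative: flatten the nested list once and index into it directly.
flat = [piece for row in strings for piece in row]

index = 5
print(lookup_2d(index))  # again
print(flat[index])       # again
```

Flattening is simpler when many indices are looked up; the two-dimensional mapping is useful when the row (span) number itself is needed, for example to index the tensor.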
Ah okay, it makes perfect sense now. I was actually wondering why there is a 2 in the first dimension of the tensor shape (2, 136, 768).
I think I'm well served with this explanation, thanks a lot!
How to reproduce the behaviour
I'm trying to extract trf vectors for certain spans in spaCy and encountered several IndexErrors on the way. Here is one example:
As I understand it, the last output (174) should be the alignment with the wordpieces and trf vectors. However, this raises an IndexError, as both wordpieces and vectors are significantly shorter. Is this a bug or intentional? If the latter is the case, how could I extract the correct alignment?
Thanks!
Here are the full wordpieces and alignments:
Your Environment
'3.8.5 (default, Sep 4 2020, 07:30:14) \n[GCC 7.3.0]', '3.8.1 (default, Jan 8 2020, 22:29:32) \n[GCC 7.3.0]'
3.0.1