Open · adrianeboyd opened this issue 4 years ago
Just faced this issue today when I tried to prune a previously pruned vocabulary. I wonder if the fix could consist of forming a mask over `indices`/`keys` that picks up the first `nr_row` unique rows, then using the inverted mask to pick up the rest. Not deeply tested, just a rough idea.
It could be implemented like the following:
def prune_vectors(self, nr_row, batch_size=1024):
    """Reduce the current vector table to `nr_row` unique entries. Words
    mapped to the discarded vectors will be remapped to the closest vector
    among those remaining.

    For example, suppose the original table had vectors for the words:
    ['sat', 'cat', 'feline', 'reclined']. If we prune the vector table to
    two rows, we would discard the vectors for 'feline' and 'reclined'.
    These words would then be remapped to the closest remaining vector
    -- so "feline" would have the same vector as "cat", and "reclined"
    would have the same vector as "sat".

    The similarities are judged by cosine. The original vectors may
    be large, so the cosines are calculated in minibatches, to reduce
    memory usage.

    nr_row (int): The number of rows to keep in the vector table.
    batch_size (int): Batch of vectors for calculating the similarities.
        Larger batch sizes might be faster, while temporarily requiring
        more memory.
    RETURNS (dict): A dictionary keyed by removed words mapped to
        `(string, score)` tuples, where `string` is the entry the removed
        word was mapped to, and `score` the similarity score between the
        two words.

    DOCS: https://spacy.io/api/vocab#prune_vectors
    """
    ops = get_current_ops()
    xp = get_array_module(self.vectors.data)
    # Make sure all vectors are in the vocab
    for orth in self.vectors:
        self[orth]
    # Make prob negative so it sorts by rank ascending
    # (key2row contains the rank)
    priority = [(-lex.prob, self.vectors.key2row[lex.orth], lex.orth)
                for lex in self if lex.orth in self.vectors.key2row]
    priority.sort()
    indices = xp.asarray([i for (prob, i, key) in priority], dtype="uint64")
    keys = xp.asarray([key for (prob, i, key) in priority], dtype="uint64")
    _, unique_indices_idx = xp.unique(indices, return_index=True)
    # unique_indices_idx points to the first occurrences of the unique values,
    # and its order matches the order of the sorted unique values of indices.
    # Later we need to pick the first nr_row unique values from `indices`,
    # so sort `unique_indices_idx`.
    unique_indices_idx.sort()
    # A mask that picks the first nr_row unique rows from indices/keys.
    keep_mask = xp.zeros(indices.size, dtype=bool)
    keep_mask[unique_indices_idx[:nr_row]] = True
    keep = xp.ascontiguousarray(self.vectors.data[indices[keep_mask]])
    toss = xp.ascontiguousarray(self.vectors.data[indices[~keep_mask]])
    self.vectors = Vectors(data=keep, keys=keys[keep_mask], name=self.vectors.name)
    syn_keys, syn_rows, scores = self.vectors.most_similar(toss, batch_size=batch_size)
    syn_keys = ops.to_numpy(syn_keys)
    remap = {}
    for i, key in enumerate(ops.to_numpy(keys[~keep_mask])):
        self.vectors.add(key, row=syn_rows[i][0])
        word = self.strings[key]
        synonym = self.strings[syn_keys[i][0]]
        score = scores[i][0]
        remap[word] = (synonym, score)
    return remap
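For reference, a minimal sketch of how the patched method would be called from user code; it assumes a pipeline with a loaded vector table, and the model name here is only an example:

import spacy

# Assumes a package with vectors is installed; "en_core_web_md" is just an example.
nlp = spacy.load("en_core_web_md")
remap = nlp.vocab.prune_vectors(10000)

# Each discarded word maps to a (kept_word, similarity_score) tuple.
for word, (synonym, score) in list(remap.items())[:5]:
    print(word, "->", synonym, float(score))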
Sorry this didn't work as expected, and thanks for the suggestion! This issue is kind of low priority on our end right now, but we'll try to come back to it when we find a bit of time.
How to reproduce the behaviour
When prioritizing vectors to keep, `Vocab.prune_vectors` doesn't handle existing duplicates from `key2row` well. By sorting/prioritizing by values from `key2row`, which may contain duplicate values, `prune_vectors` may keep multiple copies of the same vector.

Fix:
- deduplicate `indices` and adjust `keys` accordingly (sketched below)
- preserve the duplicate `key2row` keys (which is not compatible with the keys truncation from #5348, which is overly simple; I think you have to re-add the duplicate rows to vectors after initialization with `Vectors.add(row=)`)
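To make the first fix suggestion concrete, here is a small numpy-only sketch (with made-up toy arrays, independent of spaCy) of deduplicating `indices` while keeping `keys` aligned, along the lines of the patch above:

import numpy as np

# Toy stand-ins for the priority-sorted `indices`/`keys` built inside
# prune_vectors: row 0 and row 2 each appear twice because two keys share a row.
indices = np.asarray([0, 0, 1, 2, 2, 3], dtype="uint64")
keys = np.asarray([10, 11, 12, 13, 14, 15], dtype="uint64")
nr_row = 3

# np.unique(..., return_index=True) gives the position of the first occurrence
# of each row value; sorted back into priority order, those positions select
# exactly one copy of each row.
_, first_occurrence = np.unique(indices, return_index=True)
first_occurrence.sort()

keep_mask = np.zeros(indices.size, dtype=bool)
keep_mask[first_occurrence[:nr_row]] = True

print(indices[keep_mask])    # [0 1 2]    -> unique rows kept
print(keys[keep_mask])       # [10 12 13] -> keys kept with those rows
print(keys[~keep_mask])      # [11 14 15] -> keys to re-add / remap afterwards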