explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Filter duplicate vectors when pruning vectors #5397

Open · adrianeboyd opened this issue 4 years ago

adrianeboyd commented 4 years ago

How to reproduce the behaviour

When prioritizing which vectors to keep, Vocab.prune_vectors doesn't handle existing duplicates in key2row well. Because it sorts/prioritizes by the row values from key2row, which may contain duplicates (multiple keys mapped to the same row), prune_vectors can end up keeping multiple copies of the same vector.
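A rough sketch of how the duplicates can arise (not from the original report; the words, vector width, and target size below are made up): point two keys at the same row, as a previous pruning would, and then prune again.

import numpy
from spacy.vocab import Vocab

vocab = Vocab()
words = ["cat", "feline", "sat", "reclined", "dog", "run", "sprint", "walk"]
data = numpy.random.uniform(-1, 1, (len(words), 8)).astype("float32")
for word, vector in zip(words, data):
    vocab.set_vector(word, vector)

# Simulate an earlier pruning step: "feline" now shares a row with "cat",
# so key2row contains a duplicate row value.
cat_row = vocab.vectors.key2row[vocab.strings["cat"]]
vocab.vectors.add(vocab.strings["feline"], row=cat_row)

vocab.prune_vectors(4)
data = vocab.vectors.data
# With the bug, the same vector can occupy more than one of the kept rows.
print(numpy.unique(data, axis=0).shape[0], "unique rows out of", data.shape[0])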

Fix:

0x0badc0de commented 3 years ago

Just faced this issue today when I tried to prune a previously pruned vocabulary. I wonder if the fix could consist of forming a mask over the indices/keys that picks up the first nr_row unique rows, then using the inverted mask to pick up the rest. Not deeply tested, just a raw idea.
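As a standalone illustration of the mask idea (plain NumPy, with a made-up indices array; nr_row here is just an example value):

import numpy as np

nr_row = 3
# Row indices in priority order; duplicates appear when keys already share a row.
indices = np.asarray([4, 4, 1, 7, 1, 2, 0], dtype="uint64")

# Index of the first occurrence of each unique row, restored to priority order.
_, unique_indices_idx = np.unique(indices, return_index=True)
unique_indices_idx.sort()

keep_mask = np.zeros(indices.size, dtype=bool)
keep_mask[unique_indices_idx[:nr_row]] = True

print(indices[keep_mask])   # [4 1 7]   -> the first nr_row unique rows
print(indices[~keep_mask])  # [4 1 2 0] -> everything else, to be remapped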

It could be implemented in prune_vectors like the following:

def prune_vectors(self, nr_row, batch_size=1024):
    """Reduce the current vector table to `nr_row` unique entries. Words
    mapped to the discarded vectors will be remapped to the closest vector
    among those remaining.
    For example, suppose the original table had vectors for the words:
    ['sat', 'cat', 'feline', 'reclined']. If we prune the vector table to
    two rows, we would discard the vectors for 'feline' and 'reclined'.
    These words would then be remapped to the closest remaining vector
    -- so "feline" would have the same vector as "cat", and "reclined"
    would have the same vector as "sat".
    The similarities are judged by cosine. The original vectors may
    be large, so the cosines are calculated in minibatches, to reduce
    memory usage.
    nr_row (int): The number of rows to keep in the vector table.
    batch_size (int): Batch of vectors for calculating the similarities.
        Larger batch sizes might be faster, while temporarily requiring
        more memory.
    RETURNS (dict): A dictionary keyed by removed words mapped to
        `(string, score)` tuples, where `string` is the entry the removed
        word was mapped to, and `score` the similarity score between the
        two words.
    DOCS: https://spacy.io/api/vocab#prune_vectors
    """
    ops = get_current_ops()
    xp = get_array_module(self.vectors.data)
    # Make sure all vectors are in the vocab
    for orth in self.vectors:
        self[orth]
    # Make prob negative so it sorts by rank ascending
    # (key2row contains the rank)
    priority = [(-lex.prob, self.vectors.key2row[lex.orth], lex.orth)
                for lex in self if lex.orth in self.vectors.key2row]
    priority.sort()
    indices = xp.asarray([i for (prob, i, key) in priority], dtype="uint64")
    keys = xp.asarray([key for (prob, i, key) in priority], dtype="uint64")

    _, unique_indices_idx = xp.unique(indices, return_index=True)
    # unique_indices_idx points to the first occurrence of each unique value;
    # its order matches the order of the sorted unique values of `indices`.
    # Later we need the first nr_row unique values from `indices` in priority
    # order, hence sort `unique_indices_idx`.
    unique_indices_idx.sort()

    # A mask that picks the first nr_row unique rows from indices/keys.
    keep_mask = xp.zeros(indices.size, dtype=bool)
    keep_mask[unique_indices_idx[:nr_row]] = True

    keep = xp.ascontiguousarray(self.vectors.data[indices[keep_mask]])
    toss = xp.ascontiguousarray(self.vectors.data[indices[~keep_mask]])
    self.vectors = Vectors(data=keep, keys=keys[keep_mask], name=self.vectors.name)
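    # For each discarded vector, find the most similar vector among those kept.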
    syn_keys, syn_rows, scores = self.vectors.most_similar(toss, batch_size=batch_size)
    syn_keys = ops.to_numpy(syn_keys)
    remap = {}
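    # Point each discarded key at the row of its nearest kept vector and record the mapping.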
    for i, key in enumerate(ops.to_numpy(keys[~keep_mask])):
        self.vectors.add(key, row=syn_rows[i][0])
        word = self.strings[key]
        synonym = self.strings[syn_keys[i][0]]
        score = scores[i][0]
        remap[word] = (synonym, score)
    return remap
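For reference, a rough way to sanity-check the result on a pipeline whose key2row already maps many keys to shared rows (assuming en_core_web_md is installed; the target row count is arbitrary):

import numpy
import spacy

nlp = spacy.load("en_core_web_md")
nlp.vocab.prune_vectors(10000)
data = nlp.vocab.vectors.data
# With duplicates filtered out, every kept row should hold a distinct vector.
print(numpy.unique(data, axis=0).shape[0], "unique rows out of", data.shape[0])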
adrianeboyd commented 3 years ago

Sorry this didn't work as expected, and thanks for the suggestion! This issue is kind of low priority on our end right now, but we'll try to come back to it when we find a bit of time.