MaartenGr / KeyBERT

Minimal keyword extraction with BERT
https://MaartenGr.github.io/KeyBERT/
MIT License

Newer versions of Spacy transformer model backends failing #246

Open · Louis-Paul-Bowman opened 1 month ago

Louis-Paul-Bowman commented 1 month ago

I use spaCy's transformer model for other purposes (such as NER), so re-using the same model made sense. It looks like spaCy made some changes to their API that are breaking KeyBERT's spaCy backend.

Sample code:

```python
from keybert import KeyBERT
from spacy import load

nlp = load("en_core_web_trf", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
kw_model = KeyBERT(model=nlp)

text = "This is a test sentence."

keywords = kw_model.extract_keywords(text, keyphrase_ngram_range=(1, 1), stop_words='english', top_n=1, use_mmr=True)
print(keywords)
```

Expected behavior: prints `[("test", ...)]`

Observed behavior:

```
Traceback (most recent call last):
  File "...\anaconda3\envs\env\lib\site-packages\keybert\backend\_spacy.py", line 84, in embed
    self.embedding_model(doc)._.trf_data.tensors[-1][0].tolist()
AttributeError: 'DocTransformerOutput' object has no attribute 'tensors'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "...\test.py", line 9, in <module>
    keywords = kw_model.extract_keywords(text, keyphrase_ngram_range=(1, 1), stop_words='english', top_n=1, use_mmr=True)
  File "...\envs\env\lib\site-packages\keybert\_model.py", line 195, in extract_keywords
    doc_embeddings = self.model.embed(docs)
  File "...\envs\env\lib\site-packages\keybert\backend\_spacy.py", line 88, in embed
    self.embedding_model("An empty document")
AttributeError: 'DocTransformerOutput' object has no attribute 'tensors'
```

Package versions:

```
cupy-cuda11x                 12.3.0
curated-tokenizers           0.0.9
curated-transformers         0.1.1
en-core-web-trf              3.7.3
keybert                      0.8.5
keyphrase-vectorizers        0.0.13
safetensors                  0.4.4
scikit-learn                 1.5.1
scipy                        1.13.1
sentence-transformers        3.0.1
spacy                        3.7.5
spacy-alignments             0.9.1
spacy-curated-transformers   0.2.2
spacy-legacy                 3.0.12
spacy-loggers                1.0.5
spacy-transformers           1.3.5
thinc                        8.2.5
tokenizers                   0.15.2
transformers                 4.36.2
```

MaartenGr commented 1 month ago

Thank you for sharing this issue! If I'm not mistaken, this is the result of an updated version of spaCy. I believe there should be an additional check here to see which version of spaCy is being used and, for newer versions, read from DocTransformerOutput instead. If you are interested, a PR would be great. If you do not have the time, I can start working on it.
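
For reference, a minimal sketch of the kind of version gate this could use (the 3.7.0 cutoff and the flag name are assumptions; the exact switch point would need checking):

```python
from packaging import version
from spacy import __version__ as spacy_version

# Assumption: spaCy >= 3.7 ships its transformer pipelines via
# spacy-curated-transformers, whose output no longer exposes `.tensors`.
uses_curated_transformers = version.parse(spacy_version) >= version.parse("3.7.0")
```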

Louis-Paul-Bowman commented 1 month ago

I'll give it a look. From what I can tell, the easiest solution is probably to check the spaCy version (or the existence of curated-transformers), set a flag, and then in `embed` replace the `._.trf_data.tensors` access with `.last_hidden_layer_state`.

https://spacy.io/api/curatedtransformer#doctransformeroutput-lasthiddenlayerstate
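
A hypothetical sketch of that swap, probing the output object directly instead of a version flag (the `hasattr` check is an assumption, not the final patch):

```python
import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("This is a test sentence.")
trf_data = doc._.trf_data

if hasattr(trf_data, "tensors"):
    # spacy-transformers (older backend): pooled output tensors
    embedding = trf_data.tensors[-1][0].tolist()
else:
    # spacy-curated-transformers (newer backend): a Ragged of per-token
    # final-layer activations; `.data` holds the underlying array
    embedding = trf_data.last_hidden_layer_state.data[0].tolist()
```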

Louis-Paul-Bowman commented 1 month ago

Hello again. I tinkered with it for as long as I had time for today, but didn't make a PR because, while the code runs, I don't think it's functioning as intended. I may be missing some of your original logic, or maybe the new curated-transformers have gotten rid of a special `<s>` token that was being used as the document embedding in the final layer, but in my minimal example the document embedding (the first token of the final layer) is all zeros (perhaps because I have a line break? Unknown).

Some notes: As indicated in your linked issue, spaCy moved to curated-transformers, which changed the properties of the transformer output. The new output has a few key properties:

- `.last_hidden_layer_state`: per-token tensors of the final layer, always present
- `.all_hidden_layer_states`: the full model activations, only available when setting `all_layer_outputs=True` (on init, or with `nlp.select_pipe("transformer")`)
- `.embedding_layer`: unclear how this differs from `last_hidden_layer_state`; the first token in my case was also all zeros (and it is likewise only available when `all_layer_outputs` is True)
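
For anyone reproducing the all-zero first row, a small diagnostic along these lines shows it (assuming the `Ragged`-style `.data` layout from the linked docs):

```python
import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("This is a test\nsentence.")

# Per-token activations of the final layer; `.data` stacks the
# token-piece vectors row by row.
hidden = doc._.trf_data.last_hidden_layer_state
print(hidden.data.shape)  # (n_token_pieces, hidden_size)
print(hidden.data[0])     # reportedly all zeros here
```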

`_spacy.py`:

````python
import numpy as np
from tqdm import tqdm
from typing import List
from packaging import version
from spacy import __version__ as spacy_version
from keybert.backend import BaseEmbedder


class SpacyBackend(BaseEmbedder):
    """Spacy embedding model

    The Spacy embedding model used for generating document and
    word embeddings.

    Arguments:
        embedding_model: A spacy embedding model

    Usage:

    To create a Spacy backend, you need to create an nlp object and
    pass it through this backend:

    ```python
    import spacy
    from keybert.backend import SpacyBackend

    nlp = spacy.load("en_core_web_md", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
    spacy_model = SpacyBackend(nlp)
    ```

    To load in a transformer model use the following:

    ```python
    import spacy
    from thinc.api import set_gpu_allocator, require_gpu
    from keybert.backend import SpacyBackend

    nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
    set_gpu_allocator("pytorch")
    require_gpu(0)
    spacy_model = SpacyBackend(nlp)
    ```

    If you run into gpu/memory-issues, please use:

    ```python
    import spacy
    from keybert.backend import SpacyBackend

    spacy.prefer_gpu()
    nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
    spacy_model = SpacyBackend(nlp)
    ```
    """

    def __init__(self, embedding_model):
        super().__init__()

        self.curated_transformers = False

        if "spacy" in str(type(embedding_model)):
            self.embedding_model = embedding_model
            if "transformer" in self.embedding_model.component_names:
                # spaCy >= 3.7 ships transformer pipelines on
                # curated-transformers, whose output has no `.tensors`
                if version.parse(spacy_version) >= version.parse("3.7.0"):
                    self.curated_transformers = True
        else:
            raise ValueError(
                "Please select a correct Spacy model by either using a string such as 'en_core_web_md' "
                "or create a nlp model using: `nlp = spacy.load('en_core_web_md')`"
            )

    def embed(self, documents: List[str], verbose: bool = False) -> np.ndarray:
        """Embed a list of n documents/words into an n-dimensional
        matrix of embeddings

        Arguments:
            documents: A list of documents or words to be embedded
            verbose: Controls the verbosity of the process

        Returns:
            Document/words embeddings with shape (n, m) with `n` documents/words
            that each have an embedding size of `m`
        """

        # Extract embeddings from a transformer model
        if "transformer" in self.embedding_model.component_names:
            embeddings = []
            for doc in tqdm(documents, position=0, leave=True, disable=not verbose):
                try:
                    if self.curated_transformers:
                        # Use the first token of the final layer's
                        # per-token activations as the document embedding
                        embedding = (
                            self.embedding_model(doc)._.trf_data.last_hidden_layer_state.data[0].tolist()
                        )
                    else:
                        embedding = (
                            self.embedding_model(doc)._.trf_data.tensors[-1][0].tolist()
                        )
                except Exception:
                    if self.curated_transformers:
                        embedding = (
                            self.embedding_model("An empty document")
                            ._.trf_data.last_hidden_layer_state.data[0]
                            .tolist()
                        )
                    else:
                        embedding = (
                            self.embedding_model("An empty document")
                            ._.trf_data.tensors[-1][0]
                            .tolist()
                        )
                embeddings.append(embedding)
            embeddings = np.array(embeddings)

        # Extract embeddings from a general spacy model
        else:
            embeddings = []
            for doc in tqdm(documents, position=0, leave=True, disable=not verbose):
                try:
                    vector = self.embedding_model(doc).vector
                except ValueError:
                    vector = self.embedding_model("An empty document").vector
                embeddings.append(vector)
            embeddings = np.array(embeddings)

        return embeddings
````

My test file:

```python
from keybert import KeyBERT
from spacy import load, require_gpu

require_gpu()
nlp = load("en_core_web_trf", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
kw_model = KeyBERT(model=nlp)

test_text = """ Sherlock Holmes (/ˈʃɜːrlɒk ˈhoʊmz/) is a fictional detective created by British author Arthur Conan Doyle. Referring to himself as a "consulting detective" in his stories, Holmes is known for his proficiency with observation, deduction, forensic science and logical reasoning that borders on the fantastic, which he employs when investigating cases for a wide variety of clients, including Scotland Yard. The character Sherlock Holmes first appeared in print in 1887's A Study in Scarlet. His popularity became widespread with the first series of short stories in The Strand Magazine, beginning with "A Scandal in Bohemia" in 1891; additional tales appeared from then until 1927, eventually totalling four novels and 56 short stories. All but one are set in the Victorian or Edwardian eras between 1880 and 1914. Most are narrated by the character of Holmes's friend and biographer, Dr. John H. Watson, who usually accompanies Holmes during his investigations and often shares quarters with him at the address of 221B Baker Street, London, where many of the stories begin. Though not the first fictional detective, Sherlock Holmes is arguably the best-known. By the 1990s, over 25,000 stage adaptations, films, television productions, and publications were already featuring the detective, and Guinness World Records lists him as the most portrayed human literary character in film and television history. Holmes's popularity and fame are such that many have believed him to be not a fictional character but an actual individual; numerous literary and fan societies have been founded on this pretence. Avid readers of the Holmes stories helped create the modern practice of fandom. The character and stories have had a profound and lasting effect on mystery writing and popular culture as a whole, with the original tales, as well as thousands written by authors other than Conan Doyle, being adapted into stage and radio plays, television, films, video games, and other media for over one hundred years. Edgar Allan Poe's C. Auguste Dupin is generally acknowledged as the first detective in fiction and served as the prototype for many later characters, including Holmes. Conan Doyle once wrote, "Each [of Poe's detective stories] is a root from which a whole literature has developed ... Where was the detective story until Poe breathed the breath of life into it?" Similarly, the stories of Émile Gaboriau's Monsieur Lecoq were extremely popular at the time Conan Doyle began writing Holmes, and Holmes's speech and behaviour sometimes follow those of Lecoq. Doyle has his main characters discuss these literary antecedents near the beginning of A Study in Scarlet, which is set soon after Watson is first introduced to Holmes. Watson attempts to compliment Holmes by comparing him to Dupin, to which Holmes replies that he found Dupin to be "a very inferior fellow" and Lecoq to be "a miserable bungler". Conan Doyle repeatedly said that Holmes was inspired by the real-life figure of Joseph Bell, a surgeon at the Royal Infirmary of Edinburgh, whom Conan Doyle met in 1877 and had worked for as a clerk. Like Holmes, Bell was noted for drawing broad conclusions from minute observations. However, he later wrote to Conan Doyle: "You are yourself Sherlock Holmes and well you know it". Sir Henry Littlejohn, Chair of Medical Jurisprudence at the University of Edinburgh Medical School, is also cited as an inspiration for Holmes. 
Littlejohn, who was also Police Surgeon and Medical Officer of Health in Edinburgh, provided Conan Doyle with a link between medical investigation and the detection of crime. Other possible inspirations have been proposed, though never acknowledged by Doyle, such as Maximilien Heller, by French author Henry Cauvain. In this 1871 novel (sixteen years before the first appearance of Sherlock Holmes), Henry Cauvain imagined a depressed, anti-social, opium-smoking polymath detective, operating in Paris. It is not known if Conan Doyle read the novel, but he was fluent in French. Similarly, Michael Harrison suggested that a German self-styled "consulting detective" named Walter Scherer may have been the model for Holmes. Details of Sherlock Holmes' life in Conan Doyle's stories are scarce and often vague. Nevertheless, mentions of his early life and extended family paint a loose biographical picture of the detective. A statement of Holmes' age in "His Last Bow" places his year of birth at 1854; the story, set in August 1914, describes him as sixty years of age. His parents are not mentioned, although Holmes mentions that his "ancestors" were "country squires". In "The Adventure of the Greek Interpreter", he claims that his grandmother was sister to the French artist Vernet, without clarifying whether this was Claude Joseph, Carle, or Horace Vernet. Holmes' brother Mycroft, seven years his senior, is a government official. Mycroft has a unique civil service position as a kind of human database for all aspects of government policy. Sherlock describes his brother as the more intelligent of the two, but notes that Mycroft lacks any interest in physical investigation, preferring to spend his time at the Diogenes Club. Holmes says that he first developed his methods of deduction as an undergraduate; his earliest cases, which he pursued as an amateur, came from his fellow university students. A meeting with a classmate's father led him to adopt detection as a profession. In the first Holmes tale, A Study in Scarlet, financial difficulties lead Holmes and Dr. Watson to share rooms together at 221B Baker Street, London. Their residence is maintained by their landlady, Mrs. Hudson. Holmes works as a detective for twenty-three years, with Watson assisting him for seventeen of those years. Most of the stories are frame narratives written from Watson's point of view, as summaries of the detective's most interesting cases. Holmes frequently calls Watson's records of Holmes's cases sensational and populist, suggesting that they fail to accurately and objectively report the "science" of his craft: Detection is, or ought to be, an exact science and should be treated in the same cold and unemotional manner. You have attempted to tinge it [A Study in Scarlet] with romanticism, which produces much the same effect as if you worked a love-story or an elopement into the fifth proposition of Euclid. ... Some facts should be suppressed, or, at least, a just sense of proportion should be observed in treating them. The only point in the case which deserved mention was the curious analytical reasoning from effects to causes, by which I succeeded in unravelling it. Nevertheless, when Holmes recorded a case himself, he was forced to concede that he could more easily understand the need to write it in a manner that would appeal to the public rather than his intention to focus on his own technical skill. Holmes's friendship with Watson is his most significant relationship. 
When Watson is injured by a bullet, although the wound turns out to be "quite superficial", Watson is moved by Holmes's reaction: """

keywords = kw_model.extract_keywords(test_text, keyphrase_ngram_range=(1, 1), stop_words='english', top_n=100, use_mmr=True)
print(*keywords, sep="\n")
```



Output from the test file (truncated):

```
('000', 0.0)
('generally', 0.0)
('ˈʃɜːrlɒk', 0.0)
('earliest', 0.0)
('investigation', 0.0)
('helped', 0.0)
('including', 0.0)
('official', 0.0)
('1990s', 0.0)
('know', 0.0)
('describes', 0.0)
('auguste', 0.0)
('like', 0.0)
('later', 0.0)
('assisting', 0.0)
('thousands', 0.0)
('dr', 0.0)
('deserved', 0.0)
('monsieur', 0.0)
('fifth', 0.0)
('exact', 0.0)
('british', 0.0)
('year', 0.0)
('wrote', 0.0)
('literary', 0.0)
('lead', 0.0)
('self', 0.0)
('mrs', 0.0)
('kind', 0.0)
('years', 0.0)
('1880', 0.0)
('edinburgh', 0.0)
('senior', 0.0)
('cited', 0.0)
('tales', 0.0)
('lasting', 0.0)
...
```

All keywords come back with a score of 0.0.

MaartenGr commented 1 month ago

Checking the documentation, it seems that you can access the embedding layer as follows: https://spacy.io/api/curatedtransformer#doctransformeroutput-embeddinglayer. We could then perhaps average all tokens to create an embedding for the entire document. Having said that, it would be preferable to find the [cls] token to use, but I cannot seem to find it in the documentation.
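
A minimal sketch of that averaging idea, using `last_hidden_layer_state` (mean pooling over token pieces is an assumption here, not current KeyBERT behaviour):

```python
import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("Sherlock Holmes is a fictional detective.")

# Average over all token-piece vectors of the final layer instead of
# relying on a special first token that curated-transformers may not emit.
token_vectors = doc._.trf_data.last_hidden_layer_state.data
doc_embedding = token_vectors.mean(axis=0).tolist()
```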