MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly. #744

Closed chrisammon3000 closed 1 year ago

chrisammon3000 commented 2 years ago

I'm using BERTopic in a GPU Colab notebook (Python 3.7.14) and trying to run topic modeling on a document that consists of a single string, 11,412 characters long:

>>> print(transcript)
Hello friends, it's me today. We're checking out some cool things that I learned on tik-tok I do learn a lot of things on tik-tok how to remove a weed on green green is golf grass It's fake grass, right? Is it golf grass? Wait, I'm starting to think that's real grass So you literally just cut out a hole remove the entire weed and then put it back in like it's a piece of cake That mama said you can't eat yet wipe away the crumbs wipe away the evidence That's really how they do it. ...

>>> len(transcript)
11412

Code:

import spacy
from bertopic import BERTopic

spacy.require_gpu()
nlp = spacy.load("en_core_web_lg", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])

topic_model = BERTopic(embedding_model=nlp)

# passing input as an iterable
topics, probabilities = topic_model.fit_transform([transcript])

Error result:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-49-6159c101b53a> in <module>
----> 1 topics, probabilities = topic_model.fit_transform([transcript])

3 frames
/usr/local/lib/python3.7/dist-packages/bertopic/backend/_spacy.py in embed(self, documents, verbose)
     95                     vector = self.embedding_model("An empty document").vector
     96                 embeddings.append(vector)
---> 97             embeddings = np.array(embeddings)
     98 
     99         return embeddings

cupy/_core/core.pyx in cupy._core.core.ndarray.__array__()

TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.

When I break the document up by groups of three sentences, I get the same error:

docs = [
  ". ".join(transcript.split(". ")[:3]),
  ". ".join(transcript.split(". ")[3:6]),
  ". ".join(transcript.split(". ")[6:9]),
  ". ".join(transcript.split(". ")[9:12]),
  ". ".join(transcript.split(". ")[12:15])
]

topics, probabilities = topic_model.fit_transform(docs)
TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.
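The five hand-written slices above cover only the first 15 sentences. A minimal sketch of a helper that groups the whole transcript the same way is shown below. `chunk_sentences` is a hypothetical name, not part of BERTopic or spaCy, and unlike the snippet above it also strips whitespace and drops empty pieces. It does not address the `TypeError` itself; it only generalizes the grouping step.

```python
from typing import List


def chunk_sentences(text: str, size: int = 3) -> List[str]:
    """Group sentences into chunks of `size` sentences each.

    Sentences are split naively on ". ", mirroring the snippet above;
    this is an assumption, not a proper sentence segmenter.
    """
    # Split on ". " and discard pieces that are empty after stripping.
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    # Re-join consecutive sentences in groups of `size`.
    return [
        ". ".join(sentences[i:i + size])
        for i in range(0, len(sentences), size)
    ]
```

With this helper, `docs = chunk_sentences(transcript)` would produce one document per group of three sentences for the entire string.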

I also get the same error with the quickstart tutorial:

from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']
topic_model.fit_transform(docs[:10])

Could an incompatibility in the environment, such as the Python version or the version of another package like NumPy or CuPy, be causing this? Or am I using it incorrectly?
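One quick way to narrow down the environment question is to check which array library a vector actually comes from. After `spacy.require_gpu()`, spaCy vectors are typically CuPy arrays rather than NumPy arrays, which is what the error message is complaining about. The sketch below uses a hypothetical helper name, `array_backend`, and only inspects the type's module.

```python
import numpy as np


def array_backend(obj) -> str:
    """Return the top-level package that defines the object's type."""
    # e.g. "numpy" for numpy.ndarray; a CuPy array would be expected
    # to report "cupy" (assumption: not verified here without a GPU).
    return type(obj).__module__.split(".")[0]
```

Calling `array_backend(nlp("hello").vector)` in the failing notebook would show whether the vector lives on the GPU.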

MaartenGr commented 2 years ago

I believe spaCy has changed the way it handles empty documents when creating vectors, which was not accounted for in BERTopic. I'll have to do some more research to see if the issue can be handled better.

metasyn commented 1 year ago

I see you referenced a fix in this commit: https://github.com/MaartenGr/BERTopic/commit/a7927a2f7c3d18701ad275bdc232d00a21ca8baa

But looking at what's on master, I don't see the fix there. Am I looking in the wrong place? https://github.com/MaartenGr/BERTopic/blob/master/bertopic/backend/_spacy.py#L80-L92

The reason I ask is that I ran into this error today on version 0.14.1.

MaartenGr commented 1 year ago

@metasyn Could you create a reproducible example with 0.14.1? That way, it becomes a bit easier to see what exactly is happening here.

metasyn commented 1 year ago

Totally, I should've done that initially.

import sys
from typing import List

import bertopic
import cupy
import en_core_web_lg
import spacy

# This is required to ensure we're using CuPy/CUDA (i.e. the GPU)
spacy.require_gpu()

def log_versions():
    print(f"python version: {sys.version}")
    print(f"bertopic version: {bertopic.__version__}")
    print(f"spacy version: {spacy.__version__}")
    print(f"en_core_web_lg - spacy model version: {en_core_web_lg.__version__}")
    print(f"CUDA 11.7 - cupy version: {cupy.__version__}")

def get_sample_input():
    """From https://en.wikipedia.org/wiki/Golden_Gate_Bridge."""
    return """
        The Golden Gate Bridge is a suspension bridge spanning the Golden Gate, the
        one-mile-wide (1.6 km) strait connecting San Francisco Bay and the Pacific
        Ocean. The structure links the U.S. city of San Francisco, California—the
        northern tip of the San Francisco Peninsula—to Marin County, carrying both U.S.
        Route 101 and California State Route 1 across the strait. It also carries
        pedestrian and bicycle traffic, and is designated as part of U.S. Bicycle Route
        95. Recognized by the American Society of Civil Engineers as one of the Wonders
        of the Modern World,[7] the bridge is one of the most internationally
        recognized symbols of San Francisco and California.

        The idea of a fixed link between San Francisco and Marin had gained increasing
        popularity during the late 19th century, but it was not until the early 20th
        century that such a link became feasible. Joseph Strauss served as chief
        engineer for the project, with Leon Moisseiff, Irving Morrow and Charles Ellis
        making significant contributions to its design. The bridge opened to the public
        in 1937 and has undergone various retrofits and other improvement projects in
        the decades since.

        The Golden Gate Bridge is described in Frommer's travel guide as "possibly the
        most beautiful, certainly the most photographed, bridge in the world."[8][9] At
        the time of its opening in 1937, it was both the longest and the tallest
        suspension bridge in the world, titles it held until 1964 and 1998
        respectively. Its main span is 4,200 feet (1,280 m) and total height is 746
        feet (227 m).[10]

    """

def filter_func(doc: spacy.tokens.Doc) -> List[str]:
    return [
        token.lemma_.lower()
        for token in doc
        if len(token.text) > 2  # acronyms, typos
        and not token.is_stop  # stop words
        and not token.is_punct  # punctuation
    ]

def get_word_lists(nlp: spacy.Language, text: str) -> List[str]:
    return [" ".join(filter_func(s.as_doc())) for s in nlp(text).sents]

def repro():
    nlp = spacy.load("en_core_web_lg")
    text = get_sample_input()
    word_lists = get_word_lists(nlp, text)
    print(word_lists)

    # This is fine
    topic_model = bertopic.BERTopic(embedding_model=nlp)

    # The next line errors
    topics, _ = topic_model.fit_transform(word_lists)
    print(topics)

if __name__ == "__main__":
    log_versions()
    repro()

Gives me:

python version: 3.10.10 (main, Apr  3 2023, 08:04:30) [GCC 11.3.0]
bertopic version: 0.14.1
spacy version: 3.4.4
en_core_web_lg - spacy model version: 3.4.1
CUDA 11.7 - cupy version: 10.6.0
['\n         golden gate bridge suspension bridge span golden gate \n         mile wide 1.6 strait connect san francisco bay pacific \n         ocean', 'structure link u.s. city san francisco california \n         northern tip san francisco peninsula marin county carry u.s. \n         route 101 california state route strait', 'carry \n         pedestrian bicycle traffic designate u.s. bicycle route \n        ', 'recognize american society civil engineers wonders \n         modern world,[7 bridge internationally \n         recognize symbol san francisco california \n\n        ', 'idea fix link san francisco marin gain increase \n         popularity late 19th century early 20th \n         century link feasible', 'joseph strauss serve chief \n         engineer project leon moisseiff irving morrow charles ellis \n         make significant contribution design', 'bridge open public \n         1937 undergo retrofit improvement project \n         decade \n\n        ', 'golden gate bridge describe frommer travel guide possibly \n         beautiful certainly photograph bridge world "[8][9', '\n         time opening 1937 long tall \n         suspension bridge world title hold 1964 1998 \n         respectively', 'main span 4,200 foot 1,280 total height 746 \n         foot 227 m).[10 \n\n    ']
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[2], line 80
     78 if __name__ == "__main__":
     79     log_versions()
---> 80     repro()

Cell In[2], line 74, in repro()
     71 topic_model = bertopic.BERTopic(embedding_model=nlp)
     73 # The next line errors
---> 74 topics, _ = topic_model.fit_transform(word_lists)
     75 print(topics)

File /usr/local/lib/python3.10/site-packages/bertopic/_bertopic.py:344, in BERTopic.fit_transform(self, documents, embeddings, y)
    341 if embeddings is None:
    342     self.embedding_model = select_backend(self.embedding_model,
    343                                           language=self.language)
--> 344     embeddings = self._extract_embeddings(documents.Document,
    345                                           method="document",
    346                                           verbose=self.verbose)
    347     logger.info("Transformed documents to Embeddings")
    348 else:

File /usr/local/lib/python3.10/site-packages/bertopic/_bertopic.py:2828, in BERTopic._extract_embeddings(self, documents, method, verbose)
   2826     embeddings = self.embedding_model.embed_words(documents, verbose)
   2827 elif method == "document":
-> 2828     embeddings = self.embedding_model.embed_documents(documents, verbose)
   2829 else:
   2830     raise ValueError("Wrong method for extracting document/word embeddings. "
   2831                      "Either choose 'word' or 'document' as the method. ")

File /usr/local/lib/python3.10/site-packages/bertopic/backend/_base.py:69, in BaseEmbedder.embed_documents(self, document, verbose)
     55 def embed_documents(self,
     56                     document: List[str],
     57                     verbose: bool = False) -> np.ndarray:
     58     """ Embed a list of n words into an n-dimensional
     59     matrix of embeddings
     60
   (...)
     67         that each have an embeddings size of `m`
     68     """
---> 69     return self.embed(document, verbose)

File /usr/local/lib/python3.10/site-packages/bertopic/backend/_spacy.py:92, in SpacyBackend.embed(self, documents, verbose)
     90     for doc in tqdm(documents, position=0, leave=True, disable=not verbose):
     91         embeddings.append(self.embedding_model(doc or empty_document).vector)
---> 92     embeddings = np.array(embeddings)
     94 return embeddings

File cupy/_core/core.pyx:1397, in cupy._core.core.ndarray.__array__()

TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.

Are there additional details I can provide?

best, xander

metasyn commented 1 year ago

Oh, I realized I can simplify that a bit, here is a more minimal repro:

import sys

import bertopic
import cupy
import en_core_web_lg
import spacy

# This is required to ensure we're using CuPy/CUDA (i.e. the GPU)
spacy.require_gpu()

def log_versions():
    print(f"python version: {sys.version}")
    print(f"bertopic version: {bertopic.__version__}")
    print(f"spacy version: {spacy.__version__}")
    print(f"en_core_web_lg - spacy model version: {en_core_web_lg.__version__}")
    print(f"CUDA 11.7 - cupy version: {cupy.__version__}")

def get_word_lists():
    """From https://en.wikipedia.org/wiki/Golden_Gate_Bridge."""
    return """
        The Golden Gate Bridge is a suspension bridge spanning the Golden Gate, the
        one-mile-wide (1.6 km) strait connecting San Francisco Bay and the Pacific
        Ocean. The structure links the U.S. city of San Francisco, California—the
        northern tip of the San Francisco Peninsula—to Marin County, carrying both U.S.
        Route 101 and California State Route 1 across the strait. It also carries
        pedestrian and bicycle traffic, and is designated as part of U.S. Bicycle Route
        95. Recognized by the American Society of Civil Engineers as one of the Wonders
        of the Modern World,[7] the bridge is one of the most internationally
        recognized symbols of San Francisco and California.

        The idea of a fixed link between San Francisco and Marin had gained increasing
        popularity during the late 19th century, but it was not until the early 20th
        century that such a link became feasible. Joseph Strauss served as chief
        engineer for the project, with Leon Moisseiff, Irving Morrow and Charles Ellis
        making significant contributions to its design. The bridge opened to the public
        in 1937 and has undergone various retrofits and other improvement projects in
        the decades since.

        The Golden Gate Bridge is described in Frommer's travel guide as "possibly the
        most beautiful, certainly the most photographed, bridge in the world."[8][9] At
        the time of its opening in 1937, it was both the longest and the tallest
        suspension bridge in the world, titles it held until 1964 and 1998
        respectively. Its main span is 4,200 feet (1,280 m) and total height is 746
        feet (227 m).[10]

    """.split()

def repro():
    nlp = spacy.load("en_core_web_lg")
    word_lists = get_word_lists()
    print(word_lists)

    # This is fine
    topic_model = bertopic.BERTopic(embedding_model=nlp)

    # The next line errors
    topics, _ = topic_model.fit_transform(word_lists)
    print(topics)

if __name__ == "__main__":
    log_versions()
    repro()

MaartenGr commented 1 year ago

I am not getting the error when I run your code on a CPU. I believe that en_core_web_lg is actually a CPU-optimized model which might explain the error you are getting.

metasyn commented 1 year ago

I am also not getting the error when running on a CPU. It seems you had this fix in earlier:

https://github.com/MaartenGr/BERTopic/commit/a7927a2f7c3d18701ad275bdc232d00a21ca8baa#diff-06119c27943e751ff191ded5f03370df0e9e55afa3aeab96b8a2588ccb1cb6a0R97-R101

Is this an approach we could pursue?

MaartenGr commented 1 year ago

@metasyn Yeah, that should solve the issue I think. It's strange though, I think something went wrong with merging branches there. If you have the time and want to do a PR, that would be greatly appreciated. Otherwise, I might have some time in the coming weeks to look at this.
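For readers hitting the same error, the explicit conversion that the error message asks for can be sketched as follows. This is an illustration under stated assumptions, not necessarily the code from the referenced commit or PR. `to_numpy`, `embed_documents`, `FakeGpuVector`, `FakeDoc`, and `fake_nlp` are hypothetical names. The fake classes stand in for a GPU-backed spaCy pipeline so the sketch runs on a CPU, and the `" "` placeholder for empty documents is an assumption.

```python
from typing import Any, Callable, Iterable

import numpy as np


def to_numpy(vector: Any) -> np.ndarray:
    """Return `vector` as a NumPy array, copying from the GPU if needed."""
    if isinstance(vector, np.ndarray):
        return vector
    if hasattr(vector, "get"):
        # CuPy arrays expose .get() to copy device memory to the host.
        return np.asarray(vector.get())
    return np.asarray(vector)


def embed_documents(
    nlp: Callable[[str], Any],
    documents: Iterable[str],
    empty_document: str = " ",  # assumed placeholder for empty strings
) -> np.ndarray:
    """Embed each document and stack the vectors into one NumPy matrix."""
    vectors = [to_numpy(nlp(doc or empty_document).vector) for doc in documents]
    return np.array(vectors)


class FakeGpuVector:
    """CPU stand-in for a GPU array: only offers an explicit .get()."""

    def __init__(self, values):
        self._values = list(values)

    def get(self) -> np.ndarray:
        return np.array(self._values, dtype=float)


class FakeDoc:
    """Stand-in for a spaCy Doc, exposing only a .vector attribute."""

    def __init__(self, vector):
        self.vector = vector


def fake_nlp(text: str) -> FakeDoc:
    """Stand-in pipeline: a toy 2-d 'embedding' based on text length."""
    return FakeDoc(FakeGpuVector([len(text), 1.0]))


embeddings = embed_documents(fake_nlp, ["hello", ""])
```

With a real GPU pipeline in place of `fake_nlp`, the resulting array could be passed through the `embeddings` argument visible in the `fit_transform(self, documents, embeddings, y)` signature in the traceback above, for example `topic_model.fit_transform(docs, embeddings=embeddings)`. That should bypass the backend's own embedding step until a fixed release is available.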

metasyn commented 1 year ago

Sounds good: I've opened a PR here https://github.com/MaartenGr/BERTopic/pull/1179