Closed chrisammon3000 closed 1 year ago
I believe spaCy has changed the way it handles empty documents when creating vectors, which was not accounted for in BERTopic. I'll have to do some more research to see if the issue can be handled better.
I see you referenced a fix in this commit: https://github.com/MaartenGr/BERTopic/commit/a7927a2f7c3d18701ad275bdc232d00a21ca8baa
But looking at what's on master, I don't see the fix there. Am I looking in the wrong place? https://github.com/MaartenGr/BERTopic/blob/master/bertopic/backend/_spacy.py#L80-L92
The reason I ask is because I ran into this error today on version 0.14.1
@metasyn Could you create a reproducible example with 0.14.1? That way, it becomes a bit easier to see what exactly is happening here.
Totally, I should've done that initially.
import sys
from typing import List

import bertopic
import cupy
import en_core_web_lg
import spacy

# This is required to ensure we're using cupy/CUDA/GPUs
spacy.require_gpu()


def log_versions():
    print(f"python version: {sys.version}")
    print(f"bertopic version: {bertopic.__version__}")
    print(f"spacy version: {spacy.__version__}")
    print(f"en_core_web_lg - spacy model version: {en_core_web_lg.__version__}")
    print(f"CUDA 11.7 - cupy version: {cupy.__version__}")


def get_sample_input():
    """From https://en.wikipedia.org/wiki/Golden_Gate_Bridge."""
    return """
The Golden Gate Bridge is a suspension bridge spanning the Golden Gate, the
one-mile-wide (1.6 km) strait connecting San Francisco Bay and the Pacific
Ocean. The structure links the U.S. city of San Francisco, California—the
northern tip of the San Francisco Peninsula—to Marin County, carrying both U.S.
Route 101 and California State Route 1 across the strait. It also carries
pedestrian and bicycle traffic, and is designated as part of U.S. Bicycle Route
95. Recognized by the American Society of Civil Engineers as one of the Wonders
of the Modern World,[7] the bridge is one of the most internationally
recognized symbols of San Francisco and California.
The idea of a fixed link between San Francisco and Marin had gained increasing
popularity during the late 19th century, but it was not until the early 20th
century that such a link became feasible. Joseph Strauss served as chief
engineer for the project, with Leon Moisseiff, Irving Morrow and Charles Ellis
making significant contributions to its design. The bridge opened to the public
in 1937 and has undergone various retrofits and other improvement projects in
the decades since.
The Golden Gate Bridge is described in Frommer's travel guide as "possibly the
most beautiful, certainly the most photographed, bridge in the world."[8][9] At
the time of its opening in 1937, it was both the longest and the tallest
suspension bridge in the world, titles it held until 1964 and 1998
respectively. Its main span is 4,200 feet (1,280 m) and total height is 746
feet (227 m).[10]
"""


def filter_func(doc: spacy.tokens.Doc) -> List[str]:
    return [
        token.lemma_.lower()
        for token in doc
        if len(token.text) > 2  # acronyms, typos
        and not token.is_stop  # stop words
        and not token.is_punct  # punctuation
    ]


def get_word_lists(nlp: spacy.Language, text: str) -> List[str]:
    return [" ".join(filter_func(s.as_doc())) for s in nlp(text).sents]


def repro():
    nlp = spacy.load("en_core_web_lg")
    text = get_sample_input()
    word_lists = get_word_lists(nlp, text)
    print(word_lists)
    # This is fine
    topic_model = bertopic.BERTopic(embedding_model=nlp)
    # The next line errors
    topics, _ = topic_model.fit_transform(word_lists)
    print(topics)


if __name__ == "__main__":
    log_versions()
    repro()
Gives me:
python version: 3.10.10 (main, Apr 3 2023, 08:04:30) [GCC 11.3.0]
bertopic version: 0.14.1
spacy version: 3.4.4
en_core_web_lg - spacy model version: 3.4.1
CUDA 11.7 - cupy version: 10.6.0
['\n golden gate bridge suspension bridge span golden gate \n mile wide 1.6 strait connect san francisco bay pacific \n ocean', 'structure link u.s. city san francisco california \n northern tip san francisco peninsula marin county carry u.s. \n route 101 california state route strait', 'carry \n pedestrian bicycle traffic designate u.s. bicycle route \n ', 'recognize american society civil engineers wonders \n modern world,[7 bridge internationally \n recognize symbol san francisco california \n\n ', 'idea fix link san francisco marin gain increase \n popularity late 19th century early 20th \n century link feasible', 'joseph strauss serve chief \n engineer project leon moisseiff irving morrow charles ellis \n make significant contribution design', 'bridge open public \n 1937 undergo retrofit improvement project \n decade \n\n ', 'golden gate bridge describe frommer travel guide possibly \n beautiful certainly photograph bridge world "[8][9', '\n time opening 1937 long tall \n suspension bridge world title hold 1964 1998 \n respectively', 'main span 4,200 foot 1,280 total height 746 \n foot 227 m).[10 \n\n ']
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[2], line 80
78 if __name__ == "__main__":
79 log_versions()
---> 80 repro()
Cell In[2], line 74, in repro()
71 topic_model = bertopic.BERTopic(embedding_model=nlp)
73 # The next line errors
---> 74 topics, _ = topic_model.fit_transform(word_lists)
75 print(topics)
File /usr/local/lib/python3.10/site-packages/bertopic/_bertopic.py:344, in BERTopic.fit_transform(self, documents, embeddings, y)
341 if embeddings is None:
342 self.embedding_model = select_backend(self.embedding_model,
343 language=self.language)
--> 344 embeddings = self._extract_embeddings(documents.Document,
345 method="document",
346 verbose=self.verbose)
347 logger.info("Transformed documents to Embeddings")
348 else:
File /usr/local/lib/python3.10/site-packages/bertopic/_bertopic.py:2828, in BERTopic._extract_embeddings(self, documents, method, verbose)
2826 embeddings = self.embedding_model.embed_words(documents, verbose)
2827 elif method == "document":
-> 2828 embeddings = self.embedding_model.embed_documents(documents, verbose)
2829 else:
2830 raise ValueError("Wrong method for extracting document/word embeddings. "
2831 "Either choose 'word' or 'document' as the method. ")
File /usr/local/lib/python3.10/site-packages/bertopic/backend/_base.py:69, in BaseEmbedder.embed_documents(self, document, verbose)
55 def embed_documents(self,
56 document: List[str],
57 verbose: bool = False) -> np.ndarray:
58 """ Embed a list of n words into an n-dimensional
59 matrix of embeddings
60
(...)
67 that each have an embeddings size of `m`
68 """
---> 69 return self.embed(document, verbose)
File /usr/local/lib/python3.10/site-packages/bertopic/backend/_spacy.py:92, in SpacyBackend.embed(self, documents, verbose)
90 for doc in tqdm(documents, position=0, leave=True, disable=not verbose):
91 embeddings.append(self.embedding_model(doc or empty_document).vector)
---> 92 embeddings = np.array(embeddings)
94 return embeddings
File cupy/_core/core.pyx:1397, in cupy._core.core.ndarray.__array__()
TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.
Are there additional details I can provide?
best, xander
Oh, I realized I can simplify that a bit, here is a more minimal repro:
import sys

import bertopic
import cupy
import en_core_web_lg
import spacy

# This is required to ensure we're using cupy/CUDA/GPUs
spacy.require_gpu()


def log_versions():
    print(f"python version: {sys.version}")
    print(f"bertopic version: {bertopic.__version__}")
    print(f"spacy version: {spacy.__version__}")
    print(f"en_core_web_lg - spacy model version: {en_core_web_lg.__version__}")
    print(f"CUDA 11.7 - cupy version: {cupy.__version__}")


def get_word_lists():
    """From https://en.wikipedia.org/wiki/Golden_Gate_Bridge."""
    return """
The Golden Gate Bridge is a suspension bridge spanning the Golden Gate, the
one-mile-wide (1.6 km) strait connecting San Francisco Bay and the Pacific
Ocean. The structure links the U.S. city of San Francisco, California—the
northern tip of the San Francisco Peninsula—to Marin County, carrying both U.S.
Route 101 and California State Route 1 across the strait. It also carries
pedestrian and bicycle traffic, and is designated as part of U.S. Bicycle Route
95. Recognized by the American Society of Civil Engineers as one of the Wonders
of the Modern World,[7] the bridge is one of the most internationally
recognized symbols of San Francisco and California.
The idea of a fixed link between San Francisco and Marin had gained increasing
popularity during the late 19th century, but it was not until the early 20th
century that such a link became feasible. Joseph Strauss served as chief
engineer for the project, with Leon Moisseiff, Irving Morrow and Charles Ellis
making significant contributions to its design. The bridge opened to the public
in 1937 and has undergone various retrofits and other improvement projects in
the decades since.
The Golden Gate Bridge is described in Frommer's travel guide as "possibly the
most beautiful, certainly the most photographed, bridge in the world."[8][9] At
the time of its opening in 1937, it was both the longest and the tallest
suspension bridge in the world, titles it held until 1964 and 1998
respectively. Its main span is 4,200 feet (1,280 m) and total height is 746
feet (227 m).[10]
""".split()


def repro():
    nlp = spacy.load("en_core_web_lg")
    word_lists = get_word_lists()
    print(word_lists)
    # This is fine
    topic_model = bertopic.BERTopic(embedding_model=nlp)
    # The next line errors
    topics, _ = topic_model.fit_transform(word_lists)
    print(topics)


if __name__ == "__main__":
    log_versions()
    repro()
I am not getting the error when I run your code on a CPU. I believe that en_core_web_lg is actually a CPU-optimized model, which might explain the error you are getting.
I am also not getting the error when running on a CPU. It seems you had this fix in earlier:
Is this an approach we could pursue?
@metasyn Yeah, that should solve the issue I think. It's strange though, I think something went wrong with merging branches there. If you have the time and want to do a PR, that would be greatly appreciated. Otherwise, I might have some time in the coming weeks to look at this.
Sounds good: I've opened a PR here https://github.com/MaartenGr/BERTopic/pull/1179
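For readers following along, the general idea behind the approach discussed above can be sketched roughly as follows. This is a minimal illustration, not the code from the referenced commit or PR: the helper names `to_numpy` and `stack_vectors` are hypothetical, and it assumes only that GPU-backed vectors (such as CuPy arrays) expose a `.get()` method that copies them to host memory, as the error message in the traceback suggests.

```python
import numpy as np


def to_numpy(vector):
    """Return `vector` as a NumPy array, copying it off the GPU if needed.

    Hypothetical helper for illustration only.
    """
    # CuPy arrays expose `.get()`, which copies device memory to the host.
    # NumPy arrays have no `.get()`, so they pass through unchanged.
    if not isinstance(vector, np.ndarray) and hasattr(vector, "get"):
        vector = vector.get()
    return np.asarray(vector)


def stack_vectors(vectors):
    """Convert each per-document vector first, then build one 2-D array.

    Converting before calling `np.array` avoids the implicit
    device-to-host conversion that raised the TypeError above.
    """
    return np.array([to_numpy(v) for v in vectors])
```

In the spaCy backend, this would presumably mean converting each `doc.vector` before it is appended to the list that is later passed to `np.array`, so the same loop works on both CPU and GPU.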
I'm working with BERTopic in a GPU Colab notebook running Python 3.7.14, trying to perform topic modeling on a document consisting of a single string 11412 characters long:
Code:
Error result:
When I break the document up by groups of three sentences, I get the same error:
I also get the same error with the quickstart tutorial:
Is there some incompatibility in the environment, Python version or another package version like NumPy or CuPy that could be causing this? Or am I using it incorrectly?