MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Failed to import bertopic after installing cuML #1039

Closed AlfanDindaR closed 1 year ago

AlfanDindaR commented 1 year ago

Hi @MaartenGr, I get an error when using cuML. This is the error I found:

TypeError                                 Traceback (most recent call last)
[<ipython-input-2-08334298937f>](https://localhost:8080/#) in <module>
----> 1 import bertopic

12 frames
[/usr/local/lib/python3.8/dist-packages/google/protobuf/descriptor.py](https://localhost:8080/#) in __new__(cls, name, full_name, index, number, type, cpp_type, label, default_value, message_type, enum_type, containing_type, is_extension, extension_scope, options, serialized_options, has_default_value, containing_oneof, json_name, file, create_key)
    558                 has_default_value=True, containing_oneof=None, json_name=None,
    559                 file=None, create_key=None):  # pylint: disable=redefined-builtin
--> 560       _message.Message._CheckCalledFromGeneratedFile()
    561       if is_extension:
    562         return _message.default_pool.FindExtensionByName(full_name)

TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.11.0 requires protobuf<3.20,>=3.9.2, but you have protobuf 3.20.3 which is incompatible.
cudf-cu11 23.2.0 requires protobuf==4.21, but you have protobuf 3.20.3 which is incompatible.

This error points to a protobuf conflict: bertopic needs protobuf >= 3.9 but cuML needs protobuf==4.21. How can I solve this issue?

MaartenGr commented 1 year ago

Could you share the full error message? I believe there are some frames missing that might be relevant. Also, BERTopic by itself does not use tensorflow but pytorch, so if you are not using a tensorflow-based embedding model, it should be no problem to set protobuf to 4.21.
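
For example, something along these lines might work (an untested sketch; the 4.21.x pin is an assumption based on the constraint cudf-cu11 reported above, so adjust it to whatever your installed cudf/cuml release actually requires):

!pip install --quiet "protobuf==4.21.*"  # assumed pin to satisfy cudf-cu11; adjust to your environment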

beckernick commented 1 year ago

RAPIDS pip packages are not compatible with the Tensorflow pip package due to Tensorflow's protobuf constraint, as noted on the RAPIDS pip page. To use cuML with Tensorflow, I recommend using conda environments and installing Tensorflow from the conda-forge channel (which relaxes the constraint) or a Docker container containing a Tensorflow package without the tight constraint (such as the one linked on the RAPIDS website).
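
For reference, a conda-based setup along those lines might look roughly like this (an untested sketch; the environment name and version pins are assumptions, so use the install selector on the RAPIDS website for the exact command):

# untested sketch: versions are assumptions, check the RAPIDS install selector
conda create -n rapids-bertopic -c rapidsai -c conda-forge -c nvidia \
    cuml=23.02 python=3.8 cudatoolkit=11.8
conda activate rapids-bertopic
conda install -c conda-forge tensorflow  # conda-forge build relaxes the protobuf constraint
pip install bertopic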

AlfanDindaR commented 1 year ago

> Could you share the full error message? I believe there are some frames missing that might be relevant. Also, BERTopic by itself does not use tensorflow but pytorch, so if you are not using a tensorflow-based embedding model, it should be no problem to set protobuf to 4.21.

This is the error I found when importing bertopic after installing cuML, @MaartenGr:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
[<ipython-input-3-08334298937f>](https://localhost:8080/#) in <module>
----> 1 import bertopic

12 frames
[/usr/local/lib/python3.8/dist-packages/google/protobuf/descriptor.py](https://localhost:8080/#) in __new__(cls, name, full_name, index, number, type, cpp_type, label, default_value, message_type, enum_type, containing_type, is_extension, extension_scope, options, serialized_options, has_default_value, containing_oneof, json_name, file, create_key)
    558                 has_default_value=True, containing_oneof=None, json_name=None,
    559                 file=None, create_key=None):  # pylint: disable=redefined-builtin
--> 560       _message.Message._CheckCalledFromGeneratedFile()
    561       if is_extension:
    562         return _message.default_pool.FindExtensionByName(full_name)

TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates

AlfanDindaR commented 1 year ago

But after downgrading cuML to version 22.12 it works well:

!pip install --quiet cudf-cu11==22.12 dask-cudf-cu11 --extra-index-url=https://pypi.nvidia.com
!pip install --quiet cuml-cu11==22.12 --extra-index-url=https://pypi.nvidia.com

but I found an error again while training the topic model:

Batches: 100%
32/32 [00:08<00:00, 8.27it/s]
2023-02-27 03:13:31,999 - BERTopic - Transformed documents to Embeddings
2023-02-27 03:13:32,121 - BERTopic - The dimensionality reduction algorithm did not contain the `y` parameter and therefore the `y` parameter was not used

This is my HDBSCAN and UMAP code:

from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

cluster_model = HDBSCAN(
    min_cluster_size=10,
    metric='euclidean',
    cluster_selection_method='eom',
    prediction_data=True
)  # Clustering Model

umap_model = UMAP(
    n_neighbors=15,
    n_components=5,
    min_dist=0.0,
)  # Dimensionality Reduction Model

AlfanDindaR commented 1 year ago

Sorry to ask again, @MaartenGr. Can we use the cuML library to create a topic model with version 0.14?

MaartenGr commented 1 year ago

Yes, you should be able to use cuml with BERTopic 0.14.
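
For reference, wiring the cuML models into BERTopic looks roughly like this (a minimal sketch; `docs` stands in for your own list of documents, and the parameter values simply mirror the ones you posted above):

from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

# cuML's UMAP and HDBSCAN act as drop-in replacements for the CPU implementations
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0)
cluster_model = HDBSCAN(min_cluster_size=10, metric='euclidean',
                        cluster_selection_method='eom', prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=cluster_model)
topics, probs = topic_model.fit_transform(docs)  # docs: your list of input documents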

MaartenGr commented 1 year ago

Due to inactivity, I'll be closing this issue. Let me know if you want me to re-open it!