MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Trying to use Zephyr for representation but getting RuntimeError: Expected all tensors to be on the same device #2134

Closed shivamtawari closed 2 months ago

shivamtawari commented 2 months ago

Have you searched existing issues? 🔎

Describe the bug

I am facing this issue when trying to use Zephyr as the representation model.

2024-08-30 10:44:11,684 - BERTopic - Dimensionality - Completed ✓
2024-08-30 10:44:11,688 - BERTopic - Cluster - Start clustering the reduced embeddings
/usr/local/lib/python3.10/dist-packages/joblib/externals/loky/backend/fork_exec.py:38: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  pid = os.fork()
2024-08-30 10:44:17,485 - BERTopic - Cluster - Completed ✓
2024-08-30 10:44:17,498 - BERTopic - Representation - Extracting topics from clusters using representation models.
  0%|          | 0/66 [00:08<?, ?it/s]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-11-5511f129a54a> in <cell line: 16>()
     14 )
     15 
---> 16 topics, probs = topic_model.fit_transform(docs, embeddings)

13 frames
/usr/local/lib/python3.10/dist-packages/transformers/generation/logits_process.py in __call__(self, input_ids, scores)
    351     @add_start_docstrings(LOGITS_PROCESSOR_INPUTS_DOCSTRING)
    352     def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
--> 353         score = torch.gather(scores, 1, input_ids)
    354 
    355         # if score < 0 then repetition penalty has to be multiplied to reduce the token probabilities

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA_gather)
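
For context, the failing call in the traceback is `torch.gather(scores, 1, input_ids)`; the same error can be reproduced in isolation whenever those two tensors live on different devices (shapes and values below are just placeholders):

import torch

# Minimal illustration of the mismatch the traceback points at: the logits
# (scores) sit on the GPU while the prompt ids stayed on the CPU, so the
# repetition-penalty gather fails.
scores = torch.randn(1, 32000, device="cuda:0")  # model logits on GPU
input_ids = torch.tensor([[1, 2, 3]])            # token ids left on CPU
torch.gather(scores, 1, input_ids)               # raises the same RuntimeError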

Reproduction

!pip install bertopic
!pip install pandas
!pip uninstall -y numpy
!pip install numpy==1.26.4
!pip install spacy
!pip install ctransformers[cuda]
!pip install --upgrade git+https://github.com/huggingface/transformers

%%capture

!pip install cudf-cu12 dask-cudf-cu12 --extra-index-url=https://pypi.nvidia.com
!pip install cuml-cu12 --extra-index-url=https://pypi.nvidia.com
!pip install cugraph-cu12 --extra-index-url=https://pypi.nvidia.com
!pip install cupy-cuda12x -f https://pip.cupy.dev/aarch64

import pandas as pd
import numpy as np
import torch
from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP
from sentence_transformers import SentenceTransformer
from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer, pipeline
from bertopic.representation import TextGeneration

df = pd.read_csv('data.csv')
docs = df['body'].tolist()

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(docs, show_progress_bar=True)

umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0,  metric='cosine', random_state=42)
hdbscan_model = HDBSCAN(min_samples=100, gen_min_span_tree=True, prediction_data=True, metric='euclidean', cluster_selection_method='eom')

from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model = CountVectorizer(stop_words="english", min_df=2, ngram_range=(1, 2))

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/zephyr-7B-alpha-GGUF",
    model_file="zephyr-7b-alpha.Q4_K_M.gguf",
    model_type="mistral",
    gpu_layers=50,
    hf=True
    #context_length=512,
    #max_new_tokens=512
)

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-alpha")

prompt = """<|system|>You are a helpful, respectful and honest assistant for labeling topics..</s>
<|user|>
I have a topic that contains the following documents:
[DOCUMENTS]

The topic is described by the following keywords: '[KEYWORDS]'.

Based on the information about the topic above, please create a short label of this topic. Make sure to only return the label and nothing more.</s>
<|assistant|>"""

# Pipeline
generator = pipeline(
    model=model, tokenizer=tokenizer,
    task='text-generation',
    max_new_tokens=50,
    repetition_penalty=1.1,
    device=device
)

zephyr = TextGeneration(generator, prompt=prompt, doc_length=10, tokenizer="char")
representation_model = {"Zephyr": zephyr}

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_model,
    top_n_words=10,
    calculate_probabilities=False,
    verbose=True
)

topics, probs = topic_model.fit_transform(docs, embeddings)

BERTopic Version

v0.16.3

shivamtawari commented 2 months ago

If I don't pass `device=device` to the pipeline, it warns me that the model will be placed on the CPU.

generator = pipeline(
    model=model, tokenizer=tokenizer,
    task='text-generation',
    max_new_tokens=50,
    repetition_penalty=1.1,
    device=device
)
MaartenGr commented 2 months ago

Have you tried following the documentation here? It shows that you do not have to use a device at all since the correct one should be automatically detected. Also, note that it might be easier to use LlamaCPP instead, for which you can find a tutorial here.

shivamtawari commented 2 months ago

I have already tried following the Zephyr documentation, but the code results in:

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU

Code:

from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer, pipeline
#import torch

#device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/zephyr-7B-alpha-GGUF",
    model_file="zephyr-7b-alpha.Q4_K_M.gguf",
    model_type="mistral",
    gpu_layers=50,
    hf=True
)

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-alpha")

prompt = """<|system|>You are a helpful, respectful and honest assistant for labeling topics..</s>
<|user|>
I have a topic that contains the following documents:
[DOCUMENTS]

The topic is described by the following keywords: '[KEYWORDS]'.

Based on the information about the topic above, please create a short label of this topic. Make sure to only return the label and nothing more.</s>
<|assistant|>"""

# Pipeline
generator = pipeline(
    model=model, tokenizer=tokenizer,
    task='text-generation',
    max_new_tokens=50,
    repetition_penalty=1.1,
    #device=device
)

Output:

/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: 
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
Fetching 1 files: 100%
 1/1 [00:00<00:00,  4.51it/s]
config.json: 100%
 31.0/31.0 [00:00<00:00, 1.94kB/s]
Fetching 1 files: 100%
 1/1 [00:32<00:00, 32.80s/it]
zephyr-7b-alpha.Q4_K_M.gguf: 100%
 4.37G/4.37G [00:32<00:00, 143MB/s]
tokenizer_config.json: 100%
 1.43k/1.43k [00:00<00:00, 75.0kB/s]
tokenizer.model: 100%
 493k/493k [00:00<00:00, 1.54MB/s]
tokenizer.json: 100%
 1.80M/1.80M [00:00<00:00, 6.66MB/s]
added_tokens.json: 100%
 42.0/42.0 [00:00<00:00, 2.55kB/s]
special_tokens_map.json: 100%
 168/168 [00:00<00:00, 8.60kB/s]
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
MaartenGr commented 2 months ago

Hmmm, I'm not entirely sure what is happening. It might be worthwhile to check the official Transformers documentation to see how you could enable this properly. You can test it outside of BERTopic since BERTopic simply calls the pipeline and nothing more.
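
For example, something along these lines (re-using the generator and prompt from your reproduction; the filled-in text is just a placeholder) should show whether the error appears without BERTopic in the loop:

# Standalone check outside BERTopic: fill in the prompt placeholders and call
# the pipeline directly. If the cross-device error shows up here as well,
# BERTopic is not the cause.
test_prompt = (
    prompt
    .replace("[DOCUMENTS]", "- a sample document about football transfers")
    .replace("[KEYWORDS]", "football, transfer, club")
)
output = generator(test_prompt)
print(output[0]["generated_text"])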

Note that I would advise using llama.cpp python instead. It should make all of this much easier.

shivamtawari commented 2 months ago

Thanks @MaartenGr! I was able to use llama.cpp. I will also check the official Transformers documentation and update here if I find anything new.
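
For reference, the llama-cpp-python route from the tutorial replaces the ctransformers/pipeline part with something along these lines (the model path and parameters below are illustrative, not necessarily the exact setup used here):

from llama_cpp import Llama
from bertopic.representation import LlamaCPP

# Illustrative sketch: load the quantized GGUF model with llama-cpp-python,
# offload all layers to the GPU, and pass it to BERTopic's LlamaCPP
# representation together with the same [DOCUMENTS]/[KEYWORDS] prompt.
llm = Llama(
    model_path="zephyr-7b-alpha.Q4_K_M.gguf",  # local path to the GGUF file
    n_gpu_layers=-1,                           # -1 offloads every layer to the GPU
    n_ctx=4096,
)
representation_model = {"Zephyr": LlamaCPP(llm, prompt=prompt)}

The representation_model dict is then passed to BERTopic exactly as in the reproduction above, and no transformers pipeline or device argument is involved.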