chatchat-space / Langchain-Chatchat

Langchain-Chatchat (formerly Langchain-ChatGLM): a local-knowledge-based RAG and Agent application built with Langchain and LLMs such as ChatGLM, Qwen, and Llama.
Apache License 2.0
31.82k stars 5.55k forks

How do I modify the code to use the ali_text_splitter? #3845

Closed — zmwstu closed this issue 6 months ago

zmwstu commented 6 months ago

How do I modify the code to use the ali_text_splitter? I would also like to run the splitting on GPU; GPUs 0 through 7 are all available and can be used at the same time.

dosubot[bot] commented 6 months ago

To use the AliTextSplitter in your code and accelerate it with a GPU, follow these steps:

  1. Import AliTextSplitter (note: it lives in the project's own text_splitter package, not inside langchain)

    from text_splitter.ali_text_splitter import AliTextSplitter
  2. Instantiate AliTextSplitter

    ali_text_splitter = AliTextSplitter(pdf=True)  # pdf=True enables extra whitespace cleanup for PDF-extracted text
  3. Call split_text to split the text

    text = "your text here"
    split_texts = ali_text_splitter.split_text(text)
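Putting the three steps above together, here is a minimal, self-contained sketch. The import path assumes the project-local text_splitter package is on sys.path; when it is not available, the except-branch substitutes a purely illustrative fallback (not what AliTextSplitter actually does) so the snippet stays runnable:

```python
# Minimal sketch of the three steps above. The project-local import path
# is an assumption; the fallback splitter is hypothetical and only keeps
# the example self-contained.
try:
    from text_splitter.ali_text_splitter import AliTextSplitter
    splitter = AliTextSplitter(pdf=False)
    split_text = splitter.split_text
except ImportError:
    def split_text(text):
        # Illustrative fallback: split on Chinese full stops.
        return [s + "。" for s in text.split("。") if s]

chunks = split_text("第一句。第二句。")
print(chunks)
```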

As for GPU usage: AliTextSplitter runs on the CPU by default. To run it on a GPU, specify the device parameter in the underlying pipeline call. For example, to use a single GPU (GPU 0):

from modelscope.pipelines import pipeline  # AliTextSplitter builds on ModelScope's pipeline API

p = pipeline(
    task="document-segmentation",
    model="damo/nlp_bert_document-segmentation_chinese-base",
    device="cuda:0")  # run on GPU 0
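For context, ali_text_splitter.py in this repo appears to post-process the pipeline output by splitting on "\n\t" separators; the separator is taken from a reading of the repo source, so treat it as an assumption. A dependency-free sketch of that step:

```python
# Sketch of the post-processing step in the project's ali_text_splitter.py:
# the document-segmentation pipeline returns the segmented document as one
# string whose segments are separated by "\n\t"; split it back into a list
# of non-empty sentences. (Separator is an assumption from the repo source.)
def to_sentences(result_text: str) -> list:
    return [s for s in result_text.split("\n\t") if s]

print(to_sentences("第一句。\n\t第二句。\n\t"))
```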

If you want to use several GPUs at once (e.g. GPUs 0 through 7), you need a more elaborate parallel or distributed setup. Libraries such as PyTorch or TensorFlow provide the primitives for distributing work across multiple GPUs. Concretely, this means splitting the input into parts, processing each part on a different GPU, and then merging the results. That goes beyond a simple change to the pipeline call and has to be designed around your specific workload.
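The split-process-merge idea described above can be sketched with plain data partitioning. This is an illustrative sketch, not project code: each worker would build its own pipeline pinned to one device (e.g. `device=f"cuda:{gpu_idx}"`), process its shard, and return the pieces; the actual worker spawning (multiprocessing, Ray, etc.) is deliberately left out:

```python
from typing import List

NUM_GPUS = 8  # GPUs 0-7, as in the question

def shard(texts: List[str], n: int = NUM_GPUS) -> List[List[str]]:
    """Round-robin partition of texts into n shards, one per GPU."""
    shards = [[] for _ in range(n)]
    for i, t in enumerate(texts):
        shards[i % n].append(t)
    return shards

def merge(shards: List[List[str]]) -> List[str]:
    """Undo shard(): interleave per-GPU results back into original order."""
    merged = []
    i = 0
    while any(shards):
        bucket = shards[i % len(shards)]
        if bucket:
            merged.append(bucket.pop(0))
        i += 1
    return merged
```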

#### Details

**Sources**
- [text_splitter现在支持了嘛 (is text_splitter supported now)](https://github.com/chatchat-space/Langchain-Chatchat/issues/1086)
- [text_splitter/ali_text_splitter.py](https://github.com/chatchat-space/Langchain-Chatchat/blob/master/text_splitter/ali_text_splitter.py)
**About Dosu** This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.

zmwstu commented 6 months ago


How should the kb_config.py file in configs be modified?

dosubot[bot] commented 6 months ago

To modify the kb_config.py file in the configs directory for using AliTextSplitter with GPU acceleration, you'll need to adjust settings related to the device configuration for the pipeline function. Specifically, you should look for any parameters or settings within kb_config.py that define the device used for computations (e.g., device="cpu").

Change the device parameter to use a GPU by specifying device="cuda:x", where x is the index of the GPU you want to use; for example, device="cuda:0" selects the first GPU. Note that a bare device="cuda" selects the current default GPU (normally GPU 0) — it does not automatically spread the workload across all GPUs. Multi-GPU execution requires something like torch.nn.DataParallel, DistributedDataParallel, or manual sharding of the input.

Here's a general example of how you might adjust the device setting for GPU usage:

# Example modification in kb_config.py
# Change the device setting from CPU to GPU
device = "cuda:0"  # use the first GPU
# or
device = "cuda"  # use the current default GPU (normally GPU 0)

Ensure that any part of your code, which initializes or calls the pipeline function for AliTextSplitter, references this configuration to apply the GPU settings. If kb_config.py directly influences the instantiation of AliTextSplitter or the pipeline, ensure these settings are correctly applied there.

Without specific details of kb_config.py, this guidance assumes that the file contains or influences settings related to device configuration for text processing tasks. If the file structure or its role in your project differs significantly, please provide more details for targeted advice.
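One practical refinement of the config change above is to validate the configured device string before handing it to the pipeline. This is a sketch under the "cpu" / "cuda" / "cuda:<index>" convention used above; in practice gpu_count would come from torch.cuda.device_count() (torch is not imported here to keep the sketch dependency-free):

```python
# Sketch: normalize a configured device string, falling back to CPU when
# no GPU is present and to GPU 0 when the requested index does not exist.
def resolve_device(configured: str, gpu_count: int) -> str:
    if not configured.startswith("cuda"):
        return configured      # e.g. "cpu"
    if gpu_count == 0:
        return "cpu"           # no GPU available
    if ":" in configured:
        idx = int(configured.split(":", 1)[1])
        if idx >= gpu_count:
            return "cuda:0"    # requested GPU does not exist
    return configured
```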

#### Details

**Sources**
- [text_splitter/ali_text_splitter.py](https://github.com/chatchat-space/Langchain-Chatchat/blob/master/text_splitter/ali_text_splitter.py)