[Bug]: An unexpected problem: abnormal values loaded from LanceDB

shaoqing404 commented 3 months ago

Do you need to file an issue?

[X] I have searched the existing issues and this bug is not already filed.
[ ] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
[X] I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the bug

Error message: Query vector size 2048 does not match index column size 1024. Context: querying locally through the graphrag command line directly works fine. Could this be related to some configuration?

Steps to reproduce

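Any knowledge graph built from more than 1 million characters of text has this problem. The modified configuration is as shown.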

Expected Behavior

This error should not occur.

GraphRAG Config Used


encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key:
  type: openai_chat # or azure_openai_chat
  model: deepseek-chat
  model_supports_json: true # recommended if this is available for your model.
  api_base: https://api.deepseek.com/v1
  max_tokens: 4000
  # request_timeout: 180.0
  # api_base: https://<instance>.openai.azure.com
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  max_retries: 10
  max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  async_mode: threaded # or asyncio
  llm:
    api_key: 
    type: openai_embedding # or azure_openai_embedding
    model: embedding-3
    api_base: https://open.bigmodel.cn/api/paas/v4
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    #max_retries: 10
    #max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 600
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents
  type: "chinese"

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: true # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  text_unit_prop: 0.5
  community_prop: 0.1
  conversation_history_max_turns: 5
  top_k_mapped_entities: 10
  top_k_relationships: 10
  max_tokens: 12000

global_search:
  max_tokens: 12000
  data_max_tokens: 12000
  map_max_tokens: 1000
  reduce_max_tokens: 2000
  concurrency: 32

Logs and screenshots

Additional Information

GraphRAG Version:0.2.1
Operating System:windows
Python Version:3.12
Related Issues:

KylinMountain commented 3 months ago

Are you using a different embedding model for querying than you used for indexing? Otherwise, how could the embedding lengths be inconsistent?
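
To illustrate the kind of mismatch being asked about, here is a minimal sketch that compares the length of a query embedding with the dimension the index was built with. The function name `dimension_mismatch` and the placeholder vectors are hypothetical and are not part of GraphRAG or LanceDB; only the numbers 2048 and 1024 come from the reported error.

```python
from typing import Optional, Sequence


def dimension_mismatch(query_vector: Sequence[float], index_dim: int) -> Optional[str]:
    """Return a description of the mismatch, or None if the sizes agree.

    Hypothetical helper for illustration only; not a GraphRAG or LanceDB API.
    """
    query_dim = len(query_vector)
    if query_dim == index_dim:
        return None
    return (
        f"query vector has {query_dim} dimensions but the index expects {index_dim}; "
        "the index and the query were probably embedded with different models or settings"
    )


# Placeholder data using the sizes from the error message:
# an index built with 1024-dimensional vectors, queried with a 2048-dimensional one.
index_dim = 1024
query_vector = [0.0] * 2048

problem = dimension_mismatch(query_vector, index_dim)
if problem is not None:
    print(problem)
```

If a check like this reports a mismatch, the usual remedies are to query with the same embedding model and dimension that were used at index time, or to rebuild the index with the model used for querying.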

KylinMountain commented 3 months ago

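Also, once settings.yaml is configured, you do not need to separately change the llm or embedding configuration in the webserver. The latest code automatically picks up the relevant configuration from settings.yaml.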

shaoqing404 commented 3 months ago

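> Also, once settings.yaml is configured, you do not need to separately change the llm or embedding configuration in the webserver. The latest code automatically picks up the relevant configuration from settings.yaml.

I'll take a look at the embedding; I'm using bge. Could it be an isolated case that settings.yaml cannot be read properly on Windows?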

shaoqing404 commented 3 months ago

The output gets cut off. After it pulls the results out of the map step, it easily hits the output limit.

KylinMountain commented 3 months ago

Consider setting max tokens for both local query and global query.

KylinMountain commented 3 months ago

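> > Also, once settings.yaml is configured, you do not need to separately change the llm or embedding configuration in the webserver. The latest code automatically picks up the relevant configuration from settings.yaml.
>
> I'll take a look at the embedding; I'm using bge. Could it be an isolated case that settings.yaml cannot be read properly on Windows?

It is probably an isolated case. I just deployed it on Windows today.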

KylinMountain / graphrag-server