Cinnamon / kotaemon

An open-source RAG-based tool for chatting with your documents.
https://cinnamon.github.io/kotaemon/
Apache License 2.0

[BUG] GraphRAG not using .env variables when indexing #394

Closed · EvelynBai closed this issue 1 week ago

EvelynBai commented 1 week ago

Description

I have reviewed previous issues and taken the suggested steps, such as `export $(cat .env | xargs)` and `dotenv run -- python app.py`. However, GraphRAG still uses the gpt-4-turbo-preview LLM instead of the GRAPHRAG_LLM_MODEL=gpt-4o-mini-2024-07-18 setting in my .env file.

My .env file:

OPENAI_API_BASE=https://api.openai.com/v1
OPENAI_API_KEY=
OPENAI_CHAT_MODEL=gpt-4o-mini-2024-07-18
OPENAI_EMBEDDINGS_MODEL=text-embedding-3-small

GRAPHRAG_API_KEY=
GRAPHRAG_LLM_MODEL=gpt-4o-mini-2024-07-18
GRAPHRAG_EMBEDDING_MODEL=text-embedding-3-small
GRAPHRAG_LLM_REQUEST_TIMEOUT=1800.0

indexing-engine.log (excerpt): 03:26:45,442 graphrag.index.cli INFO Using default configuration: { "llm": { "api_key": "==== REDACTED ====", "type": "openai_chat", "model": "gpt-4-turbo-preview", "max_tokens": 4000, "temperature": 0.0, "top_p": 1.0, "n": 1, "request_timeout": 1800.0, "api_base": null, "api_version": "2024-02-15-preview", "proxy": null, "cognitive_services_endpoint": null, "deployment_name": null, "model_supports_json": true, "tokens_per_minute": 0, "requests_per_minute": 0, "max_retries": 10, "max_retry_wait": 10.0, "sleep_on_rate_limit_recommendation": true, "concurrent_requests": 25 } ... }

Reproduction steps

1. conda activate kotaemon
2. cd kotaemon/
3. export $(cat .env | xargs)
4. dotenv run -- python app.py
5. Go to the 'Resources' panel, enter the model specifications (LLMs & Embeddings) correctly, click Test (success), and save
6. Go to the 'Files - GraphRAG Collection' panel, upload a file, and start indexing
7. View the logs; the LLM used is "gpt-4-turbo-preview"
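
As a quick sanity check (illustrative sketch, not part of kotaemon) that steps 3 and 4 actually exported the variables into the Python process:

import os

# Hypothetical check: print the GraphRAG overrides as this Python process sees them.
# If these print None, the values from .env never reached the process environment.
for name in ("GRAPHRAG_LLM_MODEL", "GRAPHRAG_EMBEDDING_MODEL"):
    print(name, "=", os.environ.get(name))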

Screenshots

No response

Logs

03:26:45,439 graphrag.index.cli INFO Logging enabled at /home/ws_0802/BYF/kotaemon/ktem_app_data/user_data/files/graphrag/7d5628be-fb4e-47d9-9892-90aaa8adce7f/output/indexing-engine.log
03:26:45,441 graphrag.index.cli INFO Starting pipeline run for: 20241015-032645, dryrun=False
03:26:45,442 graphrag.index.cli INFO Using default configuration: {
    "llm": {
        "api_key": "==== REDACTED ====",
        "type": "openai_chat",
        "model": "gpt-4-turbo-preview",
        "max_tokens": 4000,
        "temperature": 0.0,
        "top_p": 1.0,
        "n": 1,
        "request_timeout": 1800.0,
        "api_base": null,
        "api_version": "2024-02-15-preview",
        "proxy": null,
        "cognitive_services_endpoint": null,
        "deployment_name": null,
        "model_supports_json": true,
        "tokens_per_minute": 0,
        "requests_per_minute": 0,
        "max_retries": 10,
        "max_retry_wait": 10.0,
        "sleep_on_rate_limit_recommendation": true,
        "concurrent_requests": 25
    },
    "parallelization": {
        "stagger": 0.3,
        "num_threads": 50
    },
    "async_mode": "threaded",
    "root_dir": "/home/ws_0802/BYF/kotaemon/ktem_app_data/user_data/files/graphrag/7d5628be-fb4e-47d9-9892-90aaa8adce7f",
    "reporting": {
        "type": "file",
        "base_dir": "/home/ws_0802/BYF/kotaemon/ktem_app_data/user_data/files/graphrag/7d5628be-fb4e-47d9-9892-90aaa8adce7f/output",
        "storage_account_blob_url": null
    },
    "storage": {
        "type": "file",
        "base_dir": "/home/ws_0802/BYF/kotaemon/ktem_app_data/user_data/files/graphrag/7d5628be-fb4e-47d9-9892-90aaa8adce7f/output",
        "storage_account_blob_url": null
    },
    "cache": {
        "type": "file",
        "base_dir": "cache",
        "storage_account_blob_url": null
    },
    "input": {
        "type": "file",
        "file_type": "text",
        "base_dir": "input",
        "storage_account_blob_url": null,
        "encoding": "utf-8",
        "file_pattern": ".*\\.txt$",
        "file_filter": null,
        "source_column": null,
        "timestamp_column": null,
        "timestamp_format": null,
        "text_column": "text",
        "title_column": null,
        "document_attribute_columns": []
    },
    "embed_graph": {
        "enabled": false,
        "num_walks": 10,
        "walk_length": 40,
        "window_size": 2,
        "iterations": 3,
        "random_seed": 597832,
        "strategy": null
    },
    "embeddings": {
        "llm": {
            "api_key": "==== REDACTED ====",
            "type": "openai_embedding",
            "model": "text-embedding-3-small",
            "max_tokens": 4000,
            "temperature": 0,
            "top_p": 1,
            "n": 1,
            "request_timeout": 180.0,
            "api_base": null,
            "api_version": null,
            "proxy": null,
            "cognitive_services_endpoint": null,
            "deployment_name": null,
            "model_supports_json": null,
            "tokens_per_minute": 0,
            "requests_per_minute": 0,
            "max_retries": 10,
            "max_retry_wait": 10.0,
            "sleep_on_rate_limit_recommendation": true,
            "concurrent_requests": 25
        },
        "parallelization": {
            "stagger": 0.3,
            "num_threads": 50
        },
        "async_mode": "threaded",
        "batch_size": 16,
        "batch_max_tokens": 8191,
        "target": "required",
        "skip": [],
        "vector_store": null,
        "strategy": null
    },
    "chunks": {
        "size": 1200,
        "overlap": 100,
        "group_by_columns": [
            "id"
        ],
        "strategy": null,
        "encoding_model": null
    },
    "snapshots": {
        "graphml": false,
        "raw_entities": false,
        "top_level_nodes": false
    },
    "entity_extraction": {
        "llm": {
            "api_key": "==== REDACTED ====",
            "type": "openai_chat",
            "model": "gpt-4-turbo-preview",
            "max_tokens": 4000,
            "temperature": 0.0,
            "top_p": 1.0,
            "n": 1,
            "request_timeout": 1800.0,
            "api_base": null,
            "api_version": "2024-02-15-preview",
            "proxy": null,
            "cognitive_services_endpoint": null,
            "deployment_name": null,
            "model_supports_json": true,
            "tokens_per_minute": 0,
            "requests_per_minute": 0,
            "max_retries": 10,
            "max_retry_wait": 10.0,
            "sleep_on_rate_limit_recommendation": true,
            "concurrent_requests": 25
        },
        "parallelization": {
            "stagger": 0.3,
            "num_threads": 50
        },
        "async_mode": "threaded",
        "prompt": "prompts/entity_extraction.txt",
        "entity_types": [
            "organization",
            "person",
            "geo",
            "event"
        ],
        "max_gleanings": 1,
        "strategy": null,
        "encoding_model": null
    },
    "summarize_descriptions": {
        "llm": {
            "api_key": "==== REDACTED ====",
            "type": "openai_chat",
            "model": "gpt-4-turbo-preview",
            "max_tokens": 4000,
            "temperature": 0.0,
            "top_p": 1.0,
            "n": 1,
            "request_timeout": 1800.0,
            "api_base": null,
            "api_version": "2024-02-15-preview",
            "proxy": null,
            "cognitive_services_endpoint": null,
            "deployment_name": null,
            "model_supports_json": true,
            "tokens_per_minute": 0,
            "requests_per_minute": 0,
            "max_retries": 10,
            "max_retry_wait": 10.0,
            "sleep_on_rate_limit_recommendation": true,
            "concurrent_requests": 25
        },
        "parallelization": {
            "stagger": 0.3,
            "num_threads": 50
        },
        "async_mode": "threaded",
        "prompt": "prompts/summarize_descriptions.txt",
        "max_length": 500,
        "strategy": null
    },
    "community_reports": {
        "llm": {
            "api_key": "==== REDACTED ====",
            "type": "openai_chat",
            "model": "gpt-4-turbo-preview",
            "max_tokens": 4000,
            "temperature": 0.0,
            "top_p": 1.0,
            "n": 1,
            "request_timeout": 1800.0,
            "api_base": null,
            "api_version": "2024-02-15-preview",
            "proxy": null,
            "cognitive_services_endpoint": null,
            "deployment_name": null,
            "model_supports_json": true,
            "tokens_per_minute": 0,
            "requests_per_minute": 0,
            "max_retries": 10,
            "max_retry_wait": 10.0,
            "sleep_on_rate_limit_recommendation": true,
            "concurrent_requests": 25
        },
        "parallelization": {
            "stagger": 0.3,
            "num_threads": 50
        },
        "async_mode": "threaded",
        "prompt": "prompts/community_report.txt",
        "max_length": 2000,
        "max_input_length": 8000,
        "strategy": null
    },
    "claim_extraction": {
        "llm": {
            "api_key": "==== REDACTED ====",
            "type": "openai_chat",
            "model": "gpt-4-turbo-preview",
            "max_tokens": 4000,
            "temperature": 0.0,
            "top_p": 1.0,
            "n": 1,
            "request_timeout": 1800.0,
            "api_base": null,
            "api_version": "2024-02-15-preview",
            "proxy": null,
            "cognitive_services_endpoint": null,
            "deployment_name": null,
            "model_supports_json": true,
            "tokens_per_minute": 0,
            "requests_per_minute": 0,
            "max_retries": 10,
            "max_retry_wait": 10.0,
            "sleep_on_rate_limit_recommendation": true,
            "concurrent_requests": 25
        },
        "parallelization": {
            "stagger": 0.3,
            "num_threads": 50
        },
        "async_mode": "threaded",
        "enabled": false,
        "prompt": "prompts/claim_extraction.txt",
        "description": "Any claims or facts that could be relevant to information discovery.",
        "max_gleanings": 1,
        "strategy": null,
        "encoding_model": null
    },
    "cluster_graph": {
        "max_cluster_size": 10,
        "strategy": null
    },
    "umap": {
        "enabled": false
    },
    "local_search": {
        "text_unit_prop": 0.5,
        "community_prop": 0.1,
        "conversation_history_max_turns": 5,
        "top_k_entities": 10,
        "top_k_relationships": 10,
        "temperature": 0.0,
        "top_p": 1.0,
        "n": 1,
        "max_tokens": 12000,
        "llm_max_tokens": 2000
    },
    "global_search": {
        "temperature": 0.0,
        "top_p": 1.0,
        "n": 1,
        "max_tokens": 12000,
        "data_max_tokens": 12000,
        "map_max_tokens": 1000,
        "reduce_max_tokens": 2000,
        "concurrency": 32
    },
    "encoding_model": "cl100k_base",
    "skip_workflows": []
}
03:26:45,443 graphrag.index.create_pipeline_config INFO skipping workflows 
03:26:45,443 graphrag.index.run.run INFO Running pipeline
03:26:45,443 graphrag.index.storage.file_pipeline_storage INFO Creating file storage at /home/ws_0802/BYF/kotaemon/ktem_app_data/user_data/files/graphrag/7d5628be-fb4e-47d9-9892-90aaa8adce7f/output
03:26:45,443 graphrag.index.input.load_input INFO loading input from root_dir=input
03:26:45,443 graphrag.index.input.load_input INFO using file storage for input
03:26:45,443 graphrag.index.storage.file_pipeline_storage INFO search /home/ws_0802/BYF/kotaemon/ktem_app_data/user_data/files/graphrag/7d5628be-fb4e-47d9-9892-90aaa8adce7f/input for files matching .*\.txt$
03:26:45,443 graphrag.index.input.text INFO found text files from input, found [('0cfd34bb-e3c0-4394-a55f-52a44d09e962.txt', {})]
03:26:45,444 graphrag.index.input.text INFO Found 1 files, loading 1
03:26:45,445 graphrag.index.workflows.load INFO Workflow Run Order: ['create_base_text_units', 'create_base_extracted_entities', 'create_summarized_entities', 'create_base_entity_graph', 'create_final_entities', 'create_final_nodes', 'create_final_communities', 'create_final_relationships', 'create_final_text_units', 'create_final_community_reports', 'create_base_documents', 'create_final_documents']
03:26:45,446 graphrag.index.run.run INFO Final # of rows loaded: 1
03:26:45,594 graphrag.index.run.workflow INFO dependencies for create_base_text_units: []
03:26:45,594 datashaper.workflow.workflow INFO executing verb orderby
03:26:45,595 datashaper.workflow.workflow INFO executing verb zip
03:26:45,595 datashaper.workflow.workflow INFO executing verb aggregate_override
03:26:45,597 datashaper.workflow.workflow INFO executing verb chunk
03:26:45,733 datashaper.workflow.workflow INFO executing verb select
03:26:45,733 datashaper.workflow.workflow INFO executing verb unroll
03:26:45,734 datashaper.workflow.workflow INFO executing verb rename
03:26:45,735 datashaper.workflow.workflow INFO executing verb genid
03:26:45,735 datashaper.workflow.workflow INFO executing verb unzip
03:26:45,736 datashaper.workflow.workflow INFO executing verb copy
03:26:45,736 datashaper.workflow.workflow INFO executing verb filter
03:26:45,740 graphrag.index.emit.parquet_table_emitter INFO emitting parquet table create_base_text_units.parquet
03:26:45,892 graphrag.index.run.workflow INFO dependencies for create_base_extracted_entities: ['create_base_text_units']
03:26:45,892 graphrag.utils.storage INFO read table from storage: create_base_text_units.parquet
03:26:45,897 datashaper.workflow.workflow INFO executing verb entity_extract
03:26:45,897 graphrag.llm.openai.create_openai_client INFO Creating OpenAI client base_url=None
03:26:45,916 graphrag.index.llm.load_llm INFO create TPM/RPM limiter for gpt-4-turbo-preview: TPM=0, RPM=0
03:26:45,916 graphrag.index.llm.load_llm INFO create concurrency limiter for gpt-4-turbo-preview: 25
03:26:46,135 graphrag.index.reporting.file_workflow_callbacks INFO Error Invoking LLM details={'input': '\n-Goal-\nGiven a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities.\n \n-Steps-\n1. Identify all entities. For each identified entity, extract the following information:\n- entity_name: Name of the entity, capitalized\n- entity_type: One of the following types: [organization,person,geo,event]\n- entity_description: Comprehensive description of the entity\'s attributes and activities\nFormat each entity as ("entity"<|><entity_name><|><entity_type><|><entity_description>)\n \n2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.\nFor each pair of related entities, extract the following information:\n- source_entity: name of the source entity, as identified in step 1\n- target_entity: name of the target entity, as identified in step 1\n- relationship_description: explanation as to why you think the source entity and the target entity are related to each other\n- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity\n Format each relationship as ("relationship"<|><source_entity><|><target_entity><|><relationship_description><|><relationship_strength>)\n \n3. Return output in English as a single list of all the entities and relationships identified in steps 1 and 2. Use **##** as the list delimiter.\n \n4. When finished, output <|COMPLETE|>\n \n######################\n-Examples-\n######################\nExample 1:\nEntity_types: ORGANIZATION,PERSON\nText:\nThe Verdantis\'s Central Institution is scheduled to meet on Monday and Thursday, with the institution planning to release its latest policy decision on Thursday at 1:30 p.m. PDT, followed by a press conference where Central Institution Chair Martin Smith will take questions. Investors expect the Market Strategy Committee to hold its benchmark interest rate steady in a range of 3.5%-3.75%.\n######################\nOutput:\n("entity"<|>CENTRAL INSTITUTION<|>ORGANIZATION<|>The Central Institution is the Federal Reserve of Verdantis, which is setting interest rates on Monday and Thursday)\n##\n("entity"<|>MARTIN SMITH<|>PERSON<|>Martin Smith is the chair of the Central Institution)\n##\n("entity"<|>MARKET STRATEGY COMMITTEE<|>ORGANIZATION<|>The Central Institution committee makes key decisions about interest rates and the growth of Verdantis\'s money supply)\n##\n("relationship"<|>MARTIN SMITH<|>CENTRAL INSTITUTION<|>Martin Smith is the Chair of the Central Institution and will answer questions at a press conference<|>9)\n<|COMPLETE|>\n\n######################\nExample 2:\nEntity_types: ORGANIZATION\nText:\nTechGlobal\'s (TG) stock skyrocketed in its opening day on the Global Exchange Thursday. But IPO experts warn that the semiconductor corporation\'s debut on the public markets isn\'t indicative of how other newly listed companies may perform.\n\nTechGlobal, a formerly public company, was taken private by Vision Holdings in 2014. 
The well-established chip designer says it powers 85% of premium smartphones.\n######################\nOutput:\n("entity"<|>TECHGLOBAL<|>ORGANIZATION<|>TechGlobal is a stock now listed on the Global Exchange which powers 85% of premium smartphones)\n##\n("entity"<|>VISION HOLDINGS<|>ORGANIZATION<|>Vision Holdings is a firm that previously owned TechGlobal)\n##\n("relationship"<|>TECHGLOBAL<|>VISION HOLDINGS<|>Vision Holdings formerly owned TechGlobal from 2014 until present<|>5)\n<|COMPLETE|>\n\n######################\nExample 3:\nEntity_types: ORGANIZATION,GEO,PERSON\nText:\nFive Aurelians jailed for 8 years in Firuzabad and widely regarded as hostages are on their way home to Aurelia.\n\nThe swap orchestrated by Quintara was finalized when $8bn of Firuzi funds were transferred to financial institutions in Krohaara, the capital of Quintara.\n\nThe exchange initiated in Firuzabad\'s capital, Tiruzia, led to the four men and one woman, who are also Firuzi nationals, boarding a chartered flight to Krohaara.\n\nThey were welcomed by senior Aurelian officials and are now on their way to Aurelia\'s capital, Cashion.\n\nThe Aurelians include 39-year-old businessman Samuel Namara, who has been held in Tiruzia\'s Alhamia Prison, as well as journalist Durke Bataglani, 59, and environmentalist Meggie Tazbah, 53, who also holds Bratinas nationality.\n######################\nOutput:\n("entity"<|>FIRUZABAD<|>GEO<|>Firuzabad held Aurelians as hostages)\n##\n("entity"<|>AURELIA<|>GEO<|>Country seeking to release hostages)\n##\n("entity"<|>QUINTARA<|>GEO<|>Country that negotiated a swap of money in exchange for hostages)\n##\n##\n("entity"<|>TIRUZIA<|>GEO<|>Capital of Firuzabad where the Aurelians were being held)\n##\n("entity"<|>KROHAARA<|>GEO<|>Capital city in Quintara)\n##\n("entity"<|>CASHION<|>GEO<|>Capital city in Aurelia)\n##\n("entity"<|>SAMUEL NAMARA<|>PERSON<|>Aurelian who spent time in Tiruzia\'s Alhamia Prison)\n##\n("entity"<|>ALHAMIA PRISON<|>GEO<|>Prison in Tiruzia)\n##\n("entity"<|>DURKE BATAGLANI<|>PERSON<|>Aurelian journalist who was held hostage)\n##\n("entity"<|>MEGGIE TAZBAH<|>PERSON<|>Bratinas national and environmentalist who was held hostage)\n##\n("relationship"<|>FIRUZABAD<|>AURELIA<|>Firuzabad negotiated a hostage exchange with Aurelia<|>2)\n##\n("relationship"<|>QUINTARA<|>AURELIA<|>Quintara brokered the hostage exchange between Firuzabad and Aurelia<|>2)\n##\n("relationship"<|>QUINTARA<|>FIRUZABAD<|>Quintara brokered the hostage exchange between Firuzabad and Aurelia<|>2)\n##\n("relationship"<|>SAMUEL NAMARA<|>ALHAMIA PRISON<|>Samuel Namara was a prisoner at Alhamia prison<|>8)\n##\n("relationship"<|>SAMUEL NAMARA<|>MEGGIE TAZBAH<|>Samuel Namara and Meggie Tazbah were exchanged in the same hostage release<|>2)\n##\n("relationship"<|>SAMUEL NAMARA<|>DURKE BATAGLANI<|>Samuel Namara and Durke Bataglani were exchanged in the same hostage release<|>2)\n##\n("relationship"<|>MEGGIE TAZBAH<|>DURKE BATAGLANI<|>Meggie Tazbah and Durke Bataglani were exchanged in the same hostage release<|>2)\n##\n("relationship"<|>SAMUEL NAMARA<|>FIRUZABAD<|>Samuel Namara was a hostage in Firuzabad<|>2)\n##\n("relationship"<|>MEGGIE TAZBAH<|>FIRUZABAD<|>Meggie Tazbah was a hostage in Firuzabad<|>2)\n##\n("relationship"<|>DURKE BATAGLANI<|>FIRUZABAD<|>Durke Bataglani was a hostage in Firuzabad<|>2)\n<|COMPLETE|>\n\n######################\n-Real Data-\n######################\nEntity_types: organization,person,geo,event\nText: Finance refers to monetary resources and to the study and discipline of money, 
currency, assets and liabilities.As a subject of study, it is related to but distinct from economics, which is the study of the production, distribution, and consumption of goods and services.Based on the scope of financial activities in financial systems, the discipline can be divided into personal, corporate, and public finance.\n######################\nOutput:'}
03:26:47,505 graphrag.index.reporting.file_workflow_callbacks INFO Error Invoking LLM details={'input': '\n-Goal-\nGiven a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities.\n \n-Steps-\n1. Identify all entities. For each identified entity, extract the following information:\n- entity_name: Name of the entity, capitalized\n- entity_type: One of the following types: [organization,person,geo,event]\n- entity_description: Comprehensive description of the entity\'s attributes and activities\nFormat each entity as ("entity"<|><entity_name><|><entity_type><|><entity_description>)\n \n2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.\nFor each pair of related entities, extract the following information:\n- source_entity: name of the source entity, as identified in step 1\n- target_entity: name of the target entity, as identified in step 1\n- relationship_description: explanation as to why you think the source entity and the target entity are related to each other\n- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity\n Format each relationship as ("relationship"<|><source_entity><|><target_entity><|><relationship_description><|><relationship_strength>)\n \n3. Return output in English as a single list of all the entities and relationships identified in steps 1 and 2. Use **##** as the list delimiter.\n \n4. When finished, output <|COMPLETE|>\n \n######################\n-Examples-\n######################\nExample 1:\nEntity_types: ORGANIZATION,PERSON\nText:\nThe Verdantis\'s Central Institution is scheduled to meet on Monday and Thursday, with the institution planning to release its latest policy decision on Thursday at 1:30 p.m. PDT, followed by a press conference where Central Institution Chair Martin Smith will take questions. Investors expect the Market Strategy Committee to hold its benchmark interest rate steady in a range of 3.5%-3.75%.\n######################\nOutput:\n("entity"<|>CENTRAL INSTITUTION<|>ORGANIZATION<|>The Central Institution is the Federal Reserve of Verdantis, which is setting interest rates on Monday and Thursday)\n##\n("entity"<|>MARTIN SMITH<|>PERSON<|>Martin Smith is the chair of the Central Institution)\n##\n("entity"<|>MARKET STRATEGY COMMITTEE<|>ORGANIZATION<|>The Central Institution committee makes key decisions about interest rates and the growth of Verdantis\'s money supply)\n##\n("relationship"<|>MARTIN SMITH<|>CENTRAL INSTITUTION<|>Martin Smith is the Chair of the Central Institution and will answer questions at a press conference<|>9)\n<|COMPLETE|>\n\n######################\nExample 2:\nEntity_types: ORGANIZATION\nText:\nTechGlobal\'s (TG) stock skyrocketed in its opening day on the Global Exchange Thursday. But IPO experts warn that the semiconductor corporation\'s debut on the public markets isn\'t indicative of how other newly listed companies may perform.\n\nTechGlobal, a formerly public company, was taken private by Vision Holdings in 2014. 
The well-established chip designer says it powers 85% of premium smartphones.\n######################\nOutput:\n("entity"<|>TECHGLOBAL<|>ORGANIZATION<|>TechGlobal is a stock now listed on the Global Exchange which powers 85% of premium smartphones)\n##\n("entity"<|>VISION HOLDINGS<|>ORGANIZATION<|>Vision Holdings is a firm that previously owned TechGlobal)\n##\n("relationship"<|>TECHGLOBAL<|>VISION HOLDINGS<|>Vision Holdings formerly owned TechGlobal from 2014 until present<|>5)\n<|COMPLETE|>\n\n######################\nExample 3:\nEntity_types: ORGANIZATION,GEO,PERSON\nText:\nFive Aurelians jailed for 8 years in Firuzabad and widely regarded as hostages are on their way home to Aurelia.\n\nThe swap orchestrated by Quintara was finalized when $8bn of Firuzi funds were transferred to financial institutions in Krohaara, the capital of Quintara.\n\nThe exchange initiated in Firuzabad\'s capital, Tiruzia, led to the four men and one woman, who are also Firuzi nationals, boarding a chartered flight to Krohaara.\n\nThey were welcomed by senior Aurelian officials and are now on their way to Aurelia\'s capital, Cashion.\n\nThe Aurelians include 39-year-old businessman Samuel Namara, who has been held in Tiruzia\'s Alhamia Prison, as well as journalist Durke Bataglani, 59, and environmentalist Meggie Tazbah, 53, who also holds Bratinas nationality.\n######################\nOutput:\n("entity"<|>FIRUZABAD<|>GEO<|>Firuzabad held Aurelians as hostages)\n##\n("entity"<|>AURELIA<|>GEO<|>Country seeking to release hostages)\n##\n("entity"<|>QUINTARA<|>GEO<|>Country that negotiated a swap of money in exchange for hostages)\n##\n##\n("entity"<|>TIRUZIA<|>GEO<|>Capital of Firuzabad where the Aurelians were being held)\n##\n("entity"<|>KROHAARA<|>GEO<|>Capital city in Quintara)\n##\n("entity"<|>CASHION<|>GEO<|>Capital city in Aurelia)\n##\n("entity"<|>SAMUEL NAMARA<|>PERSON<|>Aurelian who spent time in Tiruzia\'s Alhamia Prison)\n##\n("entity"<|>ALHAMIA PRISON<|>GEO<|>Prison in Tiruzia)\n##\n("entity"<|>DURKE BATAGLANI<|>PERSON<|>Aurelian journalist who was held hostage)\n##\n("entity"<|>MEGGIE TAZBAH<|>PERSON<|>Bratinas national and environmentalist who was held hostage)\n##\n("relationship"<|>FIRUZABAD<|>AURELIA<|>Firuzabad negotiated a hostage exchange with Aurelia<|>2)\n##\n("relationship"<|>QUINTARA<|>AURELIA<|>Quintara brokered the hostage exchange between Firuzabad and Aurelia<|>2)\n##\n("relationship"<|>QUINTARA<|>FIRUZABAD<|>Quintara brokered the hostage exchange between Firuzabad and Aurelia<|>2)\n##\n("relationship"<|>SAMUEL NAMARA<|>ALHAMIA PRISON<|>Samuel Namara was a prisoner at Alhamia prison<|>8)\n##\n("relationship"<|>SAMUEL NAMARA<|>MEGGIE TAZBAH<|>Samuel Namara and Meggie Tazbah were exchanged in the same hostage release<|>2)\n##\n("relationship"<|>SAMUEL NAMARA<|>DURKE BATAGLANI<|>Samuel Namara and Durke Bataglani were exchanged in the same hostage release<|>2)\n##\n("relationship"<|>MEGGIE TAZBAH<|>DURKE BATAGLANI<|>Meggie Tazbah and Durke Bataglani were exchanged in the same hostage release<|>2)\n##\n("relationship"<|>SAMUEL NAMARA<|>FIRUZABAD<|>Samuel Namara was a hostage in Firuzabad<|>2)\n##\n("relationship"<|>MEGGIE TAZBAH<|>FIRUZABAD<|>Meggie Tazbah was a hostage in Firuzabad<|>2)\n##\n("relationship"<|>DURKE BATAGLANI<|>FIRUZABAD<|>Durke Bataglani was a hostage in Firuzabad<|>2)\n<|COMPLETE|>\n\n######################\n-Real Data-\n######################\nEntity_types: organization,person,geo,event\nText: Finance refers to monetary resources and to the study and discipline of money, 
currency, assets and liabilities.As a subject of study, it is related to but distinct from economics, which is the study of the production, distribution, and consumption of goods and services.Based on the scope of financial activities in financial systems, the discipline can be divided into personal, corporate, and public finance.\n######################\nOutput:'}
03:26:54,916 httpx INFO HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
03:26:54,921 graphrag.llm.base.rate_limiting_llm INFO perf - llm.chat "Process" with 2 retries took 5.213590767234564. input_tokens=1811, output_tokens=132
03:26:56,126 httpx INFO HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
03:26:56,128 graphrag.llm.base.rate_limiting_llm INFO perf - llm.chat "extract-continuation-0" with 0 retries took 1.2053578100167215. input_tokens=34, output_tokens=5
03:26:56,131 datashaper.workflow.workflow INFO executing verb merge_graphs
03:26:56,135 graphrag.index.emit.parquet_table_emitter INFO emitting parquet table create_base_extracted_entities.parquet
03:26:56,294 graphrag.index.run.workflow INFO dependencies for create_summarized_entities: ['create_base_extracted_entities']
03:26:56,294 graphrag.utils.storage INFO read table from storage: create_base_extracted_entities.parquet
03:26:56,297 datashaper.workflow.workflow INFO executing verb summarize_descriptions
03:26:56,299 graphrag.index.emit.parquet_table_emitter INFO emitting parquet table create_summarized_entities.parquet
03:26:56,448 graphrag.index.run.workflow INFO dependencies for create_base_entity_graph: ['create_summarized_entities']
03:26:56,448 graphrag.utils.storage INFO read table from storage: create_summarized_entities.parquet
03:26:56,451 datashaper.workflow.workflow INFO executing verb cluster_graph
03:26:56,455 datashaper.workflow.workflow INFO executing verb select
03:26:56,456 graphrag.index.emit.parquet_table_emitter INFO emitting parquet table create_base_entity_graph.parquet
03:26:56,605 graphrag.index.run.workflow INFO dependencies for create_final_entities: ['create_base_entity_graph']
03:26:56,605 graphrag.utils.storage INFO read table from storage: create_base_entity_graph.parquet
03:26:56,608 datashaper.workflow.workflow INFO executing verb unpack_graph
03:26:56,609 datashaper.workflow.workflow INFO executing verb rename
03:26:56,610 datashaper.workflow.workflow INFO executing verb select
03:26:56,610 datashaper.workflow.workflow INFO executing verb dedupe
03:26:56,610 datashaper.workflow.workflow INFO executing verb rename
03:26:56,611 datashaper.workflow.workflow INFO executing verb filter
03:26:56,613 datashaper.workflow.workflow INFO executing verb text_split
03:26:56,613 datashaper.workflow.workflow INFO executing verb drop
03:26:56,614 datashaper.workflow.workflow INFO executing verb merge
03:26:56,615 datashaper.workflow.workflow INFO executing verb text_embed
03:26:56,615 graphrag.llm.openai.create_openai_client INFO Creating OpenAI client base_url=None
03:26:56,634 graphrag.index.llm.load_llm INFO create TPM/RPM limiter for text-embedding-3-small: TPM=0, RPM=0
03:26:56,634 graphrag.index.llm.load_llm INFO create concurrency limiter for text-embedding-3-small: 25
03:26:56,634 graphrag.index.verbs.text.embed.strategies.openai INFO embedding 2 inputs via 2 snippets using 1 batches. max_batch_size=16, max_tokens=8191
03:26:56,849 graphrag.index.reporting.file_workflow_callbacks INFO Error Invoking LLM details={'input': ['FINANCE:Finance encompasses the management, creation, and study of money, banking, credit, investments, assets, and liabilities', 'ECONOMICS:Economics is a social science concerned with the production, distribution, and consumption of goods and services']}
03:26:58,110 graphrag.index.reporting.file_workflow_callbacks INFO Error Invoking LLM details={'input': ['FINANCE:Finance encompasses the management, creation, and study of money, banking, credit, investments, assets, and liabilities', 'ECONOMICS:Economics is a social science concerned with the production, distribution, and consumption of goods and services']}
03:27:01,410 httpx INFO HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
03:27:01,471 graphrag.llm.base.rate_limiting_llm INFO perf - llm.embedding "Process" with 2 retries took 0.761308525223285. input_tokens=48, output_tokens=0
03:27:01,476 datashaper.workflow.workflow INFO executing verb drop
03:27:01,477 datashaper.workflow.workflow INFO executing verb filter
03:27:01,482 graphrag.index.emit.parquet_table_emitter INFO emitting parquet table create_final_entities.parquet
03:27:01,647 graphrag.index.run.workflow INFO dependencies for create_final_nodes: ['create_base_entity_graph']
03:27:01,648 graphrag.utils.storage INFO read table from storage: create_base_entity_graph.parquet
03:27:01,651 datashaper.workflow.workflow INFO executing verb layout_graph
03:27:01,653 datashaper.workflow.workflow INFO executing verb unpack_graph
03:27:01,654 datashaper.workflow.workflow INFO executing verb unpack_graph
03:27:01,655 datashaper.workflow.workflow INFO executing verb drop
03:27:01,655 datashaper.workflow.workflow INFO executing verb filter
03:27:01,657 datashaper.workflow.workflow INFO executing verb select
03:27:01,658 datashaper.workflow.workflow INFO executing verb rename
03:27:01,658 datashaper.workflow.workflow INFO executing verb join
03:27:01,662 datashaper.workflow.workflow INFO executing verb convert
03:27:01,663 datashaper.workflow.workflow INFO executing verb rename
03:27:01,664 graphrag.index.emit.parquet_table_emitter INFO emitting parquet table create_final_nodes.parquet
03:27:01,820 graphrag.index.run.workflow INFO dependencies for create_final_communities: ['create_base_entity_graph']
03:27:01,820 graphrag.utils.storage INFO read table from storage: create_base_entity_graph.parquet
03:27:01,823 datashaper.workflow.workflow INFO executing verb create_final_communities
03:27:01,832 graphrag.index.emit.parquet_table_emitter INFO emitting parquet table create_final_communities.parquet
03:27:01,983 graphrag.index.run.workflow INFO dependencies for create_final_relationships: ['create_final_nodes', 'create_base_entity_graph']
03:27:01,984 graphrag.utils.storage INFO read table from storage: create_final_nodes.parquet
03:27:01,988 graphrag.utils.storage INFO read table from storage: create_base_entity_graph.parquet
03:27:01,991 datashaper.workflow.workflow INFO executing verb create_final_relationships_pre_embedding
03:27:01,992 datashaper.workflow.workflow INFO executing verb create_final_relationships_post_embedding
03:27:01,996 graphrag.index.emit.parquet_table_emitter INFO emitting parquet table create_final_relationships.parquet
03:27:02,150 graphrag.index.run.workflow INFO dependencies for create_final_text_units: ['create_base_text_units', 'create_final_relationships', 'create_final_entities']
03:27:02,150 graphrag.utils.storage INFO read table from storage: create_base_text_units.parquet
03:27:02,153 graphrag.utils.storage INFO read table from storage: create_final_relationships.parquet
03:27:02,156 graphrag.utils.storage INFO read table from storage: create_final_entities.parquet
03:27:02,158 datashaper.workflow.workflow INFO executing verb create_final_text_units_pre_embedding
03:27:02,166 datashaper.workflow.workflow INFO executing verb select
03:27:02,167 graphrag.index.emit.parquet_table_emitter INFO emitting parquet table create_final_text_units.parquet
03:27:02,320 graphrag.index.run.workflow INFO dependencies for create_final_community_reports: ['create_final_nodes', 'create_final_relationships']
03:27:02,320 graphrag.utils.storage INFO read table from storage: create_final_nodes.parquet
03:27:02,324 graphrag.utils.storage INFO read table from storage: create_final_relationships.parquet
03:27:02,326 datashaper.workflow.workflow INFO executing verb prepare_community_reports_nodes
03:27:02,327 datashaper.workflow.workflow INFO executing verb prepare_community_reports_edges
03:27:02,328 datashaper.workflow.workflow INFO executing verb restore_community_hierarchy
03:27:02,330 datashaper.workflow.workflow INFO executing verb prepare_community_reports
03:27:02,331 graphrag.index.verbs.graph.report.prepare_community_reports INFO Number of nodes at level=0 => 2
03:27:02,341 datashaper.workflow.workflow INFO executing verb create_community_reports
03:27:14,63 httpx INFO HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
03:27:14,65 graphrag.llm.base.rate_limiting_llm INFO perf - llm.chat "create_community_report" with 0 retries took 11.720063698943704. input_tokens=2001, output_tokens=364
03:27:14,66 datashaper.workflow.workflow INFO executing verb window
03:27:14,68 graphrag.index.emit.parquet_table_emitter INFO emitting parquet table create_final_community_reports.parquet
03:27:14,222 graphrag.index.run.workflow INFO dependencies for create_base_documents: ['create_final_text_units']
03:27:14,223 graphrag.utils.storage INFO read table from storage: create_final_text_units.parquet
03:27:14,226 datashaper.workflow.workflow INFO executing verb unroll
03:27:14,227 datashaper.workflow.workflow INFO executing verb select
03:27:14,228 datashaper.workflow.workflow INFO executing verb rename
03:27:14,228 datashaper.workflow.workflow INFO executing verb join
03:27:14,231 datashaper.workflow.workflow INFO executing verb aggregate_override
03:27:14,232 datashaper.workflow.workflow INFO executing verb join
03:27:14,236 datashaper.workflow.workflow INFO executing verb rename
03:27:14,236 datashaper.workflow.workflow INFO executing verb convert
03:27:14,237 graphrag.index.emit.parquet_table_emitter INFO emitting parquet table create_base_documents.parquet
03:27:14,389 graphrag.index.run.workflow INFO dependencies for create_final_documents: ['create_base_documents']
03:27:14,390 graphrag.utils.storage INFO read table from storage: create_base_documents.parquet
03:27:14,393 datashaper.workflow.workflow INFO executing verb rename
03:27:14,394 graphrag.index.emit.parquet_table_emitter INFO emitting parquet table create_final_documents.parquet
03:27:14,417 graphrag.index.cli INFO All workflows completed successfully.

Browsers

Chrome

OS

Linux

Additional information

No response

taprosoft commented 1 week ago

@EvelynBai In this case you have to set USE_CUSTOMIZED_GRAPHRAG_SETTING=true and specify your custom model in settings.yaml.example: https://github.com/Cinnamon/kotaemon#setup-graphrag

Currently this is an issue where GraphRAG does not respect GRAPHRAG_LLM_MODEL and GRAPHRAG_EMBEDDING_MODEL and instead uses the default model names in its config file.

We will figure out how to make the setup more seamless in the future with a custom GraphRAG implementation.
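
For reference, here is a minimal sketch of what that can look like (assuming a settings.yaml.example that follows GraphRAG's standard layout; the exact keys in the repo may differ):

# .env
USE_CUSTOMIZED_GRAPHRAG_SETTING=true

# settings.yaml.example (GraphRAG-style layout, abridged)
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat
  model: gpt-4o-mini-2024-07-18
embeddings:
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding
    model: text-embedding-3-small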

ronchengang commented 1 week ago

The reason for this problem is that the GraphRAG indexing command is invoked through a subprocess, and with this calling method the environment variables from .env are not passed to the command running in the subprocess.

Here is my solution, for your reference:

  1. Prepare a dict of environment variables.
  2. Put the variables that need to be passed to GraphRAG into the dict, reading their values from .env.
  3. Pass the dict to the subprocess through the env parameter, as in the snippet below.
# Get env values from the .env file (config here is python-decouple's config)
import os
import subprocess
from decouple import config

# Merge the current environment (PATH, API keys, etc.) with the GraphRAG override from .env
env = {
    **os.environ,
    "GRAPHRAG_LLM_MODEL": config("GRAPHRAG_LLM_MODEL", default="gpt-4-turbo-preview"),
}

# Run the command and stream stdout (command and Document come from the surrounding pipeline)
with subprocess.Popen(command, stdout=subprocess.PIPE, text=True, env=env) as process:
    if process.stdout:
        for line in process.stdout:
            yield Document(channel="debug", text=line)
EvelynBai commented 1 week ago

> @EvelynBai In this case you have to set USE_CUSTOMIZED_GRAPHRAG_SETTING=true and specify your custom model in settings.yaml.example: https://github.com/Cinnamon/kotaemon#setup-graphrag
>
> Currently this is an issue where GraphRAG does not respect GRAPHRAG_LLM_MODEL and GRAPHRAG_EMBEDDING_MODEL and instead uses the default model names in its config file.
>
> We will figure out how to make the setup more seamless in the future with a custom GraphRAG implementation.

Thanks! This solved my problem.

EvelynBai commented 1 week ago

> The reason for this problem is that the GraphRAG indexing command is invoked through a subprocess, and with this calling method the environment variables from .env are not passed to the command running in the subprocess.
>
> Here is my solution, for your reference: [...]

I'll try it out. Thank you!