Open xorsuyash opened 3 months ago
Hello, I can contribute!
@xorsuyash please populate a new sheet with English PDFs that you'll use for KG creation
@harshaharod21 Please find the older documentation links by Suyash here:
@xorsuyash @GautamR-Samagra Hi, can I work on these tasks? I have worked on identifying similar questions from a dataset of Quora queries, so this issue is aligned with my interests.
@xorsuyash Link to the repo where the given issue is implemented https://github.com/harshaharod21/qa_raptor
Note that for now I have used Llama as the LLM, not OpenAI.
@xorsuyash
Can we create a question-answer set on these PDFs first:
Listing out PDFs to start with here:
Next steps:
- Figure out how to extract and visualize the created KG from the parquet files
- Figure out if GraphRAG supports providing an initial ontology while creating the graph
- Figure out how the querying engine works for global and local search:
- Are they creating Cypher queries?
- Are they doing some vector search on the entities/nodes?
- Use the Kharif book (first 247 pages) for testing once the code is clear.
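On the first point above (extracting and visualizing the created KG from the parquet files), a minimal sketch of loading GraphRAG's output tables into NetworkX. The artifact file names and column names (`title`, `source`, `target`) are assumptions based on one GraphRAG version's output layout — verify against your own artifacts directory:

```python
# Sketch: turn GraphRAG's entity/relationship parquet artifacts into a
# NetworkX graph for inspection or drawing. Column names are assumptions.
import pandas as pd
import networkx as nx

def build_kg(entities: pd.DataFrame, relationships: pd.DataFrame) -> nx.Graph:
    """Build an undirected graph: one node per entity, one edge per relationship."""
    g = nx.Graph()
    for _, row in entities.iterrows():
        g.add_node(row["title"])
    for _, row in relationships.iterrows():
        g.add_edge(row["source"], row["target"])
    return g

# Usage (hypothetical paths, depends on your GraphRAG run):
# entities = pd.read_parquet("output/artifacts/create_final_entities.parquet")
# relationships = pd.read_parquet("output/artifacts/create_final_relationships.parquet")
# g = build_kg(entities, relationships)
# nx.draw(g, with_labels=True, node_size=50, font_size=6)  # or export to GraphML/Gephi
```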
1) I have figured out how to visualize the KG; there are three ways, given in the issue raised:
2) I looked at the source code but cannot find a way to include base entities, except for prompt auto-tuning, where we can provide the domain for entity extraction. A similar issue has also been raised for the same: [Feature Request]: Prompt Tuning with given entities · Issue #1010 · microsoft/graphrag
Update: This is the response I got on the issue where I commented about base entities: "In the settings.yaml there is the entity_extraction part that contains the entity_types field where you can specify the types of entities you want the LLM to extract, but they are only taken into consideration more as a suggestion when indexing and completely ignored when prompt tuning."
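For reference, a sketch of the `settings.yaml` fragment that response refers to. Per the quote, `entity_types` is only a suggestion to the indexing LLM and is ignored by prompt tuning; the exact layout and the example types (chosen for our agriculture domain) are assumptions and may differ by GraphRAG version:

```yaml
# Hypothetical fragment -- check your GraphRAG version's settings reference.
entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [crop, pest, disease, fertilizer, practice]
  max_gleanings: 1
```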
3) Indexing and querying: The indexing pipeline is configurable; it is composed of workflows, standard and custom steps, prompt templates, and input/output adapters. The pipeline is designed to:
Querying: For local search, they have a vector store (LanceDB), so it likely uses vector-based similarity search: microsoft.github.io/graphrag/posts/query/notebooks/local_search_nb/
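To make the local-search idea concrete, a minimal sketch of the vector-similarity step: embed the query, rank stored entity embeddings by cosine similarity, and take the top-k. The real pipeline uses LanceDB and an embedding model; both are replaced here by plain NumPy arrays just to show the ranking logic:

```python
# Sketch of vector-based similarity retrieval (stand-in for LanceDB lookup).
import numpy as np

def top_k_entities(query_vec: np.ndarray, entity_vecs: np.ndarray, k: int = 3):
    """Return indices of the k entities most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    e = entity_vecs / np.linalg.norm(entity_vecs, axis=1, keepdims=True)
    sims = e @ q                     # cosine similarity of each entity to the query
    return np.argsort(-sims)[:k]    # indices, most similar first
```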
For global search: microsoft.github.io/graphrag/posts/query/notebooks/global_search_nb/
The search here is not vector- or Cypher-query based, as no database is created to store the embeddings.
Instead they use a map-reduce approach.
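A toy sketch of that map-reduce pattern: map each community report to a relevance-scored partial answer (in the real system this is an LLM call per report), then reduce by keeping the top-scored partials and combining them. The keyword-overlap scorer below is a stand-in for the LLM, not GraphRAG's actual scoring:

```python
# Map-reduce global search, with a keyword-overlap stand-in for the LLM scorer.
def score_report(query: str, report: str) -> float:
    """Stand-in relevance score: fraction of query words found in the report."""
    words = query.lower().split()
    return sum(w in report.lower() for w in words) / len(words)

def global_search(query: str, community_reports: list[str], top_n: int = 2) -> str:
    # Map: score every community report against the query independently.
    scored = [(score_report(query, r), r) for r in community_reports]
    # Reduce: keep the top-scored partials and combine them into one answer.
    best = sorted(scored, key=lambda t: t[0], reverse=True)[:top_n]
    return " ".join(r for s, r in best if s > 0)
```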
cc @GautamR-Samagra
Tasks