Tasks

[x] Generating question-answer chunks (Global) from agri PDFs.
[x] Using Raptor to cluster chunks and GPT (autotune) to create more context-rich question-answer pairs.

harshaharod21 commented 3 months ago

Hello I can contribute!

Gautam-Rajeev commented 3 months ago

@xorsuyash please populate a new sheet with English PDFs that you'll use for KG creation

@harshaharod21 Please find older documentation links done by Suyash here :

Llama Index implementation here

dev-SARDAR commented 3 months ago

@xorsuyash @GautamR-Samagra Hi, can I work on these tasks? I have worked on identifying similar questions in a dataset based on Quora queries, and this issue is aligned with my interests.

harshaharod21 commented 3 months ago

@xorsuyash Link to the repo where the given issue is implemented https://github.com/harshaharod21/qa_raptor

Note that for now I have used Llama as the LLM, not OpenAI.

Gautam-Rajeev commented 3 months ago

@xorsuyash

Can we create a question-answer set on these PDFs first:

Listing out PDFs to start with here:

Kharif advisory till page 247
Rabi advisory till page 437
Farmer handbook

Gautam-Rajeev commented 2 months ago

Next steps:

Figure out how to extract and visualize the created KG from the parquet files
Figure out if GraphRAG supports providing an initial ontology while creating the graph
Figure out how the querying engine works for global and local :
- Are they creating cypher queries?
- Are they doing some vector search on the entities, nodes?
Use the Kharif book (first 247 pages) to test out once code is clear.

harshaharod21 commented 2 months ago

1) I have figured out how to visualize the KG. There are three ways given in the issue raised:

Enable umap and graphml in the init_content.py file. With this we get graphml files in the output, which can be visualized with the Gephi software.
Use the notebook to get the visuals.
Use the graphrag visualizer.
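
For a quick sanity check before opening Gephi, a GraphML file can also be inspected with the Python standard library alone. This is a minimal sketch, not part of GraphRAG: the function name `summarize_graphml` and the sample entities are hypothetical, and it only assumes that the output follows the standard GraphML XML namespace.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Standard GraphML namespace; assumed to be what the exported files use.
GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}


def summarize_graphml(source):
    """Return (node_count, edge_count, degree_counter) for a GraphML document.

    `source` may be a file path or a file-like object.
    """
    root = ET.parse(source).getroot()
    node_ids = [node.get("id") for node in root.iterfind(".//g:node", GRAPHML_NS)]
    degree = Counter()
    edge_count = 0
    for edge in root.iterfind(".//g:edge", GRAPHML_NS):
        # Count each edge once for both of its endpoints.
        degree[edge.get("source")] += 1
        degree[edge.get("target")] += 1
        edge_count += 1
    return len(node_ids), edge_count, degree


# Hypothetical sample graph, only for illustration (not real pipeline output).
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph edgedefault="undirected">
    <node id="PADDY"/>
    <node id="STEM BORER"/>
    <node id="KHARIF"/>
    <edge source="PADDY" target="STEM BORER"/>
    <edge source="PADDY" target="KHARIF"/>
  </graph>
</graphml>"""

node_count, edge_count, degree = summarize_graphml(io.StringIO(SAMPLE))
print(node_count, edge_count, degree.most_common(1))
```

To use it on a real run, pass the path of one of the generated .graphml files instead of the in-memory sample.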

2) I looked at the source code but could not find a way to include base entities, except through prompt auto tuning, where we can provide the domain for entity extraction. A similar issue has also been raised for this: [Feature Request]: Prompt Tuning with given entities · Issue #1010 · microsoft/graphrag (github.com)

Update: This is the response I got on the issue where I commented about base entities: "In the settings.yaml there is the entity_extraction part that contains the entity_types field where you can specify the types of entities you want the LLM to extract, but they are only taken into consideration more as a suggestion when indexing and completely ignored when prompt tuning."

3) Indexing and querying: The indexing pipeline is configurable. It is composed of workflows, standard and custom steps, prompt templates, and input/output adapters. The pipeline is designed to:

extract entities, relationships and claims from raw text
perform community detection in entities
generate community summaries and reports at multiple levels of granularity
embed entities into a graph vector space
embed text chunks into a textual vector space

The output of the pipeline is JSON and parquet files.

Querying: For local search, they have a vector store (lancedb), so it likely uses vector-based similarity search: microsoft.github.io/graphrag/posts/query/notebooks/local_search_nb/
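
As an illustration of what vector-based similarity search over entities means, here is a minimal sketch in plain Python. It is not GraphRAG's or lancedb's implementation: the function names, the toy 3-dimensional embeddings, and the entity names are all hypothetical, and cosine similarity is assumed as the distance measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_entities(query_vec, entity_vecs, k=2):
    """Rank entities by similarity to the query embedding and keep the best k."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in entity_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings for illustration only; real ones come from an embedding model.
ENTITY_VECS = {
    "PADDY": [0.9, 0.1, 0.0],
    "WHEAT": [0.1, 0.9, 0.0],
    "STEM BORER": [0.7, 0.0, 0.3],
}

for name, score in top_k_entities([1.0, 0.0, 0.0], ENTITY_VECS):
    print(f"{name}: {score:.3f}")
```

In a real setup the query would be embedded with the same model as the entities, and the vector store would handle the ranking.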

For global : microsoft.github.io/graphrag/posts/query/notebooks/global_search_nb/

The search here is not vector or Cypher query based, as no database is created to store the embeddings.

Instead they use a map-reduce approach here.

Map Phase: Divides the data into manageable chunks and processes them independently.
Reduce Phase: Aggregates the intermediate results from the map phase to produce the final output.

Both steps are done by the LLM, which is why they mention that "This is a resource-intensive method, but often gives good responses for questions that require an understanding of the dataset as a whole".
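
The map and reduce phases can be sketched roughly as follows. This is a simplified illustration, not GraphRAG's code: `global_search`, the prompt wording, and `batch_size` are hypothetical, the chunks are assumed to be batches of community reports, and `llm` stands for any callable that takes a prompt string and returns a string.

```python
def global_search(question, community_reports, llm, batch_size=2):
    """Answer a question with one LLM call per batch (map) and one final call (reduce)."""
    # Map phase: each batch of reports is processed independently.
    partial_answers = []
    for start in range(0, len(community_reports), batch_size):
        batch = community_reports[start:start + batch_size]
        prompt = f"Question: {question}\nReports:\n" + "\n".join(batch)
        partial_answers.append(llm(prompt))

    # Reduce phase: the intermediate answers are aggregated into one response.
    reduce_prompt = (
        f"Question: {question}\nPartial answers:\n" + "\n".join(partial_answers)
    )
    return llm(reduce_prompt)


def echo_llm(prompt):
    """Stand-in for a real model call; it only reports the prompt size."""
    return f"summary of {len(prompt.splitlines())} lines"


print(global_search("Which pests affect paddy?", ["report A", "report B", "report C"], echo_llm))
```

With n batches this makes n + 1 LLM calls per question, which is where the resource cost comes from.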

Samagra-Development / ai-tools

Creating a test set of question and answers for testing capabilities #323
