Samagra-Development / ai-tools

AI Tooling to bootstrap applications fast

Creating a test set of question and answers for testing capabilities #323

Open xorsuyash opened 3 months ago

xorsuyash commented 3 months ago

cc @GautamR-Samagra

Tasks

harshaharod21 commented 3 months ago

Hello, I can contribute!

Gautam-Rajeev commented 3 months ago

@xorsuyash please populate a new sheet with English PDFs that you'll use for KG creation

@harshaharod21 Please find the older documentation links compiled by Suyash here:

dev-SARDAR commented 3 months ago

@xorsuyash @GautamR-Samagra Hi, can I work on these tasks? I have worked on identifying similar questions in a dataset based on Quora queries. This issue is aligned with my interests.

harshaharod21 commented 3 months ago

@xorsuyash Link to the repo where the given issue is implemented: https://github.com/harshaharod21/qa_raptor

Note that for now I have used Llama as the LLM, not OpenAI.

Gautam-Rajeev commented 3 months ago

@xorsuyash

Can we create a question-answer set on these PDFs first:

Listing out PDFs to start with here:

Gautam-Rajeev commented 2 months ago

Next steps:

harshaharod21 commented 2 months ago

Next steps:

  • Figure out how to extract and visualize the created KG from the parquet files
  • Figure out if GraphRAG supports providing an initial ontology while creating the graph
  • Figure out how the querying engine works for global and local search:

    • Are they creating cypher queries?
    • Are they doing some vector search on the entities, nodes?
  • Use the Kharif book (first 247 pages) to test out once code is clear.

1) I have figured out how to visualize the KG. There are three ways given in the issue raised:
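One possible route, not necessarily among the three referenced above, is to export the relationship rows to Graphviz DOT text and render that with any DOT viewer. The sketch below is only an illustration: it assumes each row carries `source`, `target`, and `weight` fields (an assumption about the relationships parquet output, to be checked against the actual files), and the sample rows and the helper name `relationships_to_dot` are hypothetical.

```python
from typing import Iterable, Mapping


def relationships_to_dot(rows: Iterable[Mapping[str, object]]) -> str:
    """Render (source, target, weight) rows as an undirected Graphviz DOT graph."""

    def quote(name: object) -> str:
        # Escape backslashes and double quotes so entity names stay valid DOT strings.
        return '"' + str(name).replace("\\", "\\\\").replace('"', '\\"') + '"'

    lines = ["graph kg {"]
    for row in rows:
        weight = row.get("weight", 1.0)  # fall back to 1.0 when no weight is present
        lines.append(
            f"  {quote(row['source'])} -- {quote(row['target'])} [weight={weight}];"
        )
    lines.append("}")
    return "\n".join(lines)


# Hypothetical rows. In practice they might be loaded from the relationships
# parquet file, e.g. with pandas.read_parquet(...).to_dict("records").
rows = [
    {"source": "PADDY", "target": "KHARIF", "weight": 2.0},
    {"source": "PADDY", "target": "STEM BORER", "weight": 1.0},
]
dot_text = relationships_to_dot(rows)
print(dot_text)
```

The resulting text can be saved to a `.dot` file and opened with Graphviz or any tool that reads DOT.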

2) I looked at the source code but could not find a way to include base entities, except through prompt auto-tuning, where we can provide the domain for entity extraction. A similar issue has also been raised for this: [Feature Request]: Prompt Tuning with given entities · Issue #1010 · microsoft/graphrag ([github.com](http://github.com/))

Update: This is the response I got on the issue where I commented about base entities: "In the settings.yaml there is the entity_extraction part that contains the entity_types field where you can specify the types of entities you want the LLM to extract, but they are only taken into consideration more as a suggestion when indexing and completely ignored when prompt tuning."

3) Indexing and querying: The indexing pipeline is configurable. It is composed of workflows, standard and custom steps, prompt templates, and input/output adapters. The pipeline is designed to:

Querying: For local search, they have a vector store (LanceDB), so it likely uses vector-based similarity search: microsoft.github.io/graphrag/posts/query/notebooks/local_search_nb/
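To make "vector-based similarity search" concrete, here is a minimal sketch of ranking entities by cosine similarity to a query embedding. It is a toy illustration, not GraphRAG's or LanceDB's code: the three-dimensional vectors, the entity names, and the helper `top_k_entities` are all made up for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_entities(
    query_vec: Sequence[float],
    entity_vecs: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k entities whose embeddings are most similar to the query."""
    scored = [
        (name, cosine_similarity(query_vec, vec)) for name, vec in entity_vecs.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings; a real system would get these from an embedding model.
entity_vecs = {
    "PADDY": [0.9, 0.1, 0.0],
    "STEM BORER": [0.7, 0.6, 0.1],
    "TRACTOR": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.2, 0.0]
results = top_k_entities(query_vec, entity_vecs, k=2)
print(results)
```

A vector store does the same kind of ranking at scale, usually with an index instead of a full scan.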

For global : microsoft.github.io/graphrag/posts/query/notebooks/global_search_nb/

The search here is not based on vectors or Cypher queries, as no database is created to store the embeddings.

Instead, they use a map-reduce approach here.
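A rough sketch of what a map-reduce style search could look like: the map step asks a rater for scored points from each community report, and the reduce step keeps the highest-scoring points and joins them into one answer. This is a simplified illustration under stated assumptions, not GraphRAG's implementation: `keyword_rater` is a hypothetical stand-in for the LLM call, and the reports are invented.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[str, int]  # (statement, importance score)


def map_step(
    query: str,
    reports: Sequence[str],
    rate: Callable[[str, str], List[Point]],
) -> List[Point]:
    """Map: collect scored points from every report independently."""
    points: List[Point] = []
    for report in reports:
        points.extend(rate(query, report))
    return points


def reduce_step(points: Sequence[Point], max_points: int = 3) -> str:
    """Reduce: drop zero-score points, keep the top ones, and merge them."""
    useful = [p for p in points if p[1] > 0]
    kept = sorted(useful, key=lambda p: p[1], reverse=True)[:max_points]
    return "\n".join(f"- {text} (score {score})" for text, score in kept)


def keyword_rater(query: str, report: str) -> List[Point]:
    """Hypothetical stand-in for an LLM: score = number of query words in the report."""
    words = set(query.lower().split())
    score = sum(1 for word in words if word in report.lower())
    return [(report, score)]


# Invented community reports for illustration only.
reports = [
    "Paddy is sown in the kharif season.",
    "Stem borer is a major pest of paddy.",
    "Tractors reduce labour costs.",
]
answer = reduce_step(map_step("paddy pest", reports, keyword_rater))
print(answer)
```

Because each report is rated on its own, the map step can run in parallel, and no embedding database is needed for this style of query.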