Marker-Inc-Korea / AutoRAG

AutoML tool for RAG
https://auto-rag.com/
Apache License 2.0

[Data Creation Refactoring] Re-design QA generation process #613

Closed vkehfdl1 closed 1 month ago

vkehfdl1 commented 2 months ago

The current library is a bit complicated when you want to build a custom QA generation process. Making a custom QA creation process must be easy.

"The process will change a lot.”

I think the legacy code will be deprecated and deleted in AutoRAG 0.3.0.

vkehfdl1 commented 2 months ago

It can be a kind of monad (maybe? I'm not sure it is strictly a monad). There is a dataclass; let's name it Dataset. Dataset contains qa and corpus. Every function on Dataset must be a pure function: its input is a Dataset instance, and its output is a new, modified Dataset instance. We can load and chunk from class functions, then make questions and make answers. After that, you can convert it to qa_df and corpus_df (the dataframes AutoRAG uses).

For example (concept):

(
    Dataset.from_directory('path/to/dir')
    .recursive_chunk(chunk_size=512, overlap=128)
    .sample_corpus(n=500)
    .create_factoid_question(n=50)
    .create_basic_answer()
    .save_to_parquet(qa_data_path='./data/qa.parquet', corpus_data_path='./data/corpus.parquet')
)
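
A minimal sketch of that chainable, pure-function style (the names here are hypothetical, and only sample_corpus and save_to_parquet are fleshed out):

from dataclasses import dataclass

import pandas as pd

@dataclass(frozen=True)  # frozen: methods cannot mutate self, so they stay pure
class Dataset:
    qa: pd.DataFrame
    corpus: pd.DataFrame

    def sample_corpus(self, n: int) -> "Dataset":
        # Return a new Dataset instead of mutating this one.
        return Dataset(qa=self.qa, corpus=self.corpus.sample(n=n))

    def save_to_parquet(self, qa_data_path: str, corpus_data_path: str) -> "Dataset":
        self.qa.to_parquet(qa_data_path)
        self.corpus.to_parquet(corpus_data_path)
        return self  # returning self keeps the chain alive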
vkehfdl1 commented 1 month ago

The basic workflow:

  1. Parse the raw documents and make a Raw instance
  2. Chunk the initial corpus from Raw
  3. Sample some chunks for question generation
  4. Generate questions from the selected chunks
  5. Generate answers and finish the initial QA set
  6. Add more Corpus instances to try different chunking methods
  7. Map a new retrieval_gt from each new Corpus onto the QA

It looks like the code below.

from typing import List

import pandas as pd
from openai import AsyncOpenAI

# Raw, Corpus, QA, recursive_split, random_single_hop, factoid_query_gen, and
# make_concise_gen_gt are the data-creation building blocks proposed in this
# issue (import paths depend on the released module layout).

openai_client = AsyncOpenAI()

# Steps 1-2: load the parsing result and make the initial chunks.
parsing_result = Raw(pd.read_parquet('./parse.parquet'))
initial_corpus = parsing_result.chunk(lambda data: recursive_split(data, chunk_size=128, chunk_overlap=24))

# Steps 3-5: sample chunks, generate factoid questions, then concise generation gt.
initial_qa = initial_corpus.sample(
    lambda data: random_single_hop(data, n=50)
).batch_apply(
    lambda row: factoid_query_gen(row, openai_client, lang='ko')
).batch_apply(
    lambda row: make_concise_gen_gt(row, openai_client, lang='ko')
)

# Steps 6-7: re-chunk with other methods and remap retrieval_gt onto each new corpus.
corpus_list: List[Corpus] = parsing_result.chunk_pipeline('./chunk.yaml')
qa_list: List[QA] = list(map(lambda corpus: initial_qa.update_corpus(corpus), corpus_list))
for i, qa in enumerate(qa_list):
    qa.to_parquet(qa_path=f'./qa_{i}.parquet', corpus_path=f'./corpus_{i}.parquet')
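
For step 7, update_corpus has to remap each QA row's retrieval_gt from old chunk ids to ids in the new Corpus. A rough sketch of one way to do it, assuming every corpus row keeps a raw_id pointing at the raw document it was chunked from (the column names are illustrative, not the final schema):

import pandas as pd

def remap_retrieval_gt(qa_df: pd.DataFrame, old_corpus: pd.DataFrame,
                       new_corpus: pd.DataFrame) -> pd.DataFrame:
    # old chunk id -> raw doc id, and raw doc id -> list of new chunk ids
    old_to_raw = old_corpus.set_index('doc_id')['raw_id'].to_dict()
    raw_to_new = new_corpus.groupby('raw_id')['doc_id'].apply(list).to_dict()

    def remap(gt_ids):
        # Coarse mapping: take every new chunk that came from the same raw
        # document. A finer version could match by character-span overlap.
        return [new_id for gt in gt_ids for new_id in raw_to_new.get(old_to_raw[gt], [])]

    qa_df = qa_df.copy()
    qa_df['retrieval_gt'] = qa_df['retrieval_gt'].apply(remap)
    return qa_df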
vkehfdl1 commented 1 month ago

All tasks are done. I will write the documentation on this branch as the last step. @bwook00