Closed vkehfdl1 closed 1 month ago
It can be a kind of monad (though whether it is strictly a monad, I don't know).
There is a dataclass. Let's name it... `Dataset`. A `Dataset` contains `qa` and `corpus`. Every function in `Dataset` returns a `Dataset` instance, but a modified one. Every function must be a pure function: its input is a `Dataset` instance, and its output is a modified `Dataset` instance.
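The pure, chainable style described above can be sketched with a frozen dataclass. This is a minimal illustration, not the real AutoRAG API; the method names `sample_corpus` and `add_qa` are hypothetical stand-ins:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Dataset:
    """Immutable container; every method returns a new, modified instance."""
    qa: tuple = ()
    corpus: tuple = ()

    def sample_corpus(self, n: int) -> "Dataset":
        # Pure function: no mutation, just a new Dataset with a smaller corpus.
        return replace(self, corpus=self.corpus[:n])

    def add_qa(self, question: str, answer: str) -> "Dataset":
        # Also pure: the original Dataset is left untouched.
        return replace(self, qa=self.qa + ((question, answer),))


ds = Dataset(corpus=("doc1", "doc2", "doc3"))
ds2 = ds.sample_corpus(2).add_qa("q", "a")
print(len(ds.corpus), len(ds2.corpus), len(ds2.qa))  # → 3 2 1
```

Because `frozen=True` forbids mutation, each step in a chain is forced to build a new instance, which is what makes long `a().b().c()` pipelines safe to reorder and reuse.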
We can load and chunk from class functions. We can make questions and make answers. After that, you can convert it to `qa_df` and `corpus_df` (dataframes for AutoRAG).
For example (concept):

```python
(
    Dataset.from_directory('path/to/dir')
    .recursive_chunk(chunk_size=512, overlap=128)
    .sample_corpus(n=500)
    .create_factoid_question(n=50)
    .create_basic_answer()
    .save_to_parquet(qa_data_path='./data/qa.parquet', corpus_data_path='./data/corpus.parquet')
)
```
There are three classes: `Raw`, `Corpus`, and `QA`. You can set a different `Corpus` at `QA` to use different chunking methods. Looks like this below.
```python
from typing import List

import pandas as pd
from openai import AsyncOpenAI

openai_client = AsyncOpenAI()

# Wrap the parsed parquet file as a Raw instance.
parsing_result = Raw(pd.read_parquet('./parse.parquet'))

# Chunk the raw data into an initial corpus.
initial_corpus = parsing_result.chunk(
    lambda data: recursive_split(data, chunk_size=128, chunk_overlap=24)
)

# Sample passages, then generate questions and concise ground-truth answers.
initial_qa = initial_corpus.sample(
    lambda data: random_single_hop(data, n=50)
).batch_apply(
    lambda row: factoid_query_gen(row, openai_client, lang='ko')
).batch_apply(
    lambda row: make_concise_gen_gt(row, openai_client, lang='ko')
)

# Re-chunk with every method in chunk.yaml, and attach each corpus to the QA set.
corpus_list: List[Corpus] = parsing_result.chunk_pipeline('./chunk.yaml')
qa_list: List[QA] = [initial_qa.update_corpus(corpus) for corpus in corpus_list]

for i, qa in enumerate(qa_list):
    qa.to_parquet(qa_path=f'./qa_{i}.parquet', corpus_path=f'./corpus_{i}.parquet')
```
All tasks are done. I will do the documentation on this branch for the last time. @bwook00
The current library is a little bit complicated if you want to make a custom QA generation process. Making a custom QA creation process must be easy.
"The process will change a lot."
I think the legacy code will be deprecated and deleted in AutoRAG 0.3.0.