[ ] receive the doc (pdf) from the user - @xorsuyash
[ ] pass it to autotune - @xorsuyash
[ ] Document service will chunk the PDF and return the chunks as JSON - @sooraj1002 to share the API with @xorsuyash and hand over how to use it
[ ] create a list of prompts, one per chunk, e.g. 'create 2 questions from this chunk : {chunk}':
prompt 1: 'create 2 questions from this chunk : {chunk1}'
prompt 2: 'create 2 questions from this chunk : {chunk2}' @xorsuyash
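The prompt-list step above can be sketched as follows, assuming `chunks` is the list of text chunks returned by the document service:

```python
def build_prompts(chunks, questions_per_chunk=2):
    """Build one question-generation prompt per chunk."""
    template = "create {n} questions from this chunk : {chunk}"
    return [template.format(n=questions_per_chunk, chunk=c) for c in chunks]

chunks = ["First chunk of the PDF.", "Second chunk of the PDF."]
prompts = build_prompts(chunks)
```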
[ ] send the list of prompts to autotune; it will return a JSON/CSV of question-answer pairs. Create train/eval/test splits as specified by the user - @sooraj1002 will hand over to @xorsuyash
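The split step might look like this - a sketch assuming the user supplies fractions and the question-answer pairs arrive as a list of dicts:

```python
import random

def split_qa_pairs(pairs, train=0.8, eval_=0.1, test=0.1, seed=42):
    # Shuffle deterministically, then slice by the requested fractions.
    assert abs(train + eval_ + test - 1.0) < 1e-9
    pairs = pairs[:]
    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    n_train = int(n * train)
    n_eval = int(n * eval_)
    return {
        "train": pairs[:n_train],
        "eval": pairs[n_train:n_train + n_eval],
        "test": pairs[n_train + n_eval:],
    }

splits = split_qa_pairs([{"q": f"q{i}", "a": f"a{i}"} for i in range(10)])
```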
[ ] measure the current retrieval accuracy of the 'pre-trained' model by retrieving over the chunks for each question and marking the chunk the question was generated from as the one to be retrieved @xorsuyash
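A minimal sketch of that accuracy measurement, assuming question and chunk embeddings are already computed (e.g. with the pre-trained model) and each question's source chunk index is its gold label:

```python
import math

def top1_retrieval_accuracy(question_vecs, chunk_vecs, gold_chunk_ids):
    # For each question, retrieve the most similar chunk by cosine similarity
    # and check it against the chunk the question was generated from.
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    hits = 0
    for qv, gold in zip(question_vecs, gold_chunk_ids):
        best = max(range(len(chunk_vecs)), key=lambda i: cos(qv, chunk_vecs[i]))
        hits += best == gold
    return hits / len(question_vecs)
```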
[ ] create triplets of {q, positive_doc, negative_doc} for each question using the chunks and the Q-A dataset. @TakshPanchal will hand over to @xorsuyash
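The triplet construction can be sketched like this; the negative-sampling policy (a random other chunk) is an assumption:

```python
import random

def build_triplets(question_to_chunk, chunks, seed=0):
    # question_to_chunk maps each question to the index of the chunk it was
    # generated from. The positive doc is that chunk; the negative doc is a
    # randomly chosen *other* chunk.
    rng = random.Random(seed)
    triplets = []
    for q, pos_idx in question_to_chunk.items():
        neg_idx = rng.choice([i for i in range(len(chunks)) if i != pos_idx])
        triplets.append({
            "q": q,
            "positive_doc": chunks[pos_idx],
            "negative_doc": chunks[neg_idx],
        })
    return triplets
```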
[ ] this dataset will be shared and uploaded to HF @xorsuyash
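One way to serialise the triplets before the upload - a sketch; the actual push could go through `datasets.Dataset.push_to_hub`, and the repo id shown is hypothetical:

```python
import csv
import io

def triplets_to_csv(triplets):
    # Write the triplet rows to CSV so they can be loaded as a HF dataset.
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["q", "positive_doc", "negative_doc"])
    writer.writeheader()
    writer.writerows(triplets)
    return buf.getvalue()

out = triplets_to_csv([{"q": "q1", "positive_doc": "c0", "negative_doc": "c1"}])

# Pushing to the Hub (not run here) might then be:
# from datasets import load_dataset
# ds = load_dataset("csv", data_files="triplets.csv")
# ds.push_to_hub("org/retrieval-triplets")  # hypothetical repo id
```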
[ ] Embedding training integration with autotune: the HF dataset link will be shared with autotune along with the pretrained model. Autotune will fine-tune the model and upload the fine-tuned model to HF (the upload must update an existing repo of fine-tuned models rather than create a new one).
the user should again pass train/test/eval splits as part of the Trainer config to autotune; autotune will by default pick up the 'train', 'test' and 'eval' splits, and raise a flag if the 'eval'/'test' split is missing.
the Trainer class has an 'eval' split argument, which overrides the other datasets (mirror HF's Trainer class).
the commit history of model updates should record the dataset used to train, evaluation logs, timestamp etc. The repo commits should provide version control so that an older model can be picked up and set as latest again if necessary. @xorsuyash and @TakshPanchal
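The versioned commit described above could be composed like this - a sketch assuming `huggingface_hub` handles the actual push; the repo ids and metric names are placeholders:

```python
from datetime import datetime, timezone

def build_commit_message(dataset_repo, eval_metrics):
    # Record the training dataset, eval results, and a timestamp in the commit
    # message, so older model versions can be found and re-promoted later.
    ts = datetime.now(timezone.utc).isoformat()
    metrics = ", ".join(f"{k}={v}" for k, v in sorted(eval_metrics.items()))
    return f"fine-tune on {dataset_repo} | eval: {metrics} | {ts}"

msg = build_commit_message("org/retrieval-triplets", {"top1_acc": 0.87})

# The upload (not run here) might then be:
# from huggingface_hub import HfApi
# HfApi().upload_folder(folder_path="model/", repo_id="org/finetuned-embeddings",
#                       commit_message=msg)
```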
[x] schedule train @TakshPanchal
[ ] my trained model should not be pushed to HF as the 'latest model - model to be picked up' unless the eval results are validated by the user @TakshPanchal
[ ] Autotune should allow me to update an existing dataset with another HF dataset by passing the 2 HF dataset links @sooraj1002
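With the `datasets` library this is likely `concatenate_datasets` over the two loaded repos; the merge semantics can be sketched in plain Python (the deduplication policy is an assumption):

```python
def merge_datasets(rows_a, rows_b, dedupe_on=None):
    # Append rows_b after rows_a; optionally drop rows from rows_b whose
    # key field already appears in rows_a.
    if dedupe_on is None:
        return rows_a + rows_b
    seen = {r[dedupe_on] for r in rows_a}
    return rows_a + [r for r in rows_b if r[dedupe_on] not in seen]

merged = merge_datasets([{"q": "q1"}], [{"q": "q1"}, {"q": "q2"}], dedupe_on="q")
```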
[ ] fine-tuned model retrieval accuracy should also be measured @xorsuyash
[ ] Add an API for embedding models - calls the PDF parser -> creates the negative questions to build triplets -> fine-tunes a model (this is for sentence similarity - optionally a document) @xorsuyash
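The API's flow could be wired as below, with each stage injected as a callable so the parser, triplet builder, and trainer stay swappable; all names here are hypothetical:

```python
def embedding_finetune_pipeline(pdf_path, parse_pdf, make_triplets, finetune):
    # PDF parser -> triplets (incl. negatives) -> fine-tuned model id.
    chunks = parse_pdf(pdf_path)
    triplets = make_triplets(chunks)
    return finetune(triplets)

# Toy run with stub stages:
model_id = embedding_finetune_pipeline(
    "doc.pdf",
    parse_pdf=lambda p: ["chunk-1", "chunk-2"],
    make_triplets=lambda cs: [{"q": "q", "positive_doc": cs[0], "negative_doc": cs[1]}],
    finetune=lambda ts: f"model-trained-on-{len(ts)}-triplets",
)
```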
https://app.diagrams.net/#G11tk9s4YZBIvWqAmBB6_8pvo0dppWTGbi#%7B%22pageId%22%3A%22uskC_wnftH2uWe6gHLMZ%22%7D
label studio playground