Fine-tune an extractive Question Answering model on our own dataset

SallyMcGrath commented 1 year ago

I want to do this:

I want to fine-tune an extractive QA model on a CodeYourFuture dataset pulled from our docs and our Slack conversations.

Here’s why I want to do it:

I want to provide a SMALL hosted model (probably an autotrain on our Huggingface org to make it as approachable as possible) that trainees can build interfaces to interact with
I want to provide an exciting but achievable final project for trainees
I want to create a "taster" of AI and ML that is relevant to our course and illuminating for trainees, who have expressed a lot of interest in this emerging field
I want to reduce the burden on staff of constantly answering the same 5 questions over and over on Slack

Here’s how it serves our goals:

All our trainees need lots more practice asking good questions and evaluating the answers; projects that create spaces for dialogic pedagogy are needed
We Believe in Collective Intelligence
An exciting final project should improve performance in hiring: good jobs in tech

This is how much time I can put towards this change:

I have stubbed a dataset to think about this more. I can spend up to six 1-hour sessions on this
I have reached out to some ML/AI experts I know to get advice. I will spend up to four 2-hour sessions getting advice
I have created an organisation on Huggingface, please join! https://huggingface.co/CodeYourFuture
I have drafted (SUPER DRAFTY) an example final project that interacts with this hypothetical model

This is the help I need from others to get this done (if any):

ML devs to evaluate and instruct on the preparation of data
trainees and interns to prepare data (approx 3 hours each)
Any interested parties to build small proof of concepts to share
Possibly a sponsor to fund training cycles if we find they are needed

Emeka1993 commented 1 year ago

Hi @SallyMcGrath I want to know more about this project!

SallyMcGrath commented 1 year ago

Hi hi @Emeka1993 . What do you want to know? I have put as much info as I have on the ticket - what question do you have, specifically?

Emeka1993 commented 1 year ago

What tools are we going to use to prepare the data? And when are we going to start the project?

SallyMcGrath commented 1 year ago

We will start the project when enough people commit time to doing it. I've put what I can offer on the ticket. Other people need to pitch in with what they can offer. When we feel like we have enough to get started, we can begin. (The main page has some more on this philosophy) https://github.com/CodeYourFuture/Changes#readme

How to prep the data is still an open question. First we need to properly understand how we can prep the data, then we can build a system to execute this at scale. I have reached out to some AI/ML people I know (would love more people to pitch in!) to get some advice around this. Once we understand how to tag our data, we will pull at first from our public docs (https://docs.codeyourfuture.io/) and the drive, and I will donate a bunch of my own Slack data. Longer term, if this works out, we'd want to build a way to update the model from our Slack (on an opt-in model!).

The linked dataset shows our most commonly hit pages broken up into pieces to help us think through this. The format of this dataset matches the SQuAD format (https://rajpurkar.github.io/SQuAD-explorer/), as I (initially) propose we fine-tune a SQuAD-style extractive QA model.
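
For anyone preparing data, here is a minimal sketch of what one SQuAD v1.1-style record looks like, plus a check that each answer really sits at the character offset it claims. The page text, title, and id below are hypothetical examples, not taken from the real CodeYourFuture docs, and `misaligned_answers` is an illustrative helper name, not part of any library.

```python
import json

# Hypothetical example text; not taken from the real CodeYourFuture docs.
context = "Classes run every Saturday. Trainees should post blockers in Slack before class."
answer_text = "every Saturday"

# One "article" in SQuAD v1.1 layout: title -> paragraphs -> context + qas.
record = {
    "title": "example-page",
    "paragraphs": [
        {
            "context": context,
            "qas": [
                {
                    "id": "example-0001",
                    "question": "When do classes run?",
                    "answers": [
                        {
                            "text": answer_text,
                            # Extractive QA needs the character offset of the answer in the context.
                            "answer_start": context.index(answer_text),
                        }
                    ],
                }
            ],
        }
    ],
}


def misaligned_answers(article):
    """Return the ids of questions whose answer text is not found at answer_start."""
    bad = []
    for paragraph in article["paragraphs"]:
        ctx = paragraph["context"]
        for qa in paragraph["qas"]:
            for answer in qa["answers"]:
                start = answer["answer_start"]
                if ctx[start:start + len(answer["text"])] != answer["text"]:
                    bad.append(qa["id"])
    return bad


dataset = {"version": "1.1", "data": [record]}
print(json.dumps(dataset, indent=2))
print(misaligned_answers(record))
```

Running a check like this over hand-tagged data should catch the most common tagging mistake, an answer whose offset drifted after the page text was edited.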

Emeka1993 commented 1 year ago

Thanks for the explanation, so once we have enough people, I'll be ready to jump in and start working on the project!

SallyMcGrath commented 1 year ago

Update 👀

A domain expert has looked into this plan and has come up with a better one:

Chaining together semantic search and text generation to create a generative QA. Here's an explanation on NLP Cloud.

Here's the proposed model example on Hugging Face https://huggingface.co/spaces/deepset/retrieval-augmentation-svb

That's an OSS system made by Deepset / Haystack https://github.com/deepset-ai
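
To make the idea of chaining search and generation concrete, here is a rough sketch of the shape of such a pipeline. It does not use Haystack: word-count cosine similarity stands in for real semantic search, and `echo_generator` is a placeholder for a real text generation model. All function names and document texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, docs, top_k=1):
    """Step 1 (search): rank doc names by similarity to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(docs, key=lambda name: cosine(q, Counter(tokenize(docs[name]))), reverse=True)
    return ranked[:top_k]


def build_prompt(query, passages):
    """Step 2: put the retrieved passages and the question into one prompt."""
    joined = "\n\n".join(passages)
    return (
        "Answer the question using only the context.\n\n"
        f"Context:\n{joined}\n\nQuestion: {query}\nAnswer:"
    )


def answer(query, docs, generate):
    """Step 3 (generation): hand the prompt to whatever generator is plugged in."""
    names = retrieve(query, docs)
    prompt = build_prompt(query, [docs[n] for n in names])
    return generate(prompt), names


# Hypothetical documents, standing in for pages from our docs.
docs = {
    "laptops": "Trainees can request a laptop loan by filling in the equipment form.",
    "attendance": "Tell your class lead on Slack if you will miss a Saturday class.",
}


def echo_generator(prompt):
    """Placeholder for a real text generation model."""
    return f"[generator would see {len(prompt)} characters]"


reply, sources = answer("How do I request a laptop?", docs, echo_generator)
print(sources, reply)
```

The point of the split is that the generator only ever sees the retrieved passage, so the quality of the search step limits the quality of the answer.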

Next steps

Take our example queries, feed them into a suitable search engine, and check it finds the right doc.
Check that the doc it finds is not too big (2.5k).
Do whatever the hf-equivalent of forking a space is.
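
The first two steps above could be sketched as a small check script. This is only an illustration: `keyword_search` is a crude stand-in for whichever search engine we pick, the documents and queries are hypothetical, and the size limit assumes "2.5k" means 2,500 characters, which has not been confirmed (it may mean tokens or something else).

```python
import re

# ASSUMPTION: "2.5k" is read here as 2,500 characters; the unit is not confirmed.
MAX_DOC_CHARS = 2500


def tokens(text):
    """Return the set of lowercase alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_search(query, docs):
    """Stand-in search engine: return the doc name sharing the most words with the query."""
    q = tokens(query)
    return max(docs, key=lambda name: len(q & tokens(docs[name])))


def check_queries(examples, docs, search=keyword_search, max_chars=MAX_DOC_CHARS):
    """For each (query, expected doc) pair, record whether search found it and whether it fits."""
    report = []
    for query, expected in examples:
        found = search(query, docs)
        report.append({
            "query": query,
            "found": found,
            "right_doc": found == expected,
            "small_enough": len(docs[found]) <= max_chars,
        })
    return report


# Hypothetical documents and example queries.
docs = {
    "laptops": "Trainees can request a laptop loan by filling in the equipment form.",
    "attendance": "Tell your class lead on Slack if you will miss a Saturday class.",
    "handbook": "Welcome to the course. " * 200,
}
examples = [
    ("How do I request a laptop?", "laptops"),
    ("What if I miss a class?", "attendance"),
]

report = check_queries(examples, docs)
for row in report:
    print(row)
```

Passing a different `search` function into `check_queries` would let us compare candidate search engines against the same example queries.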

SallyMcGrath commented 1 year ago

@Emeka1993 in the meantime, you might enjoy working through these step by step tutorials on Haystack:

https://haystack.deepset.ai/tutorials/01_basic_qa_pipeline

Emeka1993 commented 1 year ago

Hi @SallyMcGrath Thank you, I'll definitely check out those tutorials on Haystack!
