LAION-AI / Open-Assistant

OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so.
https://open-assistant.io
Apache License 2.0
37.02k stars 3.23k forks

Create an instruction-detector #143

yk closed 1 year ago

yk commented 1 year ago

There is a lot of conversational data on the web (for example on Twitter, Reddit, etc.), yet only a tiny fraction of it starts with some sort of instruction or request for a task to be fulfilled. We need a system, either a model or a heuristic (or a combination of both), to classify text as "instruction-like", which would allow us to harvest data from a wide variety of places.
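
To make the heuristic option concrete, here is a minimal sketch, not a proposed implementation. The cue words and patterns are hypothetical and chosen only for illustration; a real list would need tuning against actual data.

```python
import re

# Hypothetical cue list (an assumption, not taken from any dataset):
# verbs that often open an imperative instruction.
IMPERATIVE_STARTERS = {
    "write", "explain", "summarize", "translate", "list", "describe",
    "give", "tell", "show", "create", "generate", "help", "find",
}

# Hypothetical patterns for polite requests and "how do I" questions.
REQUEST_PATTERNS = [
    re.compile(r"^(can|could|would|will)\s+(you|someone|anyone)\b", re.IGNORECASE),
    re.compile(r"^(please|pls)\b", re.IGNORECASE),
    re.compile(r"^how\s+(do|can|would|should)\s+(i|we|you)\b", re.IGNORECASE),
]


def is_instruction_like(text: str) -> bool:
    """Return True if the first sentence of `text` looks like an instruction."""
    # Only the opening sentence matters: we want threads that *start* with a request.
    first_sentence = re.split(r"(?<=[.!?])\s+", text.strip(), maxsplit=1)[0]
    if not first_sentence:
        return False
    if any(p.search(first_sentence) for p in REQUEST_PATTERNS):
        return True
    # Fall back to checking whether the first word is a known imperative verb.
    words = re.findall(r"[A-Za-z']+", first_sentence)
    return bool(words) and words[0].lower() in IMPERATIVE_STARTERS


if __name__ == "__main__":
    for sample in ["Write a haiku about autumn.", "I had pizza for lunch today."]:
        print(sample, "->", is_instruction_like(sample))
```

A rule set like this is cheap enough to run over large dumps as a first-pass filter, but it will miss indirect requests and flag some ordinary sentences, so it would likely need a model behind it.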

dhruv2601 commented 1 year ago

Hi @yk, let's do it. From the discussion on #126 I'll be taking this task and proposing a step-by-step solution either tonight or tomorrow.

rohanpatankar926 commented 1 year ago

Hey, hi! Can I also collaborate?

yk commented 1 year ago

Hi @rohanpatankar926 sure. I'm not sure what the best way to go is for that. A suggestion: we let @dhruv2601 come up with an initial implementation, and then iterate on that? @dhruv2601 might then also switch back to #126 while you can improve the detector.

MattiaSangermano commented 1 year ago

Hi, if possible I would like to help with this and/or #126

yk commented 1 year ago

Hi @MattiaSangermano thanks for the interest. See my suggestion above, would that work for you?

MattiaSangermano commented 1 year ago

Yes, at this point I think it's the best way to proceed

totuta commented 1 year ago

@yk hi, I'd also like to contribute. I read the discussion above and think the plan makes a lot of sense.

agoryuno commented 1 year ago

Could zero-shot classification be a solution? "facebook/bart-large-mnli" on HF gives a >0.7 score for @yk's initial post being a request :)
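
A rough sketch of what this could look like with the Hugging Face `transformers` zero-shot pipeline. The candidate labels and the 0.7 cut-off are assumptions for illustration (only the ">0.7" score for a request is mentioned above), and the helper names are hypothetical.

```python
from typing import Any, Callable, Dict, Sequence

# Assumed label set; the exact labels used for the score above are not stated.
CANDIDATE_LABELS = ("request", "statement")


def load_zero_shot_classifier() -> Callable[..., Dict[str, Any]]:
    """Load the zero-shot pipeline (requires `transformers` and a model download)."""
    from transformers import pipeline  # lazy import so the helpers below work without it

    return pipeline("zero-shot-classification", model="facebook/bart-large-mnli")


def request_score(
    text: str,
    classifier: Callable[..., Dict[str, Any]],
    labels: Sequence[str] = CANDIDATE_LABELS,
) -> float:
    """Return the score the classifier assigns to the 'request' label."""
    # The pipeline returns a dict with parallel "labels" and "scores" lists,
    # sorted by score, so we look the label up by name rather than by position.
    result = classifier(text, candidate_labels=list(labels))
    return dict(zip(result["labels"], result["scores"]))["request"]


def is_request(
    text: str,
    classifier: Callable[..., Dict[str, Any]],
    threshold: float = 0.7,  # assumed cut-off, to be tuned on labelled data
) -> bool:
    """Return True if the 'request' score reaches the threshold."""
    return request_score(text, classifier) >= threshold
```

Usage would be `clf = load_zero_shot_classifier()` followed by `is_request(some_text, clf)`. Passing the classifier in as an argument keeps the thresholding logic testable without downloading the model.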

yk commented 1 year ago

Yes, it's probably viable to build an ensemble of things like this. It depends on how far one can get the noise down.
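
One way such an ensemble could be wired up, as a sketch only: each detector returns a score in [0, 1], and a text is kept only when enough detectors agree, which trades recall for less noise. The `Detector` type, the threshold, the vote count, and the two toy detectors are all hypothetical.

```python
from typing import Callable, Sequence

# A detector maps a text to a score in [0, 1]; higher means "more instruction-like".
Detector = Callable[[str], float]


def ensemble_accepts(
    text: str,
    detectors: Sequence[Detector],
    threshold: float = 0.7,  # assumed per-detector cut-off
    min_votes: int = 2,      # assumed number of detectors that must agree
) -> bool:
    """Accept `text` only if at least `min_votes` detectors score it >= threshold."""
    votes = sum(1 for detect in detectors if detect(text) >= threshold)
    return votes >= min_votes


# Toy stand-ins for real detectors (e.g. a heuristic and a zero-shot model).
def starts_with_please(text: str) -> float:
    return 1.0 if text.lower().startswith("please") else 0.0


def ends_with_question_mark(text: str) -> float:
    return 1.0 if text.rstrip().endswith("?") else 0.0


if __name__ == "__main__":
    detectors = [starts_with_please, ends_with_question_mark]
    print(ensemble_accepts("Please can you summarize this?", detectors))
    print(ensemble_accepts("Please.", detectors))
```

Raising `min_votes` or `threshold` lowers the false-positive rate at the cost of discarding more genuine instructions, so both would have to be set against a labelled sample.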

Jmete commented 1 year ago

Hi @dhruv2601, I have written scripts based on #126 to process tweets into conversation threads. Once a model has been trained to detect useful instructions, we could run it on that file to filter it. If you need the file, I can send it to you via Discord. I will also update my fork of the repo soon with the processing code, in case anyone wants to download dumps and try it on their side.

yk commented 1 year ago

@dhruv2601 any updates on this?

dhruv2601 commented 1 year ago

Hey all and @yk, I've trained a model for this task and it works well. Currently, I am testing the model on data beyond the validation set, i.e. on as many instruction styles as possible, with the help of GPT-JT and ChatGPT. It is an iterative process: when I discover new instruction styles, I add them to the training data and repeat.

The action item currently is to prepare a final model, upload it to HF and create a model card and data collection process. Hopefully, I'll update again in a couple of days.

yk commented 1 year ago

@dhruv2601 thanks a lot for the update. Could you check in the code for this somewhere in the repo, e.g. under /model/instruction_detector/, and come to our Discord (ping me there) to give regular updates on it? The issue is that we need to know very accurately what's in the model, both in terms of code and data, in order to use it.

lakshaykc commented 1 year ago

@dhruv2601 Did you use just the Twitter data for the model you've trained, or did you use additional datasets?

dhruv2601 commented 1 year ago

Hi, due to work and school deadlines, I have been a bit delayed in updating this task. I plan to be more active in a couple of days.

huu4ontocord commented 1 year ago

@dhruv2601 I would love to test your model. Please ping in the data channel on Discord.

huu4ontocord commented 1 year ago

@dhruv2601 checking on this again.

Jmete commented 1 year ago

Hello @dhruv2601 , is there any update on the instruction detector model?

totuta commented 1 year ago

@yk, it seems this task would benefit from being sped up. Although @dhruv2601 is already on this, may I spend some time on a minimal viable example?

andreaskoepf commented 1 year ago

This issue has stalled and I'm not sure how relevant it still is, so I'm removing it from the project board for now.