LAION-AI / Open-Assistant

OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so.
https://open-assistant.io
Apache License 2.0

Train on Youtube Video Subtitles #259

Closed wakaztahir closed 1 year ago

wakaztahir commented 1 year ago

I had this idea :

YouTube video subtitles are a pure written form of human speech, and there are so many great videos out there: courses and content that are very precise in teaching and solving different problems, and it all reads very human-like.

I think YouTube subtitles could be a great addition to the training dataset; programming courses would be an especially good fit.

totuta commented 1 year ago

+1 from me. If it becomes a task, I'd like to contribute.

yk commented 1 year ago

I like it; #184 is also related. The issue is that we're trying to build an assistant, not a general language model, so we'll have to be very careful to filter heavily down to real "instruction-fulfillment" interactions. @totuta I've assigned you; @wakaztahir would you also be up for contributing actively?

The output of this task should not be data itself, but code & instructions to scrape, filter, clean, & process, that data.

@christophschuhmann what are the license considerations here?

totuta commented 1 year ago

@wakaztahir, I will work on coming up with an MVP solution, v0.1.

Although I will start working locally, where would be the right place for the scripts to live? @yk?

marianna13 commented 1 year ago

Hi everyone! I already have a working pipeline for YouTube subtitle extraction, and I have already got subtitles for 1.5M+ videos. Regarding copyright, I don't think subtitles raise strict copyright issues (see here).

totuta commented 1 year ago

@marianna13 that's great. Can you share the script, and possibly the results? And is there anything I can help with here?

huu4ontocord commented 1 year ago

When capturing this information, try to think of how you can convert the dialog into instruction->answer pairs if at all possible. So not just one instruction->answer, but a series. For example, if this is a how-to (see HowTo100M), there might be dialog like "How might we make this cake? Well first, we need to get the ingredients. As you can see, I'm doing XYZ. What do we do next? We need to mix, etc." You can then break that dialog up into [instruction->answer][instruction->answer]. If you have a notebook to test it and to show how it is done, that would be good. Please put it into the data_aug folder. Also, the goal is to eventually create code or a recipe to do the full pipeline [scrape -> convert to instructions -> post process/filter -> convert into training data]. Please ping the Discord data channels if you want to discuss.
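
Something like this sketch, for instance (the cue phrases and the helper are purely illustrative, not a finished pipeline component):

import re

# Cue phrases that tend to open a new step in how-to narration.
CUE_PATTERN = re.compile(
    r"(how (?:might|do|can) we [^?.]*\?|what do we do next\?|"
    r"(?:first|next|then|finally),? we need to )",
    re.IGNORECASE,
)

def split_into_steps(transcript):
    """Split a how-to transcript into (instruction, answer) pairs
    wherever a cue phrase starts a new step."""
    pieces = CUE_PATTERN.split(transcript)
    pairs = []
    # pieces alternates [preamble, cue, body, cue, body, ...]
    for cue, body in zip(pieces[1::2], pieces[2::2]):
        if body.strip():
            pairs.append((cue.strip().rstrip(","), body.strip()))
    return pairs

print(split_into_steps(
    "How might we make this cake? Well first, we need to get the "
    "ingredients. What do we do next? We need to mix them."
))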

wakaztahir commented 1 year ago

I'd really love to work on this, but I'm very busy at the moment.

I am really interested in this project.

I also have some other ideas I'd like to add.

I recently scraped the Open Trivia Database using their API, with a Golang script written by OpenAI, and I have thousands of trivia question-answer pairs. They are very straightforward: each question has several incorrect answers and a single correct one, and some are just true/false.

I was also thinking OpenAI could be used to help with better data scraping, since proper data is everything. What better way to train an AI than using another AI?

marianna13 commented 1 year ago

Hi @totuta! I'm just using yt-dlp and multithreading, but if you want I will share the full script :)
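
The core of it is roughly this (a simplified sketch, not the full script; paths and the URL are placeholders):

from concurrent.futures import ThreadPoolExecutor

import yt_dlp

YDL_OPTS = {
    "skip_download": True,       # we only want the subtitle files
    "writesubtitles": True,      # human-uploaded subtitles
    "writeautomaticsub": True,   # fall back to auto-generated ones
    "subtitleslangs": ["en"],
    "subtitlesformat": "vtt",
    "outtmpl": "subs/%(id)s.%(ext)s",
    "quiet": True,
}

def fetch_subs(url):
    with yt_dlp.YoutubeDL(YDL_OPTS) as ydl:
        ydl.download([url])

urls = ["https://www.youtube.com/watch?v=..."]  # the real channel/video list goes here
with ThreadPoolExecutor(max_workers=8) as pool:
    pool.map(fetch_subs, urls)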

marianna13 commented 1 year ago

Hey @ontocord, what do you mean by "convert dialog to instruction"? Do you mean we can build an additional language model specifically for this task?

Shtoner commented 1 year ago

@marianna13 I'm not sure which ones you've already transcribed, but I was just starting on some of the JRE clips and getting them into text files. I talked to @Rallio67 and know they have been doing Lex Fridman and Stephen Wolfram videos too. Maybe there could be a list somewhere, or someone who keeps track of what's already been/being transcribed?

totuta commented 1 year ago

@marianna13 Sure, could you please share the script? Where do you think, in this repo, is the best place to keep it?

@Shtoner I agree with you. We have to come up with a systematic way of keeping track of already-transcribed data. @yk What do you suggest?

marianna13 commented 1 year ago

@Shtoner I also have downloaded JRE and Lex Fridman, as well as other podcasts and shows. I can send you a list of channels I scraped if you want, but it's quite long ;) @totuta sure! Maybe we will add a folder like data scripts

Shtoner commented 1 year ago

Also, I've got a Python script that scrapes a site for NBA player stats from a given year; I might be able to modify it or share it with you to cover other stats, maybe from different websites too. Also, just linking this here so these kinds of similar issues will be seen at or around the same time.

yk commented 1 year ago

@totuta thanks for hanging on :) I'd roughly follow what's outlined here. The description is very extensive. Take what you need and leave the rest :)

totuta commented 1 year ago

@totuta thanks for hanging on :) I'd roughly follow what's outlined here. The description is very extensive. Take what you need and leave the rest :)

Thanks @yk! btw,

@totuta sure! Maybe we will add a folder like data scripts

What is the best practice for me and @marianna13 to have a folder and work there?

I would like to set up directories (under openassistant/) and work there together, eventually pushing to origin/master. What is the recommended way?

huu4ontocord commented 1 year ago

@marianna13

Hey @ontocord, what do you mean by "convert dialog to instruction"? Do you mean we can build an additional language model specifically for this task?

If it is a bunch of dialog about a particular subject, such as fixing a car, you could use a question-generator and question-answering model to generate a QA pair from the background data. You could also detect phrases using either keywords or an embedder, such as "Now to do XYZ, you need to ABC...", and so forth, and convert that to an instruction either by simple rules or few-shot -> "Human: Please give me the steps for doing XYZ. Assistant: Ok, you do ABC".
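
As a rough sketch of the rule-based variant (the pattern and the Human/Assistant template are just examples):

import re

# Matches step-like phrasings such as "Now to do XYZ, you need to ABC."
STEP_PATTERN = re.compile(
    r"(?:now )?to (?P<goal>[^,]+), you need to (?P<steps>[^.]+)\.",
    re.IGNORECASE,
)

def dialog_to_instructions(text):
    examples = []
    for m in STEP_PATTERN.finditer(text):
        examples.append({
            "prompt": f"Human: How do I {m.group('goal').strip()}?",
            "response": f"Assistant: Ok, you {m.group('steps').strip()}.",
        })
    return examples

print(dialog_to_instructions(
    "Now to change the oil, you need to drain the old oil first."
))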

yk commented 1 year ago

I would like to set up directories (under openassistant/) and work there together, eventually pushing to origin/master. What is the recommended way?

I'm not sure where the "not a fork" comes from. I'd recommend creating a fork, working there in a directory under openassistant/datasets/ or so, and then raising a PR to origin/main when you're ready.

huu4ontocord commented 1 year ago

Where are we on this project, @Shtoner @totuta @marianna13? Have we found some good candidate YouTube transcripts to convert to dialog?

marianna13 commented 1 year ago

The problem with YT subtitles is that they're just plain text with no punctuation. Is there a way to convert that to dialog? @ontocord

wakaztahir commented 1 year ago

We should use AI for that. Maybe ChatGPT would be great: "hey chat gpt, convert this plain text to instruction and response that could be fed into another model". But since ChatGPT isn't available, use their other OpenAI models (text-davinci).

This way we could convert thousands of text samples, any text samples, to instruction and response, but we cannot verify the quality of the data.

marianna13 commented 1 year ago

I converted some good (human-generated) captions from Lex Fridman to dialogue. Here's what I got:

A: Right, but I would say, bring it on. If you gaze into the abyss long enough, you see the light, not the darkness. 
B: Are you sure about that? 
A: I'm betting my life on it. 
B: The following is a conversation with Jordan Peterson, an influential psychologist, lecturer, podcast host, and author of "Maps of Meaning", "12 Rules For Life" and "Beyond Order." This is the Lex Fridman podcast, to support it please check out our sponsors in the description. And now dear friends, here's Jordan Peterson. Dostoevsky wrote in "The Idiot", spoken through the character of Prince Myshkin that beauty will save the world. Solzhenitsyn actually mentioned this in his Nobel Prize acceptance speech. What do you think does Dostoevsky meant by that? Was he right? 
A: Well, I guess it's the divine that saves the world, let's say, you could say that by definition. And then you might say, "Well, are there pointers to that which will save the world or that which eternally saves the world?" And the answer to that in all likelihood is yes. And that's maybe truth and love and justice and the classical virtues, beauty perhaps in some sense foremost among them. That's a difficult case to make, but definitely a pointer. 
B: Which direction is the arrow pointing? 
A: Well, the arrow's pointing up. And no, I think that that which it points to is, what beauty points to, it transcends beauty. It's more than beauty. 
B: And that speaks to the divine. 

Then I'm going to use this data to fine-tune a model to generate similar stuff for arbitrary captions. What do you think? What model should I use for that?

wakaztahir commented 1 year ago

I used text-davinci-003. Here's my prompt (https://beta.openai.com/playground):

I will provide you some text , You must understand it and then convert it into question and answer format in json , everything that you learned , here's the text

"A: Right, but I would say, bring it on. If you gaze into the abyss long enough, you see the light, not the darkness. 
B: Are you sure about that? 
A: I'm betting my life on it. 
B: The following is a conversation with Jordan Peterson, an influential psychologist, lecturer, podcast host, and author of "Maps of Meaning", "12 Rules For Life" and "Beyond Order." This is the Lex Fridman podcast, to support it please check out our sponsors in the description. And now dear friends, here's Jordan Peterson. Dostoevsky wrote in "The Idiot", spoken through the character of Prince Myshkin that beauty will save the world. Solzhenitsyn actually mentioned this in his Nobel Prize acceptance speech. What do you think does Dostoevsky meant by that? Was he right? 
A: Well, I guess it's the divine that saves the world, let's say, you could say that by definition. And then you might say, "Well, are there pointers to that which will save the world or that which eternally saves the world?" And the answer to that in all likelihood is yes. And that's maybe truth and love and justice and the classical virtues, beauty perhaps in some sense foremost among them. That's a difficult case to make, but definitely a pointer. 
B: Which direction is the arrow pointing? 
A: Well, the arrow's pointing up. And no, I think that that which it points to is, what beauty points to, it transcends beauty. It's more than beauty. 
B: And that speaks to the divine. "

Here's the answer:

Question: "What does Dostoevsky mean when he said that beauty will save the world?"
Answer: "Dostoevsky was suggesting that the divine will save the world, and that beauty is a pointer to that which will save the world - such as truth, love, justice, and the classical virtues." 

JSON: 
{
    "question": "What does Dostoevsky mean when he said that beauty will save the world?",
    "answer": "Dostoevsky was suggesting that the divine will save the world, and that beauty is a pointer to that which will save the world - such as truth, love, justice, and the classical virtues."
}

The data provided is not really technical; let's see what happens when it's given a Wikipedia paragraph.

Prompt:

I will provide you some text , You must understand it and then convert it into question and answer format in json , everything that you learned , here's the text "The front end analyzes the source code to build an internal representation of the program, called the intermediate representation (IR). It also manages the symbol table, a data structure mapping each symbol in the source code to associated information such as location, type and scope. While the frontend can be a single monolithic function or program, as in a scannerless parser, it was traditionally implemented and analyzed as several phases, which may execute sequentially or concurrently. This method is favored due to its modularity and separation of concerns. Most commonly today, the frontend is broken into three phases: lexical analysis (also known as lexing or scanning), syntax analysis (also known as scanning or parsing), and semantic analysis. Lexing and parsing comprise the syntactic analysis (word syntax and phrase syntax, respectively), and in simple cases, these modules (the lexer and parser) can be automatically generated from a grammar for the language, though in more complex cases these require manual modification. The lexical grammar and phrase grammar are usually context-free grammars, which simplifies analysis significantly, with context-sensitivity handled at the semantic analysis phase. The semantic analysis phase is generally more complex and written by hand, but can be partially or fully automated using attribute grammars. These phases themselves can be further broken down: lexing as scanning and evaluating, and parsing as building a concrete syntax tree (CST, parse tree) and then transforming it into an abstract syntax tree (AST, syntax tree). In some cases additional phases are used, notably line reconstruction and preprocessing, but these are rare."

Answer given by text-davinci:

{
    "qnaPairs": [
        {
            "question": "What is the intermediate representation (IR) of a program?",
            "answer": "The intermediate representation (IR) of a program is an internal representation of the program built by the front end from the source code."
        },
        {
            "question": "What is a symbol table?",
            "answer": "A symbol table is a data structure mapping each symbol in the source code to associated information such as location, type and scope."
        },
        {
            "question": "What is the traditional method for implementing a frontend?",
            "answer": "The traditional method for implementing a frontend is to break it into several phases which may execute sequentially or concurrently. This method is favored due to its modularity and separation of concerns."
        },
        {
            "question": "What are the three phases of the frontend most commonly used today?",
            "answer": "The three phases of the frontend most commonly used today are lexical analysis (also known as lexing or scanning), syntax analysis (also known as scanning or parsing), and semantic analysis."
        },
        {
            "question": "What are the two components of syntactic analysis?",
            "answer": "The two components of syntactic analysis are lexing and parsing."
        },
        {
            "question": "What are the two types of grammars usually used for syntactic analysis?",
            "answer": "The two types of grammars usually used for syntactic analysis are context-free grammars and attribute grammars."
        },
        {
            "question": "What are the two steps of parsing?",
            "answer": "The two steps of parsing are building a concrete syntax tree (CST, parse tree) and then transforming it into an abstract syntax tree (AST, syntax tree)."
        },
        {
            "question": "Are there any additional phases used in the frontend process?",
            "answer": "In some cases additional phases are used in the frontend process, notably line reconstruction and preprocessing, but these are rare."
        }
    ]
}
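
The same call can also be made outside the playground with the openai package; a rough sketch (the API key is a placeholder, and max_tokens/temperature are just starting values):

import openai

openai.api_key = "sk-..."  # placeholder

prompt = (
    "I will provide you some text , You must understand it and then "
    "convert it into question and answer format in json , everything "
    "that you learned , here's the text\n\n"
    '"<paragraph to convert goes here>"'
)

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=512,
    temperature=0.7,
)
print(response["choices"][0]["text"])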
marianna13 commented 1 year ago

Oh I think it's actually cool. Do you think we can use some open-source models instead of davinci (e.g. bloom)?

wakaztahir commented 1 year ago

Never tested BLOOM; gonna give it a try. Can't make a million requests to OpenAI.

marianna13 commented 1 year ago

Yeah, I don't think OpenAI will be suitable for us at scale :)

wakaztahir commented 1 year ago

Just checked BLOOM; looks nice, should be able to do the job.

marianna13 commented 1 year ago

What prompt did you use?

wakaztahir commented 1 year ago

I mean, I didn't actually test it; I just looked it up and searched around. I'm making an assumption here, since it's also a really large language model.

marianna13 commented 1 year ago

Oh I see! I tried your prompt with BLOOM and it didn't work...
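
For anyone who wants to try it themselves, a minimal way to send the same prompt to BLOOM through the Hugging Face Inference API (the token is a placeholder and the generation parameters are illustrative):

import requests

API_URL = "https://api-inference.huggingface.co/models/bigscience/bloom"
HEADERS = {"Authorization": "Bearer hf_..."}  # placeholder HF token

def query_bloom(prompt, max_new_tokens=250):
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "temperature": 0.7},
    }
    resp = requests.post(API_URL, headers=HEADERS, json=payload)
    resp.raise_for_status()
    return resp.json()[0]["generated_text"]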

Shtoner commented 1 year ago

There is a data augmentation script that someone built, which takes TSV format, in the "scripts" section of Open-Assistant (https://github.com/LAION-AI/Open-Assistant/blob/main/scripts/data_augment/data_augment.py), and another one that is supposed to create QA format in the "notebooks" section (https://github.com/LAION-AI/Open-Assistant/blob/main/notebooks/closed-book-qa/T5_closed_book_QA_generators.py). I haven't run either yet, but I know the second one requires some pretty good hardware specs. To remove timestamps from a VTT, https://web.microsoftstream.com/VTTCleaner/CleanVTT.html works well; I'm currently seeing whether clean VTT to TSV is plausible.

I still believe we need to pick a place to list which data is being collected. I have added the PR to store datasets in the repo directly, but it is not yet approved, and everyone will have to wait until their own PR has been merged to update what is being done in data collection. This could slow people down, not only because of the wait for merging, but because people will expect fully scrubbed, ready-to-use data when those collecting and formatting it just need to know what else needs to be scraped and prepared. Maybe a dedicated Discord channel, data-collection, solely for listing which data has already been collected and is being formatted, while the current data-discussion stays for new data ideas and discussion?
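
The timestamp stripping can also be done locally; a minimal standard-library sketch (the file name is a placeholder):

import re

# WebVTT cue timing lines look like "00:00:01.000 --> 00:00:04.000".
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{3} --> ")

def vtt_to_text(path):
    kept = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Skip the WEBVTT header, cue numbers, timing lines, and blanks.
            if not line or line == "WEBVTT" or line.isdigit() or TIMESTAMP.match(line):
                continue
            kept.append(line)
    return " ".join(kept)

print(vtt_to_text("clip.vtt"))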

Shtoner commented 1 year ago

@yk I updated the PR earlier; yesterday you said to check whether it was still necessary. Up to you guys.

huu4ontocord commented 1 year ago

The problem with YT subtitles is that they're just plain text with no punctuation. Is there a way to convert that to dialog? @ontocord

You can just run QG/QA code over the text; I think it will work. I don't think you need to fix the text itself. Ping me on Discord. This notebook works amazingly well: https://github.com/LAION-AI/Open-Assistant/blob/main/notebooks/closed-book-qa/T5_closed_book_QA_generators.py

huu4ontocord commented 1 year ago

Can also use https://huggingface.co/felflare/bert-restore-punctuation. I trained one a long time ago for Vietnamese with mT5; it wasn't hard.
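
Per that model card, the rpunct wrapper handles the token-label decoding, so usage is roughly:

from rpunct import RestorePuncts

rpunct = RestorePuncts()
print(rpunct.punctuate(
    "right but i would say bring it on if you gaze into the abyss "
    "long enough you see the light not the darkness"
))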

huu4ontocord commented 1 year ago

@marianna13 running the lower-case, no-punctuation text through a modified version of the QA generator:

{100000: {'topics': [('jordan peterson', -1.37), ('philosophy', -1.96), ('psychology', -2.4), ('christianity', -3.07), ('I was wondering', -3.3)],
  'topic prepositions': [('On the topic of jordan peterson,', -2.23), ('In relation to jordan peterson,', -3.42), ('In reference to jordan peterson,', -3.85), ('Jordan Peterson of', -4.8), ("Jordan Peterson's career", -8.29)],
  'paragraph': 'right but i would say bring it on if you gaze into the abyss long enough you see the light not the darkness are you sure about that im betting my life on it the following is a conversation with jordan peterson an influential psychologist lecturer podcast host and author of maps of meaning rules for life and beyond order this is the lex fridman podcast to support it please check out our sponsors in the description and now dear friends heres jordan peterson dostoevsky wrote in the idiot spoken through the character of prince myshkin that beauty will save the world solzhenitsyn actually mentioned this in his nobel prize acceptance speech what do you think does dostoevsky meant by that was he right well i guess its the divine that saves the world lets say you could say that by definition and then you might say well are there',
  'QA_set': {
    0: {'question': 'How does the divine save the world?', 'answer_T5_ob': 'The divine saves the world by definition.', 'answer_T5_cb': 'The divine saves the world by bringing people to salvation.', 'answer_T5_cb_with_prefix': 'Jordan Peterson is a Christian minister and author.', 'answer_T5_answer': 'I believe the divine saves the world by definition.', 'answer_T5_answer_with_prefix': 'I believe the divine saves the world by definition.'},
    1: {'question': 'What do you have to do to see the light?', 'answer_T5_ob': 'Gaze into the abyss long enough and you will see the light.', 'answer_T5_cb': 'You have to see the light to see the light.', 'answer_T5_cb_with_prefix': 'Jordan Peterson is a singer and songwriter.', 'answer_T5_answer': "I don't know. I cannot tell you the answer with the information I have.", 'answer_T5_answer_with_prefix': "I don't know. I cannot tell you the answer with the information I have."},
    2: {'question': 'What does the narrator believe?', 'answer_T5_ob': 'The narrator believes the divine saves the world.', 'answer_T5_cb': 'The narrator believes in the power of prayer.', 'answer_T5_cb_with_prefix': 'The narrator believes that Jordan Peterson is a great basketball player.', 'answer_T5_answer': "I don't know. I cannot tell you the answer with the information I have.", 'answer_T5_answer_with_prefix': "I don't know. I cannot tell you the answer with the information I have."},
    3: {'question': 'What saves the world?', 'answer_T5_ob': 'The divine saves the world.', 'answer_T5_cb': 'The world is saved by the people who live in it.', 'answer_T5_cb_with_prefix': 'Jordan Peterson is a fictional character in the television series "The Walking Dead".', 'answer_T5_answer': "I don't know. I cannot tell you the answer with the information I have.", 'answer_T5_answer_with_prefix': "I don't know. I cannot tell you the answer with the information I have."},
    4: {'question': 'Where did Solzhenitsyn mention that beauty will save the world?', 'answer_T5_ob': 'Solzhenitsyn mentioned that beauty will save the world in his Nobel Prize acceptance speech.', 'answer_T5_cb': 'Solzhenitsyn mentioned that beauty will save the world in his book The Cancer Ward.', 'answer_T5_cb_with_prefix': 'Solzhenitsyn mentioned that beauty will save the world in his book The Cancer Ward.', 'answer_T5_answer': 'I believe Solzhenitsyn mentioned that beauty will save the world in his Nobel Prize acceptance speech.', 'answer_T5_answer_with_prefix': 'I believe Solzhenitsyn mentioned that beauty will save the world in his Nobel Prize acceptance speech.'},
    5: {'question': 'What is the purpose of the podcast?', 'answer_T5_ob': 'The purpose of the podcast is to support it please check out our sponsors in the description.', 'answer_T5_cb': 'The purpose of the podcast is to provide entertainment.', 'answer_T5_cb_with_prefix': 'The purpose of the podcast is to discuss the latest news and events in the world of sports.', 'answer_T5_answer': "I don't know. I cannot tell you the answer with the information I have.", 'answer_T5_answer_with_prefix': "I don't know. I cannot tell you the answer with the information I have."},
    6: {'question': 'What does Dostoevsky say will save the world?', 'answer_T5_ob': 'Dostoevsky says that beauty will save the world.', 'answer_T5_cb': 'The world will be saved by the rebirth of the human spirit.', 'answer_T5_cb_with_prefix': 'The world will be saved by the love of God, Dostoevsky said.', 'answer_T5_answer': "I don't know. I cannot tell you the answer with the information I have.", 'answer_T5_answer_with_prefix': "I don't know. I cannot tell you the answer with the information I have."},
    7: {'question': 'What is the name of the person who is being interviewed?', 'answer_T5_ob': 'Jordan Peterson is being interviewed.', 'answer_T5_cb': 'NA.', 'answer_T5_cb_with_prefix': 'Jordan Peterson is a professional basketball player.', 'answer_T5_answer': "I don't know. I cannot tell you the answer with the information I have.", 'answer_T5_answer_with_prefix': "I don't know. I cannot tell you the answer with the information I have."},
    8: {'question': 'What did Dostoevsky say in The Idiot?', 'answer_T5_ob': 'Dostoevsky said in The Idiot that beauty will save the world.', 'answer_T5_cb': 'In The Idiot, Dostoevsky says that the only way to be happy is to be a fool.', 'answer_T5_cb_with_prefix': 'Dostoevsky said in The Idiot that "the only thing that is necessary for the abolition of the aristocracy is the abolition of the aristocracy', 'answer_T5_answer': "I don't know. I cannot tell you the answer with the information I have.", 'answer_T5_answer_with_prefix': "I don't know. I cannot tell you the answer with the information I have."},
    9: {'question': 'Where can you check out our sponsors?', 'answer_T5_ob': 'You can check out our sponsors in the description.', 'answer_T5_cb': 'You can check out our sponsors on our website.', 'answer_T5_cb_with_prefix': 'Jordan Peterson is a professional basketball player.', 'answer_T5_answer': 'You can check out our sponsors in the description.', 'answer_T5_answer_with_prefix': 'You can check out our sponsors in the description.'},
    10: {'question': 'Who is the psychologist?', 'answer_T5_ob': 'Jordan Peterson is a psychologist.', 'answer_T5_cb': 'The psychologist is Dr. Smith.', 'answer_T5_cb_with_prefix': 'Jordan Peterson is a psychologist.', 'answer_T5_answer': "I don't know. I cannot tell you the answer with the information I have.", 'answer_T5_answer_with_prefix': 'Jordan Peterson is a psychologist.'},
    11: {'question': 'Who wrote in the idiot that beauty will save the world?', 'answer_T5_ob': 'In The Idiot, Dostoevsky wrote that beauty will save the world.', 'answer_T5_cb': 'The Idiot That Beauty Will Save the World is a novel by American author John Irving.', 'answer_T5_cb_with_prefix': 'Jordan Peterson wrote In the Idiot That Beauty Will Save the World.', 'answer_T5_answer': "I don't know. I cannot tell you the answer with the information I have.", 'answer_T5_answer_with_prefix': "I don't know. I cannot tell you the answer with the information I have."}}},
 100001: {'topics': [('beauty', -0.99), ('love', -1.65), ('religion', -2.11), ('philosophy', -2.64), ('I was wondering', -3.3)],
  'topic prepositions': [('In terms of beauty,', -2.74), ('In relation to beauty,', -3.53), ('About beauty', -4.08), ('As to beauty,', -6.71), ('For the beauty of the day,', -9.22)],
  'paragraph': 'pointers to that which will save the world or that which eternally saves the world and the answer to that in all likelihood is yes and thats maybe truth and love and justice and the classical virtues beauty perhaps in some sense foremost among them thats a difficult case to make but definitely a pointer which direction is the arrow pointing well the arrows pointing up and no i think that that which it points to is what beauty points to it transcends beauty its more than beauty and that speaks to the divine ',
  'QA_set': {
    0: {'question': 'What direction is the arrow pointing?', 'answer_T5_ob': 'The arrow is pointing up.', 'answer_T5_cb': 'The arrow is pointing to the right.', 'answer_T5_cb_with_prefix': 'The arrow is pointing to the right.', 'answer_T5_answer': 'I believe the arrow is pointing up.', 'answer_T5_answer_with_prefix': 'I believe the arrow is pointing up.'},
    1: {'question': 'What is the arrow pointing to?', 'answer_T5_ob': 'The arrow is pointing to beauty.', 'answer_T5_cb': 'The arrow is pointing to a person.', 'answer_T5_cb_with_prefix': 'The arrow is pointing to the beauty of the flower.', 'answer_T5_answer': "I don't know. I cannot tell you the answer with the information I have.", 'answer_T5_answer_with_prefix': 'I believe the arrow is pointing to beauty.'},
    2: {'question': 'What is the difficulty of the case?', 'answer_T5_ob': 'The case is a difficult case to make.', 'answer_T5_cb': 'The case is very difficult because the defendant is a convicted felon.', 'answer_T5_cb_with_prefix': 'The difficulty of the case is the difficulty of the case.', 'answer_T5_answer': "I don't know. I cannot tell you the answer with the information I have.", 'answer_T5_answer_with_prefix': "I don't know. I cannot tell you the answer with the information I have."},
    3: {'question': 'What is the classical virtue that points to that which eternally saves the world?', 'answer_T5_ob': 'Beauty is the classical virtue that points to that which eternally saves the world.', 'answer_T5_cb': 'The classical virtue that points to that which eternally saves the world is piety.', 'answer_T5_cb_with_prefix': 'Beauty is the classical virtue that points to that which eternally saves the world.', 'answer_T5_answer': "I don't know. I cannot tell you the answer with the information I have.", 'answer_T5_answer_with_prefix': 'Beauty is the classical virtue that points to that which eternally saves the world.'},
    4: {'question': "What is the narrator's opinion of beauty?", 'answer_T5_ob': 'The narrator believes beauty transcends beauty.', 'answer_T5_cb': 'The narrator thinks that beauty is subjective.', 'answer_T5_cb_with_prefix': 'The narrator thinks that beauty is in the eye of the beholder.', 'answer_T5_answer': 'I believe the narrator believes beauty transcends beauty.', 'answer_T5_answer_with_prefix': 'I believe the narrator believes beauty transcends beauty.'},
    5: {'question': 'What does beauty point to?', 'answer_T5_ob': 'Beauty points to something more than beauty and that speaks to the divine.', 'answer_T5_cb': 'Beauty is a sign of the divine.', 'answer_T5_cb_with_prefix': 'Beauty is a concept that is used to describe the attractiveness of a person, place, or thing.', 'answer_T5_answer': 'Beauty points to something more than beauty and that speaks to the divine.', 'answer_T5_answer_with_prefix': 'Beauty points to something more than beauty and that speaks to the divine.'}}}}

huu4ontocord commented 1 year ago

Feel free to adopt and change as you see fit. I can explain more in discord.

# -*- coding: utf-8 -*-
# This notebook will run on a system with a single RTX3090 (24 GB vram).
# You need to install accelerate, bitsandbytes, and transformers

import math
import pickle
import time

# load all needed libraries
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, AutoModel

# This device map will work on a GPU with > 24 GB vram.
# It uses nearly all the memory.
device_map_T5_13B = {
    "shared": 0,
    "decoder.embed_tokens": 0,
    "encoder.embed_tokens": 0,
    "encoder.block.0": 0,
    "encoder.block.1": 0,
    "encoder.block.2": 0,
    "encoder.block.3": 0,
    "encoder.block.4": 0,
    "encoder.block.5": 0,
    "encoder.block.6": 0,
    "encoder.block.7": 0,
    "encoder.block.8": 0,
    "encoder.block.9": 0,
    "encoder.block.10": 0,
    "encoder.block.11": 0,
    "encoder.block.12": 0,
    "encoder.block.13": 0,
    "encoder.block.14": 0,
    "encoder.block.15": 0,
    "encoder.block.16": 0,
    "encoder.block.17": 0,
    "encoder.block.18": 0,
    "encoder.block.19": 0,
    "encoder.block.20": 0,
    "encoder.block.21": 0,
    "encoder.block.22": 0,
    "encoder.block.23": 0,
    "encoder.final_layer_norm": 0,
    "encoder.dropout": 0,
    "decoder.block.0": 0,
    "decoder.block.1": 0,
    "decoder.block.2": 0,
    "decoder.block.3": 0,
    "decoder.block.4": 0,
    "decoder.block.5": 0,
    "decoder.block.6": 0,
    "decoder.block.7": 0,
    "decoder.block.8": 0,
    "decoder.block.9": 0,
    "decoder.block.10": 0,
    "decoder.block.11": 0,
    "decoder.block.12": 0,
    "decoder.block.13": 0,
    "decoder.block.14": 0,
    "decoder.block.15": 0,
    "decoder.block.16": 0,
    "decoder.block.17": 0,
    "decoder.block.18": 0,
    "decoder.block.19": 0,
    "decoder.block.20": 0,
    "decoder.block.21": 0,
    "decoder.block.22": 0,
    "decoder.block.23": 0,
    "decoder.final_layer_norm": 0,
    "decoder.dropout": 0,
    "lm_head": 0,
}

# Load the model in bfloat16. Make sure to use bfloat16
# if you are doing inference with 16bit precision.
# Skip reloading if the models are already in memory
# (useful when re-running notebook cells).
try:
  tokenizer
except NameError:
  tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xxl")
  model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-xxl",
    device_map=device_map_T5_13B,
    torch_dtype=torch.bfloat16,
    load_in_8bit=False,
)

  minilm_tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
  minilm_model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2').half().eval().cuda()  

# Load strings as knowledge sources for QA generation.
# You can do this with a pickle.
#objects = []
#with (open("paragraphs.pkl", "rb")) as openfile:
#    while True:
#        try:
#            objects.append(pickle.load(openfile))
#        except EOFError:
#            break
paragraphs = ["right but i would say bring it on if you gaze into the abyss long enough you see the light not the darkness are you sure about that im betting my life on it the following is a conversation with jordan peterson an influential psychologist lecturer podcast host and author of maps of meaning  rules for life and beyond order this is the lex fridman podcast to support it please check out our sponsors in the description and now dear friends heres jordan peterson dostoevsky wrote in the idiot spoken through the character of prince myshkin that beauty will save the world solzhenitsyn actually mentioned this in his nobel prize acceptance speech what do you think does dostoevsky meant by that was he right well i guess its the divine that saves the world lets say you could say that by definition and then you might say well are there", 
"pointers to that which will save the world or that which eternally saves the world and the answer to that in all likelihood is yes and thats maybe truth and love and justice and the classical virtues beauty perhaps in some sense foremost among them thats a difficult case to make but definitely a pointer which direction is the arrow pointing well the arrows pointing up and no i think that that which it points to is what beauty points to it transcends beauty its more than beauty and that speaks to the divine "]
# Make sure no paragraphs are too long for T5.
# It handles up to 512 tokens context length.
fixed_paragraphs = []
for k in paragraphs:
    if len(k) > 1100:
        k = k[:1100]

    fixed_paragraphs.append(k)
print("Original number of paragraphs:", len(paragraphs))
print("Length filtered number of paragraphs:", len(fixed_paragraphs))
paragraphs = fixed_paragraphs

# Sort_Tuple sorts a list of tuples
# by the second element.
def Sort_Tuple(tup):
    tup.sort(key=lambda x: x[1], reverse=True)
    return tup

# ask_flan_T5 takes a text input and returns the
# response of FLAN_T5 and a normalized logits
# score for the generation.
def ask_flan_T5(input_text):
    inputs = tokenizer.encode(input_text, return_tensors="pt").cuda(0)
    outputs = model.generate(
        inputs,
        do_sample=True,
        top_p=0.95,
        eos_token_id=1,
        max_new_tokens=50,
        bos_token_id=0,
        temperature=0.9,
        return_dict_in_generate=True,
        output_scores=True,
    )
    out_text = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
    probs = torch.stack(outputs.scores, dim=1).softmax(-1)
    for i in outputs.sequences:
        logprobs = 0
        counter = 0
        for k in i[1:]:
            word_prob = (round(probs[0][counter][k.item()].item(), 2)) + 0.001
            logprobs = logprobs + math.log(word_prob)
            counter += 1
        out_tuple = (out_text, round(logprobs, 2))
    return out_tuple

# ask_flan_T5D is a function that takes an input text and
# returns the deterministic(do_sample=False) output of
# FLAN_T5 and logits.
def ask_flan_T5D(input_text):
    inputs = tokenizer.encode(input_text, return_tensors="pt").cuda(0)
    outputs = model.generate(
        inputs,
        do_sample=False,
        eos_token_id=1,
        max_new_tokens=50,
        bos_token_id=0,
        return_dict_in_generate=True,
        output_scores=True,
    )
    out_text = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
    probs = torch.stack(outputs.scores, dim=1).softmax(-1)
    for i in outputs.sequences:
        logprobs = 0
        counter = 0
        for k in i[1:]:
            word_prob = (round(probs[0][counter][k.item()].item(), 2)) + 0.001
            logprobs = logprobs + math.log(word_prob)
            counter += 1
        out_tuple = (out_text, round(logprobs, 2))
    return out_tuple

# Generate a topic classifier for a paragraph of text
def generate_topic(paragraph):
    results = set()
    input_text = (
        "Task: Create a topic classifier for the provided \
        paragraph.\nParagraph:\n"
        + paragraph
        + "\nTopic: "
    )
    for k in range(0, 20):
        result = ask_flan_T5(input_text)
        if result[1] > -4:
            results.add(result)
        if len(results) < 3:
            results.add(("I was wondering", -3.3))
            results.add(("I have a question", -3.3))
    sorted_results = Sort_Tuple(list(results))
    return sorted_results[0:5]

# Generate a topic classifier for a paragraph of text
def generate_topic_prefix(topic_set):
    results = set()
    for entry in topic_set:
        topic = entry[0]
        input_text = (
            "Task: Create a prepositional phrase about the topic.\n\
            Example 1\n Topic: climbing mount everest\nPrepositional \
            Phrase: With regards to climbing mount everest,\nExample \
            2\nTopic: United States Air Force\nPrepositional Phrase: \
            On the topic of the United States Air Force,\n Example 3\nTopic: "
            + topic
            + "\nPrepositional Phrase: "
        )
        for k in range(0, 5):
            results.add(ask_flan_T5(input_text))
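        # Note: this return sits inside the outer loop, so only the
        # top-ranked topic is expanded into prepositional phrases.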
        sorted_results = Sort_Tuple(list(results))
        return sorted_results[0:5]

# Generate who/what/where/when/why questions from a paragraph.
# Number of questions variable is an integer which indicates how
# many of each question type to try to generate.
def generate_questions(paragraph, number_of_questions):
    if len(tokenizer.encode(paragraph)) > 480:
        print("Warning, the context length is too long.")
    question_set = set()
    question_types = [
        "What",
        "Where",
        "Why",
        "How",
        "Who",
        "How much",
        "When",
        "Which"
    ]
    for qtype in question_types:
        question = (
            "Please generate a question that starts with '"
            + qtype
            + "' based on the following paragraph.\nText:\n"
            + paragraph
            + "\nQuestion:\n"
        )
        for k in range(0, number_of_questions):
            new_question = ask_flan_T5(question)
            if qtype in new_question[0]:
                question_set.add((qtype, new_question))
    return question_set

# Generate answers for a set of questions.
# Input is the paragraph of text and a set of questions where each question
# is a tuple generated from the generate_questions() function.
def generate_answers(paragraph, question_set):
    possible_answers = set()
    for question in question_set:
        input_text = (
            "Please read the following paragraph and \
            then answer the question using only data \
            found in the text. If no answer is possible, respond \
            'NA'.\nText:\n"
            + paragraph
            + "\nQuestion:\n"
            + question[1][0]
            + "\nAnswer:\n"
        )
        answer = ask_flan_T5D(input_text)
        possible_answers.add((question[0], question[1], answer))
    return possible_answers

# Generate questions from a paragraph and set of answers.
# Input is the paragraph of text and a set of answers where each question
# is a tuple generated from the generate_answers() function.
def generate_question2(paragraph, qa_set):
    qaq_results = set()
    for qa_item in qa_set:
        answer = qa_item[2][0]
        input_text = (
            "Please read the following paragraph and \
            then generate a question whose answer is: "
            + answer
            + "\nParagraph:\n"
            + paragraph
            + "\nQuestion:\n"
        )
        result = ask_flan_T5D(input_text)
        qaq_results.add((qa_item[0], qa_item[1], qa_item[2], result))
    return qaq_results

# Generate answers from a paragraph and set of questions.
# Input is the paragraph of text and a set of questions where each answer
#  is a tuple generated from the generate_questions2() function.
def generate_answers2(paragraph, question_set):
    possible_answers = set()
    for question in question_set:
        input_text = (
            "Please read the following paragraph and \
            then answer the question using only data \
            found in the text. If no answer is possible, respond \
            'NA'.\nText:\n"
            + paragraph
            + "\nQuestion:\n"
            + question
            + "\nAnswer:\n"
        )
        answer = ask_flan_T5D(input_text)
        possible_answers.add((question, answer))
    return possible_answers

# Generate declarative statement from question and answer pair.
def generate_declarative(qaq_set):
    qaqd_results = set()
    for qa_item in qaq_set:
        question = qa_item[0]
        answer = qa_item[1][0]
        if "NA" in answer:
            qaqd_results.add((question, answer,  qa_item[1]))
        else:
            input_text = (
                "Generate a declarative statement based on the \
                given question and answer pair.\nQ: What is \
                sitting on the couch?\nA: poodle\nA poodle is \
                sitting on the couch.\nQ: "
                + question
                + "\nA: "
                + answer
                + "\n"
            )
            result = ask_flan_T5D(input_text)
            qaqd_results.add((question, answer, result))
    return qaqd_results

# Generate closed book answer to question.
def generate_closed_answer(qaqd_set, topic_prefix):
    if topic_prefix:
      topic_prefix= [a[0] for a in topic_prefix]
      topic_prefix.sort(key=lambda a: len(a[0]), reverse=True)
      topic_prefix = topic_prefix[0]
    else:
      topic_prefix = None
    qaqd_results = set()
    for qa_item in qaqd_set:
        question = qa_item[0]
        answer = qa_item[2][0]
        if "NA" in answer:
            # print(answer)
            if len(qa_item) == 3:
              qaqd_results.add((qa_item[0], qa_item[1], qa_item[2], qa_item[2]))
            else:
              qaqd_results.add((qa_item[0], qa_item[1], qa_item[2], qa_item[2],  qa_item[2]))
            pass
        else:
            input_text = (
                "Task: Answer the question in a detailed fashion. \
                If the question cannot be answered without more \
                information, please answer NA.\nExample 1:\nQuestion: \
                Why does Shala like cookies?\nAnswer: It is not possible \
                to know why Shala likes cookies without more information, \
                but many people that like cookies enjoy their taste or \
                some of their ingredients (e.g. chocolate chips or \
                peanut butter).\nExample 2:\nQuestion: Why would someone \
                vote in an election?\nAnswer: There are many reasons \
                someone might vote in an election, for instance to have \
                their voice heard or to help a candidate they like win the \
                race.\nExample 3\nQuestion: What decoration goes on top of \
                a Christmas tree?\nAnswer: Usually a star is placed at the \
                top of a Christmas tree.\nExample 4:\nQuestion: "
                + (question if topic_prefix is None else (topic_prefix + " " + question))
                + "\nAnswer: "
            )
            result = ask_flan_T5D(input_text)
            if len(qa_item) == 3:
              qaqd_results.add((qa_item[0], qa_item[1], qa_item[2], result))
            else:
              qaqd_results.add((qa_item[0], qa_item[1], qa_item[2], qa_item[3], result))
    return qaqd_results

def mean_pooling(model_output, attention_mask):
    with torch.no_grad():
      token_embeddings = model_output.last_hidden_state
      input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).to(token_embeddings.dtype)
      return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

from torch.nn.functional import cosine_similarity

# Create a dictionary of questions and answers from a list of paragraphs.
# Takes about 20 seconds per paragraph to process.
start_time = time.perf_counter()
questions_dict = {}
uniq_id = 100000
for paragraph in paragraphs[0:1500]:
    topic_list = generate_topic(paragraph)
    topic_prefix = generate_topic_prefix(topic_list)
    question_set = generate_questions(paragraph, 2)
    qa_set = generate_answers(paragraph, question_set)
    qaq_set = generate_question2(paragraph, qa_set)
    q2_set = set()
    for q in qaq_set:
        q2_set.add(q[3][0])
    q2a2_set = generate_answers2(paragraph, q2_set)
    a2d_set = generate_declarative(q2a2_set)
    a3cb_set = generate_closed_answer(a2d_set, None)
    a3cb_set = generate_closed_answer(a3cb_set, topic_prefix)
    questions_dict[uniq_id] = {}
    questions_dict[uniq_id]["topics"] = topic_list
    questions_dict[uniq_id]["topic prepositions"] = topic_prefix
    questions_dict[uniq_id]["paragraph"] = paragraph
    entry_count = 0
    entry_dict = {}
    for entry in a3cb_set:
        entry_dict[entry_count] = {}
        entry_dict[entry_count]["question"] = entry[0]
        entry_dict[entry_count]["answer_T5_ob"] = entry[2][0]
        entry_dict[entry_count]["answer_T5_cb"] = entry[3][0]
        entry_dict[entry_count]["answer_T5_cb_with_prefix"] = entry[4][0]
        if entry_dict[entry_count]["answer_T5_ob"] == "NA":
          entry_dict[entry_count]["answer_T5_answer"] = "Either I do not understand this question, or this question cannot be answered."
        else:
          toks = minilm_tokenizer(entry_dict[entry_count]["answer_T5_ob"], padding=True, truncation=True, return_tensors="pt").to('cuda')
          dat = minilm_model(**toks)
          dat = mean_pooling(dat, toks.attention_mask)
          cb_answer = entry_dict[entry_count]["answer_T5_cb"]

          toks = minilm_tokenizer(cb_answer, padding=True, truncation=True, return_tensors="pt").to('cuda')
          dat2 = minilm_model(**toks)
          dat2 = mean_pooling(dat2, toks.attention_mask)
          score = cosine_similarity(dat, dat2).item()
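          # Hedge based on how well the closed-book answer agrees with the
          # open-book one: low similarity -> refuse, middling -> "maybe" /
          # "I believe", high -> use the open-book answer as-is.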
          if score < 0.75:
             entry_dict[entry_count]["answer_T5_answer"] = "I don't know. I cannot tell you the answer with the information I have."
          elif score < 0.8:
            if entry_dict[entry_count]["answer_T5_ob"].split()[0].lower() in {'the', 'this', 'a', 'an'}:
               entry_dict[entry_count]["answer_T5_answer"] = "I don't know for certain, but maybe "+entry_dict[entry_count]["answer_T5_ob"][0].lower()+ entry_dict[entry_count]["answer_T5_ob"][1:]          
            else:
               entry_dict[entry_count]["answer_T5_answer"] = "I don't know for certain, but maybe "+ entry_dict[entry_count]["answer_T5_ob"]
          elif score < 0.9:
            if entry_dict[entry_count]["answer_T5_ob"].split()[0].lower() in {'the', 'this', 'a', 'an'}:
               entry_dict[entry_count]["answer_T5_answer"] = "I believe "+ entry_dict[entry_count]["answer_T5_ob"][0].lower()+ entry_dict[entry_count]["answer_T5_ob"][1:]          
            else:
               entry_dict[entry_count]["answer_T5_answer"] = "I believe "+ entry_dict[entry_count]["answer_T5_ob"]
          else:
            entry_dict[entry_count]["answer_T5_answer"] = entry_dict[entry_count]["answer_T5_ob"] 
          entry_dict[entry_count]["answer_T5_answer_with_prefix"]  = entry_dict[entry_count]["answer_T5_answer"] 
          if len(cb_answer) < len(entry_dict[entry_count]["answer_T5_cb_with_prefix"]):
            cb_answer = entry_dict[entry_count]["answer_T5_cb_with_prefix"]

            toks = minilm_tokenizer(cb_answer, padding=True, truncation=True, return_tensors="pt").to('cuda')
            dat2 = minilm_model(**toks)
            dat2 = mean_pooling(dat2, toks.attention_mask)
            if score < cosine_similarity(dat, dat2).item():
              if cosine_similarity(dat, dat2).item() < 0.75:
                entry_dict[entry_count]["answer_T5_answer_with_prefix"] = "I don't know. I cannot tell you the answer with the information I have."
              elif cosine_similarity(dat, dat2).item() < 0.8:
                if entry_dict[entry_count]["answer_T5_ob"].split()[0].lower() in {'the', 'this', 'a', 'an'}:
                  entry_dict[entry_count]["answer_T5_answer_with_prefix"] = "I don't know for certain, but maybe "+entry_dict[entry_count]["answer_T5_ob"][0].lower()+ entry_dict[entry_count]["answer_T5_ob"][1:]          
                else:
                  entry_dict[entry_count]["answer_T5_answer_with_prefix"] = "I don't know for certain, but maybe "+ entry_dict[entry_count]["answer_T5_ob"]
              elif cosine_similarity(dat, dat2).item() < 0.9:
                if entry_dict[entry_count]["answer_T5_ob"].split()[0].lower() in {'the', 'this', 'a', 'an'}:
                  entry_dict[entry_count]["answer_T5_answer_with_prefix"] = "I believe "+ entry_dict[entry_count]["answer_T5_ob"][0].lower()+ entry_dict[entry_count]["answer_T5_ob"][1:]          
                else:
                  entry_dict[entry_count]["answer_T5_answer_with_prefix"] = "I believe "+ entry_dict[entry_count]["answer_T5_ob"]
              else:
                entry_dict[entry_count]["answer_T5_answer_with_prefix"] = entry_dict[entry_count]["answer_T5_ob"] 

             #'**', entry_dict[entry_count]["answer_T5_ob"], '**', entry_dict[entry_count]["answer_T5_cb"])
        entry_count += 1
    questions_dict[uniq_id]["QA_set"] = entry_dict
    uniq_id += 1
    print(uniq_id, "topics:", topic_prefix)

stop_time = time.perf_counter()
generation_time = stop_time - start_time
print(questions_dict[uniq_id - 1])
print(generation_time)

import json
with open("questions_dict.jsonl", "w") as output:
  for qd in questions_dict.values():  
    output.write(json.dumps(qd)+"\n")
!cp questions_dict.jsonl /content/drive/Shareddrives/LAION

marianna13 commented 1 year ago

It's very nice! Thank you! What is your discord @?

totuta commented 1 year ago

@yk, I am back on this issue. Are you available for a discussion? To briefly share the preliminary results:

  • one or more Q&A pairs can be generated from YouTube auto-generated subs + GPT-3 (davinci-002 or 003)

    • 003 is usually better, with more Q&A pairs, but not always

    • the quality and number of pairs are very brittle

    • @wakaztahir tested BLOOM and didn't get a good result, but a different prompt might work; I will give it a try

yk commented 1 year ago

@yk, I am back on this issue. Are you available for a discussion? To briefly share the preliminary results:

  • one or more Q&A pairs can be generated from YouTube auto-generated subs + GPT-3 (davinci-002 or 003)

    • 003 is usually better, with more Q&A pairs, but not always

    • the quality and number of pairs are very brittle

    • @wakaztahir tested BLOOM and didn't get a good result, but a different prompt might work; I will give it a try

Sure. Keep in mind that OpenAI's TOS don't allow creating training data using their models, as far as I know, so we want to stay away from that. DM me on Discord for further discussion; I easily miss stuff here on GH.

totuta commented 1 year ago

@Shtoner I also have downloaded JRE and Lex Fridman, as well as other podcasts and shows. I can send you a list of channels I scraped if you want, but it's quite long ;) @totuta sure! Maybe we will add a folder like data scripts

Can you share the list of channels? You can DM it to me on Discord.

CheckMC commented 1 year ago

How are you planning to choose which videos to take transcripts from? Manually choosing, or having users submit? I'm guessing that the more high-quality videos we have, the better.

marianna13 commented 1 year ago

Hi @totuta! I tried with T5 (base) and the results depend on the length of the context in most cases. Here are some examples: https://huggingface.co/datasets/marianna13/yt-qa

totuta commented 1 year ago

Hi @yk, @ontocord, @marianna13, @CheckMC, and @Vechtomov,

I have been working on this idea of converting the HowTo100M dataset into pairs of an instruction (asking how to do something) and a response of many sentences.

As HowTo100M was originally designed for linking bags-of-words to (short) video clips, we need a few twists for our purpose. Those steps are:

  1. Get video_ids from HowTo100M.
  2. Get the title and "convert" it into a request form, probably using a generative language model. A smaller, open-source model will do the job (no need to use OpenAI's).
    • e.g. Title: Dermatology Treatments : How to Get Rid of Back Acne Fast --> "What are the ways to quickly eliminate back acne through dermatological treatments?"
  3. Get the subtitles.
  4. That is now one instruction-response datapoint.

I think it will work for us. Although the "response" side is a bit too long (I don't have length stats yet), ChatGPT also generates very lengthy responses. And the number of datapoints is 1,241,005, which is not a bad number at all. A condensed sketch of these steps is at the end of this comment.

Please check the first version of prepare.py below. All remaining TODOs are straightforward. https://github.com/totuta/Open-Assistant/blob/259_youtube_subs_dataset/data/datasets/youtube_subs/prepare.py

Please do provide feedback, because I want to fix any issues as early as possible.
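
Here is the condensed sketch of steps 1-4 (the CSV file/column names and the FLAN-T5 rephrasing model are assumptions for illustration; the linked prepare.py is the actual implementation):

import csv

from transformers import pipeline
from youtube_transcript_api import YouTubeTranscriptApi

# Step 2: title -> request conversion with a small open model.
rephrase = pipeline("text2text-generation", model="google/flan-t5-base")

def title_to_request(title):
    prompt = f"Rephrase this video title as a question asking how to do it: {title}"
    return rephrase(prompt, max_new_tokens=40)[0]["generated_text"]

def build_datapoint(video_id, title):
    # Step 3: fetch the subtitles and join them into one long response.
    transcript = YouTubeTranscriptApi.get_transcript(video_id)
    response = " ".join(chunk["text"] for chunk in transcript)
    # Step 4: one instruction-response datapoint.
    return {"instruction": title_to_request(title), "response": response}

# Step 1: iterate over the HowTo100M metadata (assumed file/column names).
with open("HowTo100M_v1.csv") as f:
    for row in csv.DictReader(f):
        datapoint = build_datapoint(row["video_id"], row["task_description"])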

davide221 commented 1 year ago

Would love to take part in this task. Is that possible?

totuta commented 1 year ago

Would love to take part in this task. Is that possible?

@davide221 I have almost finished the first version of youtube_subs_howto100M. I'll PR it tomorrow; we can think about how to collaborate after that.