QiaolingChen00 commented 1 year ago

Reference: https://github.com/tatsu-lab/stanford_alpaca

QiaolingChen00 commented 1 year ago

4.16 (Sun) 21:00

[x] Training complete (chen)
[x] Dataset preparation (yang, wang, gan)
- [x] Sample code for fetching data in batches via the API: https://github.com/Chenqll/NUS_plp_project/issues/1#issuecomment-1506529446
- Expected data format:

```json
[
  {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule."
  },
  {
    "instruction": "What are the three primary colors?",
    "input": "",
    "output": "The three primary colors are red, blue, and yellow."
  },
  {
    "instruction": "Describe the structure of an atom.",
    "input": "",
    "output": "An atom is made up of a nucleus, which contains protons and neutrons, surrounded by electrons that travel in orbits around the nucleus. The protons and neutrons have a positive charge, while the electrons have a negative charge, resulting in an overall neutral atom. The number of each particle determines the atomic number and the type of atom."
  }
]
```

[x] Background, business objective, context (shen)
[x] Slides template provided (gan)
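
Since several people are preparing data, a quick check that every record matches the expected format may save a failed training run. The following is a minimal sketch, not part of the original plan; the helper name `validate_records` is hypothetical, and only the three field names come from the format shown above.

```python
import json

# Field names taken from the expected data format; everything else here is illustrative.
REQUIRED_KEYS = ("instruction", "input", "output")


def validate_records(records):
    """Return a list of (index, problem) pairs; an empty list means every record is well-formed."""
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append((i, "not a JSON object"))
            continue
        for key in REQUIRED_KEYS:
            if key not in rec:
                problems.append((i, f"missing key: {key}"))
            elif not isinstance(rec[key], str):
                problems.append((i, f"{key} is not a string"))
        # An instruction made only of whitespace is almost certainly a mistake.
        if isinstance(rec.get("instruction"), str) and not rec["instruction"].strip():
            problems.append((i, "empty instruction"))
    return problems


sample = json.loads("""
[
  {"instruction": "What are the three primary colors?", "input": "",
   "output": "The three primary colors are red, blue, and yellow."},
  {"instruction": "Describe the structure of an atom.", "input": ""}
]
""")
print(validate_records(sample))  # the second record has no "output" field
```

Running the same function over a full `results.json` (loaded with `json.load`) would list every malformed record before the files are merged.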

Shaniaee commented 1 year ago

4.22 (Sat) 21:00

[ ] UI + cloud (shen)
[x] slides draft (all)
[x] model evaluation design (wq, yyc, gyq): 100 test questions (not used in training); compare with GPT-3.5, GPT-4, and Alpaca; questions about Plato?
[x] model evaluation (cql)
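
One way to guarantee that the 100 test questions are not used in training is to split them off before training starts. This is only a sketch under assumptions: the function name `split_holdout`, the fixed seed, and the synthetic records are illustrative and not taken from the project.

```python
import random


def split_holdout(records, n_test=100, seed=0):
    """Shuffle a copy of records and return (train, test) with n_test held-out items."""
    if n_test > len(records):
        raise ValueError("n_test exceeds the number of records")
    shuffled = list(records)
    # A seeded generator keeps the split reproducible across team members.
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


# Synthetic stand-in for the real dataset (500 placeholder records).
records = [{"instruction": f"question {i}", "input": "", "output": ""} for i in range(500)]
train, test = split_holdout(records)

# Sanity check: no test instruction appears in the training split.
train_instructions = {r["instruction"] for r in train}
assert all(r["instruction"] not in train_instructions for r in test)
print(len(train), len(test))  # 400 100
```

Fixing the seed means everyone who runs the split gets the same 100 test items, so results from different models stay comparable.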

Shaniaee commented 1 year ago

4.25 (Tue) 21:00

[x] slides complete (all)

Shaniaee commented 1 year ago

4.27 Slides submission

QiaolingChen00 commented 1 year ago

Data generation (deprecated)

1. Generate your own API key here and copy it: https://platform.openai.com/account/api-keys
2. Open a terminal and run `pip install llama-index`, `pip install langchain`, and `git clone https://github.com/irina1nik/context_data.git`.

3. Create a new Python file named test.py and paste in the code below.


```python
from llama_index import SimpleDirectoryReader, GPTListIndex, readers, GPTSimpleVectorIndex, LLMPredictor, PromptHelper, ServiceContext
from langchain import OpenAI
import sys
import os
from IPython.display import Markdown, display


def construct_index(directory_path):
    # set maximum input size
    max_input_size = 4096
    # set number of output tokens
    num_outputs = 2000
    # set maximum chunk overlap
    max_chunk_overlap = 20
    # set chunk size limit
    chunk_size_limit = 600

    # define prompt helper
    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)

    # define LLM
    llm_predictor = LLMPredictor(llm=OpenAI(temperature=0.5, model_name="text-davinci-003", max_tokens=num_outputs))

    documents = SimpleDirectoryReader(directory_path).load_data()

    service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)
    index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)

    index.save_to_disk('index.json')

    return index


def ask_ai():
    index = GPTSimpleVectorIndex.load_from_disk('index.json')
    while True:
        query = input("What do you want to ask? ")
        response = index.query(query)
        display(Markdown(f"Response: {response.response}"))


os.environ["OPENAI_API_KEY"] = input("Paste your OpenAI key here and hit enter:")

construct_index("context_data/data")

ask_ai()
```


4. Run `python test.py` on the command line.

5. When prompted, enter the API key you copied in step 1.
6. Then ask any question and you will get a response.

QiaolingChen00 commented 1 year ago

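Data preparation, updated (Davinci, expensive version)

The previous code cannot generate data in batches; use the code below instead.

1. First, run `export OPENAI_API_KEY=<your-api-key>` on the command line. The key originally posted here belongs to GanYuqing; use your own key, otherwise her quota will be used up.
2. Then save the code below as test.py and run `python test.py`.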
```python
import openai
import os
import json

# Create an API key on the OpenAI website and store it in an environment variable
openai.api_key = os.environ["OPENAI_API_KEY"]

# Prepare the prompts to send to the model
prompts = [
    "请你为我生成一段描述秋天的诗句：",  # "Write me a few lines of poetry describing autumn:"
    "给我一个关于人生哲理的句子：",  # "Give me a sentence about the philosophy of life:"
    "写一句简短的情话：",  # "Write a short romantic line:"
]

# Loop over the prompts, send each one to the OpenAI API, and collect the responses
results = []
for prompt in prompts:
    response = openai.Completion.create(
        engine="davinci",
        prompt=prompt,
        max_tokens=256,
        n=1,
        stop=None,
        temperature=0.7,
    )

    # Extract the generated text from the response
    generated_text = response.choices[0].text.strip()

    # Combine the prompt and the generated text into a dict and append it to results
    result = {
        "instruction": prompt,
        "input": "",
        "output": generated_text,
    }
    results.append(result)

# Convert results to JSON and print it
json_data = json.dumps(results, ensure_ascii=False)
print(json_data)

# Save results to results.json
with open("results.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False)
```
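
A batch loop like the one above stops at the first failed request and loses everything generated so far. One possible mitigation is to wrap each call in a retry with exponential backoff. The sketch below is an illustration only: `call_with_retry` and `flaky_generate` are hypothetical names, and the stand-in function simulates failures instead of calling the OpenAI API.

```python
import time


def call_with_retry(fn, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs); on an exception wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn(*args, **kwargs)
        except Exception:
            # Give up and re-raise once the last attempt has failed.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Stand-in for the API call inside the loop: fails twice, then succeeds.
calls = {"count": 0}


def flaky_generate(prompt):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated transient failure")
    return f"generated text for: {prompt}"


# Record the delays instead of actually sleeping so the demo runs instantly.
delays = []
text = call_with_retry(flaky_generate, "hello", sleep=delays.append)
print(text, delays)
```

In the real loop, the request would be passed as `fn` with its keyword arguments, and `sleep` would be left at its default.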

3. You can replace the entries in this part of the code with your own questions:

```python
prompts = [
    "请你为我生成一段描述秋天的诗句：",
    "给我一个关于人生哲理的句子：",
    "写一句简短的情话：",
]
```

4. Here are some example prompts; you can try generating data this way:
```python
 messages=[
                {"role": "system", "content": "You are a professor in the field of ["+self.key_word+"] who is good at asking and answering questions, and also good at summarizing papers using concise statements"},
                {"role": "assistant", "content": "This is the title, author, link, abstract and introduction of an English document. I need your help to read and answer the following questions: "+clip_text},
                {"role": "user", "content": """                 
                 1. Mark the title of the paper (with Chinese translation)
                 2. list all the authors' names (use English)
                 3. mark the first author's affiliation (output {} translation only)                 
                 4. mark the keywords of this article (use English)
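                 5. Describe the topic in one word, and explain the basic concept behind that word. (with Chinese translation)
                 6. You are a teacher who is very good at presentations. Based on the topic from item 5, generate an outline. (with Chinese translation)
                 7. You are a teaching assistant. Explain this article to the teacher, following the outline given in item 6. (with Chinese translation)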
                 Follow the format of the output that follows:                  
                 1. Title: xxx\n\n
                 2. Authors: xxx\n\n
                 3. Affiliation: xxx\n\n                 
                 4. Keywords: xxx\n\n   
                 5. xxx \n\n      
                 6. Teacher: \n\n
                    - (1)xxx;\n 
                        - detail: xxx;\n
                    - (2)xxx;\n 
                        - detail: xxx;\n
                    - (3)xxx;\n  
                        - detail: xxx;\n
                    xxx \n\n  
                 7. Teaching assistant: \n\n
                    - (1)xxx:
                        - xxx;\n 
                    - (2)xxx:
                        - xxx;\n 
                    - (3)xxx;\n   
                        - xxx \n\n
                 """},
 ]
```

CherylQWong commented 1 year ago

Prompt for instruction generation

You are a potential book purchaser browsing a bookstore website. You want to ask ChatGPT some questions (give GPT certain instructions) to learn more about a book. The questions/instructions should meet the following requirements:

1. Written in English, in 1 or 2 sentences; both imperative sentences and questions are allowed.
2. Try not to repeat verbs across instructions, and maximize the diversity of the instructions.
3. There should also be variety in the tone of the instructions.
4. The types of instructions should be diverse, such as: brainstorming, open QA, closed QA, rewrite, extract, generation, classification, chat, and summarization.
5. The GPT language model should be able to complete these instructions. For example, the instructions should not involve audio, video, pictures, or links, because the GPT model cannot perform these operations.
6. The questions should have good depth and relate to the book's concrete contents. You could refer to specific books such as Plato's The Republic.

Here are some good examples:

1. Describe the main ideas explored in the book.
2. Analyze the main characters in the book and the roles they play.
3. What is the overall structure of the book?
4. Has the book had an influence on later philosophical or literary works?
5. Are there any specific passages or dialogues within the book that are particularly notable?
6. Are there any related works by the author or other similar authors that you might suggest to someone who enjoyed this book?
7. What historical or philosophical context does the book fit into?

Please directly list 50 questions/instructions without any other explanations.
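
The prompt asks for a plain numbered list, so the reply has to be split back into individual instructions before it can be used. The following is a rough sketch of that step; `parse_numbered_list` is a hypothetical helper, and the sample reply is made up from the example questions above.

```python
import re


def parse_numbered_list(text):
    """Extract items from lines like '1. ...' or '2) ...', dropping case-insensitive duplicates."""
    items, seen = [], set()
    for line in text.splitlines():
        # Accept an optional space after the number marker, since models format lists inconsistently.
        match = re.match(r"^\s*\d+\s*[.)]\s*(.+?)\s*$", line)
        if not match:
            continue
        item = match.group(1)
        key = item.lower()
        if key not in seen:
            seen.add(key)
            items.append(item)
    return items


reply = """1. Describe the main ideas explored in the book.
2.What is the overall structure of the book?
3. Describe the main ideas explored in the book.
"""
print(parse_numbered_list(reply))
```

Deduplicating at this point keeps repeated questions from several generation runs out of the final instruction file.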

CherylQWong commented 1 year ago

Generate outputs (GPT-3.5-turbo, dirt-cheap version)

```python
import openai
import os
import json
import pandas as pd

# Create an API key on the OpenAI website; here it is read from the OPENAI_API_KEY environment variable
openai.api_key = os.environ["OPENAI_API_KEY"]

# Prepare the prompts to send to the ChatGPT model (every instruction is prefixed with the intro)
instruct = pd.read_csv("/Users/wq/Desktop/plp_Dataset/Final_Instructions.txt", sep="?", header=None)
instruct = list(instruct[0])
instruct

intro = "You are the Chatbot of a book seller, a potential book purchaser want to ask you some questions to know better about the book: Republic written by Plato. Please answer these questions with good logic and languages(within 150 words), just like you are chatting with the purchaser:"

prompts = []
for i in instruct:
    i = intro + "  " + i
    prompts.append(i)

prompts

# Loop over the prompts, send each one to the OpenAI API, and collect the responses
results = []
for prompt in prompts:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        n=1,
        stop=None,
        temperature=0.7,
    )

    # Extract the text generated by ChatGPT from the response
    generated_text = response["choices"][0]["message"]["content"]

    # Combine the prompt and the generated text into a dict and append it to results
    result = {
        "instruction": prompt,
        "input": "",
        "output": generated_text,
    }
    results.append(result)

# Convert results to JSON and print it
json_data = json.dumps(results, ensure_ascii=False)
print(json_data)

# Save results to results.json
with open("results.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False)
```
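
The script above stores the full prompt (intro plus question) in the `instruction` field and reads the questions with `sep="?"`, which drops the question marks. If the bare question is wanted in the dataset instead, the two could be kept apart, as in this sketch. It assumes one instruction per line in the text file, which the thread does not confirm; `load_instructions`, `build_prompts`, the demo file, and the short intro are all hypothetical.

```python
import tempfile
from pathlib import Path


def load_instructions(path):
    """Read one instruction per line, skipping blank lines and keeping punctuation."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_prompts(instructions, intro):
    """Pair each bare instruction with the full prompt (intro + instruction) sent to the model."""
    return [(question, f"{intro}  {question}") for question in instructions]


# Demo with a temporary file standing in for Final_Instructions.txt.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "instructions.txt"
    demo_file.write_text(
        "What is the overall structure of the book?\n\nWho is Socrates? Why does he matter?\n",
        encoding="utf-8",
    )
    pairs = build_prompts(load_instructions(demo_file), intro="Answer as a bookseller's chatbot:")

for question, prompt in pairs:
    print(repr(question), "->", repr(prompt))
```

With pairs like these, the request would use `prompt` while the saved record could use `question` as its `instruction`, depending on what the fine-tuned model should see at inference time.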

QiaolingChen00 / NUS_plp_project

Plan #1
