AkariAsai / self-rag

This includes the original implementation of SELF-RAG: Learning to Retrieve, Generate and Critique through self-reflection by Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi.
https://selfrag.github.io/
MIT License

Is there any Chinese training data? #36

Closed mawenju203 closed 3 months ago

fate-ubw commented 5 months ago

Definitely not; Self-RAG only provides English data. If you need Chinese training data, you will have to run the training data creation process yourself.

mawenju203 commented 5 months ago

https://huggingface.co/datasets/selfrag/selfrag_train_data

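May I ask whether the processing pipeline and the raw data for this dataset are available?

I also have another question.

[image]

Is the implementation described in the pseudocode the same as the following code?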

# Assumes query_3, format_prompt, model, and sampling_params are defined earlier
# (the generate / outputs[0].text calls below match the vLLM API).
from passage_retriever import Retriever
retriever = Retriever({})
retriever.setup_retriever_demo(
    "facebook/contriever-msmarco",
    "enwiki_2020_intro_only/enwiki_2020_dec_intro_only.jsonl",
    "enwiki_2020_intro_only/enwiki_dec_2020_contriever_intro/*",
    n_docs=5, save_or_load_index=False)
# Retrieve the top 5 passages and build one prompt per passage.
retrieved_documents = retriever.search_document_demo(query_3, 5)
prompts = [format_prompt(query_3, doc["title"] + "\n" + doc["text"]) for doc in retrieved_documents]
preds = model.generate(prompts, sampling_params)
# Print the top-ranked passage next to the prediction generated from the first prompt.
top_doc = retriever.search_document_demo(query_3, 1)[0]
print("Reference: {0}\nModel prediction: {1}".format(top_doc["title"] + "\n" + top_doc["text"], preds[0].outputs[0].text))
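
The snippet above calls format_prompt without defining it. A minimal sketch of such a helper follows, assuming an instruction/response prompt layout and a [Retrieval]<paragraph>...</paragraph> wrapper around the retrieved passage; the exact template is an assumption and should be checked against the repository before use.

```python
from typing import Optional


def format_prompt(instruction: str, paragraph: Optional[str] = None) -> str:
    """Build a Self-RAG style prompt (assumed template, see note above)."""
    # Instruction/response scaffold that the model is assumed to have been trained on.
    prompt = "### Instruction:\n{0}\n\n### Response:\n".format(instruction)
    if paragraph is not None:
        # Present the retrieved passage as if the model had already asked for retrieval.
        prompt += "[Retrieval]<paragraph>{0}</paragraph>".format(paragraph)
    return prompt
```

Calling format_prompt(query) without a passage yields a prompt with no retrieved evidence, which leaves the retrieval decision to the model.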
mawenju203 commented 5 months ago

Thanks.

AkariAsai commented 5 months ago

Hi, thank you so much for answering the question, @fate-ubw (I just answered your question, by the way!)

@mawenju203 Hi, thanks for your interest. We don't have any Chinese training data. It would be exciting to see Self-RAG applied to other languages, though!

Regarding the second question (I used Google Translate, and it said you asked whether the demo code is the same as the pseudo-code): the code snippet is a simple interface for running Self-RAG, so it is not the same as the original inference logic. If you are interested, please take a look at the run_long_form_static.py script.
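
To make the difference concrete: the demo simply prints the first prediction, whereas the Self-RAG paper describes ranking the per-passage candidates using the probabilities the model assigns to reflection tokens. The sketch below is a simplified illustration of that idea only; it is not the code in run_long_form_static.py, and the Candidate class, the function names, and the weights are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    text: str                      # continuation generated for one retrieved passage
    token_probs: Dict[str, float]  # probability assigned to each reflection token


def group_score(probs: Dict[str, float], weights: Dict[str, float], group: List[str]) -> float:
    """Weighted share of probability mass within one reflection-token group."""
    total = sum(probs.get(tok, 0.0) for tok in group)
    if total == 0.0:
        return 0.0
    return sum(w * probs.get(tok, 0.0) for tok, w in weights.items()) / total


def critique_score(c: Candidate, w_rel: float = 1.0, w_sup: float = 1.0) -> float:
    """Combine relevance and support signals (weights are illustrative assumptions)."""
    rel = group_score(c.token_probs, {"[Relevant]": 1.0}, ["[Relevant]", "[Irrelevant]"])
    sup = group_score(
        c.token_probs,
        {"[Fully supported]": 1.0, "[Partially supported]": 0.5},
        ["[Fully supported]", "[Partially supported]", "[No support / Contradictory]"],
    )
    return w_rel * rel + w_sup * sup


def pick_best(candidates: List[Candidate]) -> Candidate:
    """Return the candidate with the highest critique score."""
    return max(candidates, key=critique_score)
```

In this sketch a candidate whose passage is judged relevant and whose answer is judged supported outranks the others, instead of always taking the output for the first retrieved passage.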

Aman-4-Real commented 4 months ago

I have trained a Chinese version of Self-RAG based on Baichuan2-7B-Chat, which you can download from here. All the reflection tokens are the same as the English version. I hope you find this helpful :).
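
Because the reflection tokens are unchanged, output from a Chinese model can be post-processed the same way as English output. A minimal sketch, assuming the token strings listed below (taken from the reflection token types in the Self-RAG paper) and a made-up example output; split_reflection_tokens is a hypothetical helper, not part of the repository.

```python
import re
from typing import List, Tuple

# Reflection tokens assumed to be shared by the English and Chinese models.
REFLECTION_TOKENS = [
    "[Retrieval]", "[No Retrieval]", "[Continue to Use Evidence]",
    "[Relevant]", "[Irrelevant]",
    "[Fully supported]", "[Partially supported]", "[No support / Contradictory]",
    "[Utility:1]", "[Utility:2]", "[Utility:3]", "[Utility:4]", "[Utility:5]",
]
_TOKEN_RE = re.compile("|".join(re.escape(t) for t in REFLECTION_TOKENS))
_PARAGRAPH_RE = re.compile(r"<paragraph>.*?</paragraph>", flags=re.S)


def split_reflection_tokens(output: str) -> Tuple[str, List[str]]:
    """Return (answer text without markup, reflection tokens in order of appearance)."""
    tokens = _TOKEN_RE.findall(output)
    text = _TOKEN_RE.sub("", output)
    text = _PARAGRAPH_RE.sub("", text)  # drop the echoed retrieved passage
    return text.strip(), tokens


# Made-up example output, for illustration only.
example = (
    "[Retrieval]<paragraph>北京是中华人民共和国的首都。</paragraph>"
    "[Relevant]中国的首都是北京。[Fully supported][Utility:5]"
)
answer, tokens = split_reflection_tokens(example)
```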

AkariAsai commented 3 months ago

Thank you so much for the info! Great to hear people tested Self-RAG in other languages :) I'm closing this issue now but feel free to reopen it!

hummingbird2030 commented 2 months ago

Thanks for your great work! Could you provide Chinese training data?

Aman-4-Real commented 2 months ago

Yes! I just uploaded a file containing about 40,000 ("4w") constructed training examples, which you can find and download from Hugging Face.

Youngphone commented 2 months ago

Thanks for your great work! Could you provide the code you used to construct the data and train selfrag-zh_baichuan2_7b_chat?

Aman-4-Real commented 1 month ago

I just used the original data creation pipeline in this repo; by following it, you can apply the same process to your own SFT datasets.