Add Alpaca Persian Dataset

LAION-AI / Open-Assistant

OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and can retrieve information dynamically to do so.

https://open-assistant.io

Apache License 2.0

Add Alpaca Persian Dataset #3633

Open pourmand1376 opened 1 year ago

pourmand1376 commented 1 year ago

Hi, over the last two days I have been working on translating Alpaca into Persian (Farsi), and this is the result. I have reviewed the translations and, in my opinion, they are pretty good.

The dataset is still being translated on Kaggle, and that will finish in a couple of days. I will update the datasets accordingly when the translation is complete.

I have added two datasets. One is instruction-based and the other is an Orca-style dataset. I knew how to add the first one, but I don't know how to add the Orca-style dataset to your datasets.

Thank you for your attention.

stefangrotz commented 1 year ago

Hey, great work! I always wanted to translate this dataset to German or Esperanto. The main problem here is that the license of Alpaca isn't usable for open-source LLMs, because ChatGPT does not allow its output to be used to train other models. Because of that, it cannot be used for Open Assistant or for any commercial project.

However, having this dataset is surely useful for training experimental systems and for science projects.

BTW, do you know about the Alpaca Data Cleaned project? It fixed a lot of the errors in the dataset, such as wrong calculations: https://github.com/gururise/AlpacaDataCleaned

pourmand1376 commented 1 year ago

Hi, thanks for your comment.

Yes, I have used the cleaned version.

Sadly, I didn't know about the license restrictions. The dataset itself (Alpaca) is published under Apache 2.0. I have also published my dataset under Apache 2.0.

Isn't that good enough?

stefangrotz commented 1 year ago

Unfortunately not, see https://github.com/gururise/AlpacaDataCleaned#license for details. This is one of the main reasons why OA started building a crowdsourced conversational dataset.

Maybe you can translate the English and Spanish parts of the Open Assistant dataset instead? Both are quite big: https://huggingface.co/datasets/OpenAssistant/oasst1