Closed opvexe closed 1 month ago
I also encountered the same problem. Did you solve it?
I encountered the same problem.
I also encountered the same problem. Does anyone know how to solve it?
Seems related to the news that Google's Bard is using ShareGPT's data for training:
https://www.theverge.com/2023/3/29/23662621/google-bard-chatgpt-sharegpt-training-denies
Though I'm not sure how true the news is, the current ShareGPT does not have a place to "share" unless you directly paste the link to other social media like Twitter.
Can you please say more about what you're trying to achieve here with this API? You're correct that this endpoint is disabled
I am a student and want to call the API to get some data for my research.
@domeccleston It's a new trend now. Download 2/3 synthetic dataset + train on Llama = DIY CHADGPT
Yes, this data can greatly help us ordinary people build our own ChatGPT.
@domeccleston Can you open it?
It's a new trend now. Download 2/3 synthetic dataset + train on Llama = DIY CHADGPT
Can you link me a guide that walks me through the exact steps to do this?
I need more context here. Please link me something or email domeccleston@gmail.com.
Can't promise anything, but if you help me understand why this data is valuable to you, I can evaluate.
This data can help us train ChatGPT-like models on our own devices, which will facilitate the democratization of AI. Can you open it?
This data can help us turn "Close AI" into a real Open AI.
@domeccleston Let me give you some of the resources for my thesis: https://vicuna.lmsys.org/ , https://github.com/nomic-ai/gpt4all , https://crfm.stanford.edu/2023/03/13/alpaca.html . We are all trying to replicate and evaluate these research efforts.
@domeccleston We really need this data; otherwise AI will only become less open.
@domeccleston +1 This data is really valuable. If you could host a data dump, that would be really helpful.
@Lisennlp, @ari9dam, @shumintao, @Ejafa, @genggui001, @chinoll
I understand there may be reasons for ShareGPT to close the endpoint. But for those looking for similar datasets, GPT4All has an extended training set under an Apache license. From what I can tell, it contains some prompts from GPT-3 (maybe GPT-3-Turbo). You may just have to preprocess it using an approach similar to the one described in GPT4All's technical report before training your model (a fresh LLaMA or Vicuna model, for example).
GPT4ALL: https://github.com/nomic-ai/gpt4all Technical report: https://s3.amazonaws.com/static.nomic.ai/gpt4all/2023_GPT4All_Technical_Report.pdf GPT4ALL extended training dataset: https://huggingface.co/datasets/nomic-ai/gpt4all_prompt_generations_with_p3
FYI, there is a difference between the GPT4All/Alpaca datasets and ShareGPT: ShareGPT is a multi-turn dialogue dataset generated by diverse users, while the others are single-turn interactions between a human and GPT.
Only ShareGPT's data is multi-turn and reflects real interaction with humans. If you really can't enable the API, sharing a data dump would also be great.
Sharing the data dump is actually better.
I agree, a data dump would be the best alternative of all this. It's ultimately up to the creators of ShareGPT I suppose, as there may be security concerns (not all who shared conversations may have done so publicly).
If not, the next step would be for others to create a cloned version of ShareGPT and give it some time to grow to a similar size. It does suck that some get access to this type of public data, only for it to no longer be public after release. It will only make others more secretive about their data practices.
+1 I also think that this data would greatly help improve real open source models.
I agree with that. I believe this is the only publicly shared collection of real ChatGPT prompt data.
Hi, also a researcher here. If the data of this repo could be shared, for example in the form of a data dump, it would greatly contribute to our understanding of how contemporary people interact with generative AIs: what their primary goals are, and what kinds of problems they run into when interacting with AIs. The known problems of hallucination and stigmatized answers pose a threat to the development of AI models, yet this type of knowledge is currently proprietary to OpenAI.
@ari9dam May I ask a question? Even though the ShareGPT data is multi-turn, when we convert it into fine-tuning data we still need to change it to the instruction/input/output format, where the input is a sentence and the output is a sentence. Is there a good way to train on dialogues?
Here's an example of what I mean:

user: I want you to act as a resume editor. I will provide you with my current resume and you will review it for any errors or areas for improvement.....
gpt4: Sure, I'd be happy to help you with your resume! Please send me the resume so I can begin reviewing it.
user: xxx University of California, Berkeley xxxCo-Founder, Markit.ai Co. xxx
gpt4: Here are my suggestions for your resume: xxxGeneral:
user: xxxx
gpt4: yyyy
In order to make it trainable by Alpaca etc., we need to convert it to single-turn instead of multi-turn:
Instruction: I want you to act as a resume editor. I will provide you with my current resume and you
input: xxx University of California, Berkeley xxxCo-Founder, Markit.ai Co. xxx
output: Here are my suggestions for your resume: xxxGeneral:
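The conversion described above can be sketched in a few lines of Python. This is a minimal sketch, not an official tool: the field names (`"from"`, `"value"`, `"human"`, `"gpt"`) follow the commonly circulated ShareGPT dump format and may need adjusting for your copy, and the function `sharegpt_to_alpaca` is a hypothetical helper, not part of any released pipeline.

```python
def sharegpt_to_alpaca(conversation):
    """Flatten a multi-turn ShareGPT conversation into single-turn
    Alpaca-style records (instruction / input / output).

    The first human turn becomes the instruction; each later human
    turn becomes the input paired with the assistant reply after it.
    """
    records = []
    instruction = None
    pending_input = None
    for turn in conversation:
        if turn["from"] == "human":
            if instruction is None:
                instruction = turn["value"]
                pending_input = ""  # first reply has no separate input
            else:
                pending_input = turn["value"]
        elif turn["from"] == "gpt" and instruction is not None:
            records.append({
                "instruction": instruction,
                "input": pending_input or "",
                "output": turn["value"],
            })
            pending_input = None
    return records

# Toy conversation mirroring the resume-editor example above
convo = [
    {"from": "human", "value": "I want you to act as a resume editor."},
    {"from": "gpt",   "value": "Sure, please send me the resume."},
    {"from": "human", "value": "xxx University of California, Berkeley xxx"},
    {"from": "gpt",   "value": "Here are my suggestions for your resume: xxx"},
]
for rec in sharegpt_to_alpaca(convo):
    print(rec)
```

Note that this pairing throws away the intermediate dialogue context, which is exactly the loss being asked about; an alternative is to concatenate all earlier turns into the input field so each record keeps its history, at the cost of longer sequences.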
@genggui001 I found this one as well https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered
This is quite amazing. And that isn't even counting the efforts by others to scrape Twitter for ShareGPT links.
If anyone wants to take a stab at scraping this site, they have around 80k GPT conversations:
It's somewhat more unstructured than ShareGPT, but it has a lot of stuff in it too.
I was thinking that if the conversation data isn't opened up, I might fork the repo, build a new ShareGPT, and share the data.
This is an open-source project. People share their conversations here, intending for them to be viewed by the public. The dataset is not personal property!
@shumintao Please go for it. We could easily gain support from many research institutes.
+1 ! The data would be amazing for different tinkering tasks.
GET /api/conversations (for fetching conversations). Can you open it?