Closed Blue7771 closed 3 years ago
I know that this will be controversial, but my suggestion would be to assign this compilation something like a 10% weight in the Pile. I think that these books are the most effective counter to the meme-like and juvenile behavior that GPT-3 often exhibits because of the low quality of its training material. I am not sure that the Pile is currently doing enough to counter it.
If we define maturity as the opposite of juvenile behaviour, then increasing the quantity, quality, and weight of the resources that can be viewed as defining mature conduct could be the only way of improving a future model's behaviour over GPT-3.
Truth be told, such a compilation should be tested at increasing weights, such as 10%, 50%, and 80%, to see whether and how behaviour improves.
Thank you for the suggestion. If you follow the instructions in this repo and process the data, we will look at including it in our future work. Unfortunately, we are not taking new additions to the Pile at this time, however.
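For concreteness, "processing the data" would presumably mean converting the raw text into the archives the Pile's pipeline consumes. A minimal sketch, assuming the usual lm_dataformat workflow; the directory names and metadata field here are hypothetical examples, not from this issue:

```python
# Minimal sketch: package plain-text files into lm_dataformat archives,
# the format the Pile's processing pipeline consumes.
# Directory names and metadata fields are hypothetical examples.
from pathlib import Path
from lm_dataformat import Archive

out_dir = Path("buddhist_texts_out")
out_dir.mkdir(parents=True, exist_ok=True)
archive = Archive(str(out_dir))

for txt_file in Path("buddhist_texts_raw").rglob("*.txt"):
    text = txt_file.read_text(encoding="utf-8", errors="ignore")
    if text.strip():
        # Each document is stored with optional metadata alongside its text.
        archive.add_data(text, meta={"source_file": txt_file.name})

archive.commit()  # writes compressed .jsonl.zst shards into out_dir
```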
It is not feasible to scale this dataset to 10% of the Pile because of its small size. Doing so would give these words 1000x the weight of the original text, which would cause heavy distortions.
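To put the scale of that upweighting in concrete terms, here is a rough back-of-the-envelope sketch. Both sizes are assumptions used purely for illustration: the Pile's raw text taken as roughly 800 GB and the compilation as the ~130 MB mentioned in this issue.

```python
# Back-of-the-envelope estimate of how many copies of a small dataset are
# needed for it to make up a target fraction of the training mix.
# Both sizes are rough assumptions used purely for illustration.
PILE_SIZE_GB = 800.0      # approximate raw size of the Pile (assumption)
DATASET_SIZE_GB = 0.13    # the ~130 MB compilation from this issue

def copies_needed(target_fraction: float) -> float:
    """Copies of the dataset required for it to be `target_fraction` of the total.

    Solves: copies * d / (rest + copies * d) = f  for `copies`.
    """
    rest = PILE_SIZE_GB - DATASET_SIZE_GB
    return target_fraction * rest / (DATASET_SIZE_GB * (1.0 - target_fraction))

for f in (0.10, 0.50, 0.80):
    print(f"{f:.0%} of the mix -> ~{copies_needed(f):,.0f} copies")
# 10% of the mix -> ~684 copies
# 50% of the mix -> ~6,153 copies
# 80% of the mix -> ~24,611 copies
```

Even under these rough assumptions, a 10% share means repeating the 130 MB compilation on the order of several hundred to a thousand times, and the 50% and 80% weights suggested above push that into the thousands or tens of thousands.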
I don't have coding experience, so that is unfortunately not possible for me.
However, my point still stands. If a 10% weight would cause distortions, then the largest weight and epoch count just below the point where distortions appear should be found and used.
This could be the only way to counter GPT-3-like behaviour.
It could be. It could not be. But on balance it seems almost surely false that familiarity with Buddhist teachings is a necessary and sufficient condition for developing an AI aligned to human values. If someone wishes to implement the dataset, that would be lovely, but until someone expresses such interest, I am going to close this issue.
A compilation of possibly the best Early Buddhist resources in English that are legally available for free distribution. Around 130MB. GPT-Neo might end up benefitting in unexpected ways from this compilation.
https://drive.google.com/file/d/1kU4xaonF_Kp6ZM3d5y51Q3BYddqJmlkf/view?usp=sharing