Sefaria / Sefaria-Project

New Interfaces for Jewish Texts
https://www.sefaria.org

Large language Model: Development #1818

Closed a00110110011 closed 5 months ago

a00110110011 commented 5 months ago

I am looking to add the data from the MongoDB to an existing model, Phi-2 from Microsoft. I will be using the uncensored version, as most models have built-in blocks against talking about religion, and when they do talk about it they are biased against the Torah and its concepts. They tend not to answer, stating that the information goes against the policies of the LLM. To show the bias, you can run a simple test: ask "My wife killed my dog! What do I do?" and follow up with "My husband killed my dog! What do I do?" This pair of questions by itself shows that LLMs are biased. This will not work with Torah content, as it is all going to be blocked. The only real way to get Torah on the system will be to build our own.
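As a rough illustration of the first step (turning stored texts into training data), here is a minimal sketch that flattens one text document into per-segment records and serializes them as JSONL. It assumes each document is a dict with `title`, `language`, and a nested-list `chapter` field; that shape, the record field names, and the helper names are assumptions made for the example, and actually reading the documents out of MongoDB is left out.

```python
import json
from typing import Any, Iterable, Iterator


def iter_segments(node: Any, path: tuple = ()) -> Iterator[tuple]:
    """Walk a nested list of strings, yielding (1-based index path, text)."""
    if isinstance(node, str):
        if node.strip():  # skip empty placeholder segments
            yield path, node.strip()
    elif isinstance(node, list):
        for i, child in enumerate(node, start=1):
            yield from iter_segments(child, path + (i,))


def to_records(doc: dict) -> Iterator[dict]:
    """Turn one document (assumed shape, see lead-in) into flat records."""
    for path, text in iter_segments(doc.get("chapter", [])):
        yield {
            "ref": f"{doc['title']} {':'.join(str(n) for n in path)}",
            "language": doc.get("language"),
            "text": text,
        }


def to_jsonl(docs: Iterable[dict]) -> str:
    """Serialize all records as JSON Lines, keeping Hebrew text unescaped."""
    return "\n".join(
        json.dumps(record, ensure_ascii=False)
        for doc in docs
        for record in to_records(doc)
    )
```

Keeping the original position of each segment in `ref`, even when neighbouring segments are empty, makes it possible to line up a source text with its translations and commentaries later.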

I have started by preparing dates to add to the dataset, so I have all the dates from year 0 and their corresponding Gregorian and Muslim dates. The next step will be to prepare the dataset for the training model to learn from the Sefaria library.

I have a few questions for anyone else who might be interested in helping me out with this project. Does anyone have an idea for a better model to use other than Phi-2? Would anyone else be interested in helping prepare datasets with me? This is a huge undertaking and requires a lot of preparation before we can actually start training the model. The model needs the original text, an explanation of the text, rules for the text, etc. I was planning to start "rules up", meaning that we start training the LLM from fixed rules, like Rambam's laws. These are largely accepted as the basis of the observable Torah and have a daily impact on day-to-day life. I was then thinking of doing the same with the Shulchan Aruch. Once all the datasets are inserted, we can move on to the commentary and train the LLM on the alternative opinions of the Rabbis.
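To make the "original text, explanation, rules" requirement concrete, here is one possible shape for a single training example and a way to turn it into a prompt/response pair. The class, the field names, and the prompt wording are all hypothetical; this is only a sketch of what a "rules up" dataset entry might look like, not a format any particular training tool requires.

```python
from dataclasses import dataclass


@dataclass
class RuleExample:
    """One 'rules up' training example (hypothetical schema)."""
    source_ref: str      # where the rule comes from
    original_text: str   # the source text in its original language
    rule: str            # the fixed rule derived from the text
    explanation: str     # a plain-language explanation of the rule


def to_training_pair(example: RuleExample) -> dict:
    """Format an example as a prompt/response pair for fine-tuning."""
    prompt = (
        f"Source: {example.source_ref}\n"
        f"Text: {example.original_text}\n"
        "State the rule in this text and explain it."
    )
    response = f"Rule: {example.rule}\nExplanation: {example.explanation}"
    return {"prompt": prompt, "response": response}
```

Keeping the rule and the explanation as separate fields means the same records could later be reused for other tasks, such as asking the model to find the source for a given rule.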

Another training method I would like to try is to train the model with "skip codes" (gematria). All of this is stage one.

Stage 2-20: Progressive Training

Expand to Other Halachic Texts: Include other fundamental halachic texts like Mishnah Berurah, Aruch HaShulchan, etc. Incorporate various legal opinions and interpretations to provide a comprehensive understanding.

Incorporate Aggadah and Midrash: Introduce narrative and non-legalistic texts to capture the storytelling aspects of the Torah. Include Midrashim and Aggadot for a more holistic view.

Historical Context: Integrate historical texts and contextual information to provide a better understanding of the circumstances in which certain laws or events occurred.

Ethical and Philosophical Texts: Include works that delve into ethical and philosophical discussions within Judaism, such as the Mesilat Yesharim, Guide for the Perplexed, etc.

Kabbalistic Texts: If your audience is interested, you might consider introducing Kabbalistic texts gradually. This includes works like the Zohar and other Kabbalistic commentaries.

Comparative Religion: Include information about comparative religions, but be careful to maintain sensitivity and accuracy.

Modern Halachic Decisions: Integrate modern halachic decisions and responsa to bridge the gap between traditional texts and contemporary issues.

Customs and Minhagim: Explore the customs and traditions of different Jewish communities.

Q&A from Responsa Literature: Train the model on Q&A sessions from responsa literature to address specific scenarios and questions.

Incorporate Hebrew and Aramaic Language Understanding: Improve the model's understanding of Hebrew and Aramaic language intricacies to handle original texts more accurately.

How You Can Contribute: If you have expertise in language models or AI, your input on model selection and training methodologies would be invaluable. Torah scholars and enthusiasts, your knowledge in providing accurate datasets and rule-based training is crucial.

Connect with Me: Feel free to reach out on WhatsApp at +972559958440. Your collaboration could bring Torah wisdom to the forefront of AI, fostering a richer understanding for everyone.
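For the "skip codes" (gematria) idea from stage one, a small sketch may help anyone who wants to prepare that data. Gematria is usually understood as summing the numeric values of Hebrew letters, while skip codes usually refer to reading every n-th letter of a text, so the example shows both as separate functions. It uses the standard letter values, with final forms counted the same as their regular letters; the function names are made up for the example, and vowel points and punctuation are simply ignored.

```python
# Standard gematria values; final forms share the value of the regular letter.
GEMATRIA = {
    "א": 1, "ב": 2, "ג": 3, "ד": 4, "ה": 5, "ו": 6, "ז": 7, "ח": 8, "ט": 9,
    "י": 10, "כ": 20, "ך": 20, "ל": 30, "מ": 40, "ם": 40, "נ": 50, "ן": 50,
    "ס": 60, "ע": 70, "פ": 80, "ף": 80, "צ": 90, "ץ": 90,
    "ק": 100, "ר": 200, "ש": 300, "ת": 400,
}


def hebrew_letters(text: str) -> list:
    """Keep only Hebrew letters, dropping vowel points, spaces, punctuation."""
    return [ch for ch in text if ch in GEMATRIA]


def gematria(text: str) -> int:
    """Sum the standard letter values of all Hebrew letters in the text."""
    return sum(GEMATRIA[ch] for ch in hebrew_letters(text))


def skip_sequence(text: str, start: int, skip: int, length: int) -> str:
    """Read `length` letters, taking every `skip`-th letter from `start`."""
    if skip < 1:
        raise ValueError("skip must be a positive integer")
    letters = hebrew_letters(text)
    return "".join(letters[start::skip][:length])
```

For example, `gematria("חי")` adds 8 and 10 to give 18, and `skip_sequence("אבגדהו", 0, 2, 3)` picks the first, third, and fifth letters.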

Let's make this project a collective success! 🤝📖

Warm regards, Moshe Israel

saengel commented 5 months ago

Hi @a00110110011, thank you so much for your interest in working with Sefaria's data. This sounds like an interesting project. Please feel free to reach out with any specific issues you come across. Best to contact us at developers@sefaria.org for these types of inquiries moving forward. Thanks!