LAION-AI / Open-Assistant

OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so.
https://open-assistant.io
Apache License 2.0

Proposal: Use scientific papers for data. #1927

Open juanjfndz opened 1 year ago

juanjfndz commented 1 year ago

Proposal: What about using open-access publications from arXiv to train a conversational AI system that can answer questions about scientific papers? Since arXiv encourages authors to choose a liberal license for re-use of their papers (https://info.arxiv.org/help/license/index.html), I think it would be a valuable resource.

Implementation: We can use a question-and-answer format that extracts information from the scientific papers. We could think like a scientist who needs to write a paper (this would also be valuable for essay data). Here are some examples of how the conversational AI system could be used:

Abstract generation: Question: Could you write an abstract for this title: <title>? Answer: Certainly! Here's an example of a possible abstract: <abstract>

Title suggestion: Question: What could be a great title for this abstract: <abstract>? Answer: Here's a possible title for your abstract: <title>

Section content suggestion: Question: I am writing a paper with this title: <title>. I have already written these previous sections: <previous-sections>, and now I have to write this one: <section>. Could you give me an example? Answer: Certainly! Here's an example of what you can write for your section: <content of the section>

And more: summary generation, orthography correction (using a manipulated copy of the text), citation recommendation...

Archibajl commented 1 year ago

That's a great idea. However, if you mix it in with whatever trash the original training data had, it's going to get overwritten with trash. The idea should work on top of a language model that can identify and return language data, but for real results we'd have to build a dataset that exclusively uses that sort of data. You also can't get it all from one journal, and you'd have to train it with prompts too; otherwise it would just spit whole papers out at you without any good reason.

So yes, it can be done, but the dataset needs a lot of work before something can be properly trained with it.
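
For illustration, the question-and-answer templates in the proposal could be filled from paper metadata roughly as follows. This is only a minimal sketch under stated assumptions: the `Paper` record, its field names, and the `make_pairs` helper are hypothetical and are not part of Open-Assistant or of any arXiv tooling, and `<previous-sections>` is assumed here to mean the names of the earlier sections rather than their full text.

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    """Hypothetical record for one open-access paper (field names are assumptions)."""
    title: str
    abstract: str
    # Ordered (section name, section text) pairs.
    sections: list[tuple[str, str]] = field(default_factory=list)


def make_pairs(paper: Paper) -> list[dict[str, str]]:
    """Fill the proposal's question/answer templates from a single paper."""
    pairs = [
        {
            "task": "abstract_generation",
            "question": f"Could you write an abstract for this title: {paper.title}?",
            "answer": f"Certainly! Here's an example of a possible abstract: {paper.abstract}",
        },
        {
            "task": "title_suggestion",
            "question": f"What could be a great title for this abstract: {paper.abstract}?",
            "answer": f"Here's a possible title for your abstract: {paper.title}",
        },
    ]
    # Section suggestion: each section is paired with the names of the sections before it.
    for i, (name, text) in enumerate(paper.sections):
        previous = ", ".join(n for n, _ in paper.sections[:i]) or "none yet"
        pairs.append(
            {
                "task": "section_suggestion",
                "question": (
                    f"I am writing a paper with this title: {paper.title}. "
                    f"I have already written these previous sections: {previous}, "
                    f"and now I have to write this one: {name}. "
                    "Could you give me an example?"
                ),
                "answer": f"Certainly! Here's an example of what you can write for your section: {text}",
            }
        )
    return pairs


# Toy input, invented purely for demonstration.
paper = Paper(
    title="A Toy Study of Example Data",
    abstract="We study example data.",
    sections=[("Introduction", "Example data matters."), ("Methods", "We collected examples.")],
)
pairs = make_pairs(paper)
for pair in pairs:
    print(pair["task"], "|", pair["question"])
```

Each paper yields two pairs plus one per section, so the prompts vary even when the source text is the same. That would still need the dataset curation raised in the reply above (multiple journals, license filtering) before it could be used for training.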