Open juanjfndz opened 1 year ago
That's a great idea. However, if you mix it in with whatever trash the original training data had, it's going to get overwritten with trash. The idea should work on top of a language model that can identify and return language data, but to get real results we'd have to build a dataset that exclusively uses that sort of data. You also can't get it all from one journal, and you'd have to train it with prompts too; otherwise it would just spit whole papers out at you for no good reason. So yes, it can be done, but the dataset needs a lot of work before anything can be properly trained on it.
Proposal: What about using open access publications from arXiv to build a conversational AI system that can answer questions about scientific papers? As arXiv encourages choosing a liberal license for re-use of the papers (https://info.arxiv.org/help/license/index.html), I think it would be a valuable resource.
Implementation: We can use a question-and-answer format that extracts information from the scientific papers. We could think like a scientist who needs to write a paper (this is also valuable as essay data). Here are some examples of how the conversational AI system could be used:
Abstract generation: Question: Could you write an abstract for this title
Title suggestion: Question: What could be a great title for this abstract:?
Answer: Here's a possible title for your abstract:
Section content suggestion: Question: I am writing a paper about this
And more: summary generation, orthography correction (by deliberately corrupting the text and asking for the original), citation recommendation...
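To make the question-and-answer format above more concrete, here is a minimal sketch of how such pairs could be built from paper metadata. Everything in it is an assumption for illustration: the `Paper` record and its field names are hypothetical (not an arXiv schema), the prompt wording simply reuses the example questions above, and the adjacent-letter swap is just one possible way to "manipulate the text" for orthography (spelling) correction.

```python
import random
from dataclasses import dataclass


@dataclass
class Paper:
    # Hypothetical record; the field names are assumptions, not an arXiv schema.
    title: str
    abstract: str


def abstract_generation_pair(paper):
    """Title in the question, abstract as the expected answer."""
    return {
        "question": f"Could you write an abstract for this title: {paper.title}",
        "answer": paper.abstract,
    }


def title_suggestion_pair(paper):
    """Abstract in the question, title as the expected answer."""
    return {
        "question": f"What could be a great title for this abstract: {paper.abstract}",
        "answer": f"Here's a possible title for your abstract: {paper.title}",
    }


def corrupt_text(text, rate, rng):
    """Swap adjacent letters with probability `rate` to simulate typos."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair so it is not swapped back
        else:
            i += 1
    return "".join(chars)


def spelling_correction_pair(paper, rate=0.1, seed=0):
    """Corrupted abstract in the question, clean abstract as the answer."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    corrupted = corrupt_text(paper.abstract, rate, rng)
    return {
        "question": f"Please correct the spelling in this text: {corrupted}",
        "answer": paper.abstract,
    }


def build_pairs(papers):
    """Produce all three kinds of pairs for every paper."""
    pairs = []
    for paper in papers:
        pairs.append(abstract_generation_pair(paper))
        pairs.append(title_suggestion_pair(paper))
        pairs.append(spelling_correction_pair(paper))
    return pairs
```

Calling `build_pairs([Paper(title="...", abstract="...")])` would give three prompt/answer dictionaries per paper. Because every question is phrased as an explicit prompt, this also touches on the concern raised in the reply about the model spitting out whole papers unprompted, though a real dataset would need license filtering and much more varied phrasing.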