Closed zachary-kaelan closed 1 year ago
- Open Access Paper PDF Retrieval
- Text/Metadata Extraction from PDF
- Vector embedding of paper content
- Auto-generated short summary of the paper
Duplicate Significant-Gravitas/Auto-GPT-Plugins#34
@ntindle In addition to reading PDFs, this ticket wants to integrate a search engine for scientific papers, and also focuses on comprehension.
Migrate to Auto-GPT-Plugins
@Androbin That's an excellent list, thank you! And I noticed that Semantic Scholar has the SPECTER embeddings and SciTLDR summaries for the papers, which cuts down how much we have to build ourselves. I'll have to get in contact with them and make an updated issue in the plugins repo.
@zachary-kaelan Please note that the search endpoint does not return the `embedding` and `tldr` fields; only the details endpoint does. The `abstract` and/or `tldr` may also be missing for some papers. And the `embedding` returned by the API is not compatible with the open source SPECTER model, so you can compare two papers but not a paper to a query.
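Comparing two papers via those API embeddings is plain cosine similarity. A minimal stdlib sketch — the `embedding.vector` shape is modeled on the Semantic Scholar Graph API details endpoint, so treat the exact field names as assumptions:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def paper_similarity(details_a, details_b):
    """Compare two papers using the `embedding` object from the details endpoint
    (assumed shape: {"embedding": {"model": ..., "vector": [...]}})."""
    return cosine_similarity(details_a["embedding"]["vector"],
                             details_b["embedding"]["vector"])
```

This is exactly the paper-to-paper comparison described above; a paper-to-query comparison would additionally require embedding the query with a compatible model, which the API does not provide.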
Duplicates
Summary 💡
The agent has a command to search for scientific papers, which finds papers matching the query whose full text is publicly available somewhere, and returns a list of them along with their abstracts. This could be done with a vector database of embedded abstracts.
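A minimal sketch of such a search command against the Semantic Scholar Graph API — the endpoint and the `isOpenAccess`/`openAccessPdf` fields come from its public documentation, but error handling, pagination, and API keys are omitted:

```python
import json
import urllib.parse
import urllib.request

SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"

def search_open_access(query: str, limit: int = 20):
    """Search Semantic Scholar and keep only papers with a public full-text PDF."""
    params = urllib.parse.urlencode({
        "query": query,
        "limit": limit,
        "fields": "title,abstract,isOpenAccess,openAccessPdf",
    })
    with urllib.request.urlopen(f"{SEARCH_URL}?{params}") as resp:
        payload = json.load(resp)
    return filter_open_access(payload)

def filter_open_access(payload: dict):
    """Return (title, abstract, pdf_url) for results that have an open PDF."""
    return [
        (p["title"], p.get("abstract"), p["openAccessPdf"]["url"])
        for p in payload.get("data", [])
        if p.get("isOpenAccess") and p.get("openAccessPdf")
    ]
```

The vector-database variant would embed the returned abstracts and rank them against the query embedding instead of relying on the API's keyword ranking.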
A second command is used to download the paper and convert it from PDF to a text format. Unfortunately, papers containing math equations often have them as LaTeX-rendered images, which require some fancy OCR to extract back into LaTeX.
Scientific papers are also often bigger than the context window, so they would need to be condensed section-by-section into something that fits into a prompt while preserving sufficient information for GPT.
Examples 🌈
GPT-3.5 doesn't really "get" text condensation and basically just summarizes, dropping crucial math equations while refusing to compromise readability. Davinci does much better, but it needs a good prompt and is almost as expensive as GPT-4.
Here's what I used:
Convert this excerpt from a scientific paper - which I have converted to LaTeX - to a condensed telegraphic style that avoids use of definite or indefinite articles, punctuation, and other words unnecessary for comprehension of the text. Compress it to the fewest number of words possible while retaining enough information for you (GPT) to reconstruct the text later.
There's probably a better way to do this, but I don't know what that is.
Motivation 🔦
As noted in Significant-Gravitas/Auto-GPT-Plugins#34, Auto-GPT currently cannot handle PDFs. The software contracting company I work for targets a variety of unique, difficult problems, and every project has a "References" folder filled with scientific papers, all in PDF format. Unrestricted access to scientific literature will be essential for Auto-GPT to perform well on more complex tasks.