Closed zachary-kaelan closed 1 year ago
- Open Access Paper PDF Retrieval
- Text/Metadata Extraction from PDF
- Vector embedding of paper content
- Auto-generated short summary of the paper
Duplicate Significant-Gravitas/Auto-GPT-Plugins#34
@ntindle In addition to reading PDFs, this ticket wants to integrate a search engine for scientific papers, and also focuses on comprehension.
Migrate to Auto-GPT-Plugins
@Androbin That's an excellent list, thank you! And I noticed that Semantic Scholar has the SPECTER embeddings and SciTLDR summaries for the papers, which cuts down how much we have to build ourselves. I'll have to get in contact with them and make an updated issue in the plugins repo.
@zachary-kaelan Please note that the search endpoint does not return the `embedding` and `tldr` fields; only the details endpoint does. The `abstract` and/or `tldr` may also be missing for some papers. And the `embedding` returned by the API is not compatible with the open source SPECTER model, so you can compare two papers but not a paper to a query.
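Comparing two papers via those API embeddings is plain cosine similarity. A minimal stdlib sketch — the `embedding.vector` shape is modeled on the Semantic Scholar Graph API details endpoint, so treat the exact field names as assumptions:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def paper_similarity(details_a, details_b):
    """Compare two papers using the `embedding` object from the details endpoint
    (assumed shape: {"embedding": {"model": ..., "vector": [...]}})."""
    return cosine_similarity(details_a["embedding"]["vector"],
                             details_b["embedding"]["vector"])
```

This is exactly the paper-to-paper comparison described above; a paper-to-query comparison would additionally require embedding the query with a compatible model, which the API does not provide.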
Duplicates
Summary 💡
The agent has a command to search for scientific papers, which finds papers matching the query whose full text is publicly available somewhere, and returns a list of them along with their abstracts. This could be done with a vector database of embedded abstracts.
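A minimal sketch of such a search command against the Semantic Scholar Graph API — the endpoint and the `isOpenAccess`/`openAccessPdf` fields come from its public documentation, but error handling, pagination, and API keys are omitted:

```python
import json
import urllib.parse
import urllib.request

SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"

def search_open_access(query: str, limit: int = 20):
    """Search Semantic Scholar and keep only papers with a public full-text PDF."""
    params = urllib.parse.urlencode({
        "query": query,
        "limit": limit,
        "fields": "title,abstract,isOpenAccess,openAccessPdf",
    })
    with urllib.request.urlopen(f"{SEARCH_URL}?{params}") as resp:
        payload = json.load(resp)
    return filter_open_access(payload)

def filter_open_access(payload: dict):
    """Return (title, abstract, pdf_url) for results that have an open PDF."""
    return [
        (p["title"], p.get("abstract"), p["openAccessPdf"]["url"])
        for p in payload.get("data", [])
        if p.get("isOpenAccess") and p.get("openAccessPdf")
    ]
```

The vector-database variant would embed the returned abstracts and rank them against the query embedding instead of relying on the API's keyword ranking.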
A second command is used to download the paper and convert it from PDF to a text format. Unfortunately, papers containing math equations often have them as LaTeX-rendered images, which require some fancy OCR to extract back into LaTeX.
Scientific papers are also often bigger than the context window, so they would need to be condensed section-by-section into something that fits into a prompt while preserving sufficient information for GPT.
Examples 🌈
GPT-3.5 doesn't really "get" text condensation and basically just summarizes, dropping crucial math equations while refusing to compromise readability. Davinci does much better, but it needs a good prompt and is almost as expensive as GPT-4.
Here's what I used:
Convert this excerpt from a scientific paper - which I have converted to LaTeX - to a condensed telegraphic style that avoids use of definite or indefinite articles, punctuation, and other words unnecessary for comprehension of the text. Compress it to the fewest number of words possible while retaining enough information for you (GPT) to reconstruct the text later.
There's probably a better way to do this, but I don't know what that is.
Motivation 🔦
As noted in Significant-Gravitas/Auto-GPT-Plugins#34, Auto-GPT currently cannot handle PDFs. The software contracting company I work for targets a variety of unique, difficult problems, and every project has a "References" folder filled with scientific papers, all in PDF format. Unrestricted access to scientific literature will be essential for Auto-GPT to perform well on more complex tasks.