AGIHouse/openscience - GitHub Issues

Open Science Mission

Create an open-source datastore of research data built to back research-powered generative AI applications. Our first data sources will be arXiv and bioRxiv. We will power clean research datastore creation and access, AI-augmented research processes, and fully autonomous generative research.

Tiers:

Cleaned Dataset (dataset; ETL tool): Papers standardized in JSON, usable for LLM training.
Research exploration & retrieval (tool): Search for relevant research.
AI Augmented Research Process (product): Research paper writing (http://arxivgen.com): from prompt, from data. Automated research reviews & criticism: paper reviews, proposal reviews. Automating the evaluation of ‘new’ knowledge. Experiment writeups. Evaluation of scientific claims. Research creativity: generating novel methods, novel hypotheses, novel explanations for results, and novel evaluation methodology.
Generative Research (product): Knowledge-generating agents & algorithms.

Deliverables (Output):

Open-source GitHub repo with tools to scrape and parse sources of scientific data.
Repositories of data which are legally easy to host.
The repo should allow contributing developers to easily add research functionality, especially ETL functionality. We’ll enable researchers to upload embeddings and search methods for new sources of data.

Unified JSON Schema & API

We will publish a new, code-friendly JSON representation of all papers in OpenScience. This will make all paper text and metadata accessible through an easy-to-use API, and will ensure that all data formats are standardized.

Data to collect & harmonize:

The full text of all arXiv papers.
Passage embeddings for the full text of all arXiv papers.
Paper author metadata.
Paper title data.
Paper citation data.
Publication date.
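
To make the fields above concrete, here is a minimal sketch of what one paper record might look like in the unified JSON representation. The class and field names (`PaperRecord`, `Passage`, `paper_id`, and so on), the sample values, and the `example-model` embedding key are illustrative assumptions, not the project's published schema.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Dict, List


@dataclass
class Passage:
    """One segment of a paper's full text plus its embeddings (hypothetical layout)."""
    text: str
    # Maps an embedding model name to that model's vector for this passage.
    embeddings: Dict[str, List[float]] = field(default_factory=dict)


@dataclass
class PaperRecord:
    """Hypothetical unified record; field names are assumptions, not a published schema."""
    paper_id: str
    source: str
    title: str
    authors: List[str]
    publication_date: str  # ISO 8601 date string, e.g. "2023-01-01"
    full_text: str = ""
    citations: List[str] = field(default_factory=list)
    passages: List[Passage] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into the nested Passage dataclasses.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "PaperRecord":
        data = json.loads(raw)
        data["passages"] = [Passage(**p) for p in data.get("passages", [])]
        return cls(**data)


# Placeholder values only; this is not real paper data.
record = PaperRecord(
    paper_id="example-0001",
    source="arxiv",
    title="An Example Paper",
    authors=["A. Author", "B. Author"],
    publication_date="2023-01-01",
    full_text="First passage. Second passage.",
    passages=[Passage(text="First passage.", embeddings={"example-model": [0.1, 0.2]})],
)
round_tripped = PaperRecord.from_json(record.to_json())
```

Keeping embeddings keyed by model name is one possible way to store several embedding types per passage side by side; a real schema might instead store vectors in a separate index.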

Modular Open Source Toolkit

While pursuing a unified JSON representation, we will make a modular, reusable toolkit which we will open source for broader use.

Toolkit Components:

Scraping scripts: These pull raw data into a mount or folder.
Parsing scripts: Likely dataset-specific parsers that turn raw data into passages and additional metadata. Missing metadata will be handled in the database. Varying segmentation strategies and rules are allowed, e.g., token count with overlap, sentence by sentence with overlap, or passage by passage. Output format: LaTeX -> json.gz. Write a schema with important paper metadata in the JSON files.
Database: Space for similar metadata across data sources. Fields: full text; passages (typed based on their segmentation strategy); authors; title; citations of other papers (human readable, unique ID within this database, citation string); passage embeddings (including various types: all-mpnet-base, ada-002, MiniLM); paper tags (e.g., arXiv categories like Information Retrieval or Biomolecules); source.
Database API: Retrieval functions for paper data.
Embedding Service: Likely a Fargate / Lambda auto-scaling service.
Retrieval Service: Approximate Nearest Neighbors index with an API.
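
As an illustration of the "token count with overlap" segmentation strategy mentioned for the parsing scripts, here is a minimal sketch. It assumes whitespace splitting as a stand-in for a real model tokenizer, and the function name, default sizes, and output keys (`strategy`, `start_token`, `text`) are hypothetical choices rather than part of the toolkit.

```python
from typing import Dict, List


def segment_by_token_count(text: str, max_tokens: int = 200, overlap: int = 50) -> List[Dict]:
    """Split text into windows of at most `max_tokens` whitespace tokens.

    Consecutive windows share `overlap` tokens. Whitespace tokenization is an
    assumption made for this sketch; a real parser would likely use the
    tokenizer of the embedding model.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    tokens = text.split()
    step = max_tokens - overlap
    passages = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + max_tokens]
        passages.append({
            # Tag each passage with its strategy so passages can be typed
            # by segmentation strategy in the database.
            "strategy": "token_count_with_overlap",
            "start_token": start,
            "text": " ".join(window),
        })
        # Stop once this window has reached the end of the text, so no
        # trailing window is emitted that only repeats overlapped tokens.
        if start + max_tokens >= len(tokens):
            break
    return passages


sample = " ".join(str(i) for i in range(10))
chunks = segment_by_token_count(sample, max_tokens=4, overlap=2)
```

With ten tokens, a window of four, and an overlap of two, this produces four passages, each starting two tokens after the previous one.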

Data Sources

First Stage Data Sources

We will collect: arXiv, bioRxiv.

Second Stage Data Sources

In our second stage, we would like to collect: ChemRxiv, PubMed, JSTOR, Nature, Science, Springer, ScienceDirect, Academic Torrents, and other datasets.

Existing Relevant Tools and Companies

Semantic Scholar, Elicit, OpenSyllabus, sCite, Kaggle, Galactica, Metaphor, Alexandria Embeddings, Connected Papers

Examples of knowledge generation:

Deep Learning Guided Discovery of an antibiotic targeting Acinetobacter baumannii
Discovering New Interpretable Conservation Laws as Sparse Invariants

Resources:

Alexandria Embeddings (https://huggingface.co/datasets/macrocosm/arxiv_abstracts), Arxiv sanity lite (https://arxiv-sanity-lite.com/), ArxivGen (http://arxivgen.com/), TXYZ (https://txyz.ai/)

Legal Disclaimers

On Research Data IP: We do not store the exact text of research data. We store code that allows users to download and transform data into a useful format. We store only embeddings of research data, and processes for using those embeddings for search.