The AI Alignment Research Dataset is a collection of documents related to AI Alignment and Safety, drawn from books, research papers, and alignment-related blog posts. This is a work in progress: components are still being cleaned up so they can be updated more regularly. The most current version is available on HuggingFace at StampyAI/alignment-research-dataset. This repository contains the code to reproduce it.
Here is the list of sources along with sample contents:

- `agisf` - recommended readings from AGI Safety Fundamentals
- `aisafety.info` - Stampy's FAQ
- `arxiv` - relevant research papers
- `blogs` - entire websites automatically scraped
- `eaforum` - selected posts
- `lesswrong` - selected posts
- `special_docs` - individual documents curated from various resources
- `youtube` - playlists & channels
All entries contain the following keys:

- `id` - string of unique identifier
- `source` - string of data source listed above
- `title` - string of document title
- `authors` - list of strings
- `text` - full text of document content
- `url` - string of valid link to text content
- `date_published` - in UTC format

Additional keys may be available depending on the source document.
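For illustration, a single entry looks roughly like the following (the values here are made up for this sketch, and the exact extra keys vary by source):

```python
# Illustrative entry only - field values are examples, not real dataset contents.
example_entry = {
    "id": "0a1b2c3d4e5f",                       # unique identifier (shortened here)
    "source": "arxiv",                          # one of the sources listed above
    "title": "Concrete Problems in AI Safety",
    "authors": ["Dario Amodei", "Chris Olah"],  # list of strings
    "text": "Full text of the document goes here...",
    "url": "https://arxiv.org/abs/1606.06565",
    "date_published": "2016-06-21T00:00:00Z",   # UTC timestamp (format may vary)
}
```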
git clone https://github.com/StampyAI/alignment-research-dataset
cd alignment-research-dataset
Duplicate the provided `.env.example` to create your environment configuration:
cp .env.example .env
This `.env` file contains placeholders for several configuration options. Further details about how to configure them are in the Configuration section.
pip install -r requirements.txt
Optional: For testing purposes, you can also install testing dependencies:
pip install -r requirements-test.txt
Initialize a MySQL database. To spin up a Docker container with the database initialised, run the following:
./local_db.sh
Various subcomponents of this project rely on external services and therefore need credentials. These are set via environment variables; the `.env` file is the central location for these settings.
The log level can be configured with the `LOG_LEVEL` environment variable. The default level is `WARNING`.
To update the Stampy portion of the dataset, you will need a Coda token. Follow these instructions:

1. Go to coda.io
2. Create an account and log in
3. Go to the API SETTINGS section of your account settings, and select `Generate API token`. Give your API token a name, and add the following restrictions:
   1. Type of restriction: Doc or table
   2. Type of access: Read only
   3. Doc or table to grant access to: https://coda.io/d/_dfau7sl2hmG
4. Copy this token to your `.env` file: `CODA_TOKEN="<coda_token>"`

It will then be accessible in `align_data/stampy/stampy.py`.
The datasets are stored in MySQL. The connection string can be configured via the `ARD_DB_USER`, `ARD_DB_PASSWORD`, `ARD_DB_HOST`, `ARD_DB_PORT` and `ARD_DB_NAME` environment variables in `.env`. A local database can be started in Docker by running:

./local_db.sh
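As a rough sketch of how these variables fit together, the snippet below composes a SQLAlchemy-style connection URL from them. The `mysql+mysqldb` driver prefix and the fallback defaults are assumptions for illustration; the repository's own configuration code is authoritative:

```python
import os

# Illustrative only: build a MySQL connection URL from the ARD_DB_* variables.
# The driver prefix and the fallback defaults below are assumptions, not the
# project's exact configuration.
user = os.environ.get("ARD_DB_USER", "user")
password = os.environ.get("ARD_DB_PASSWORD", "password")
host = os.environ.get("ARD_DB_HOST", "127.0.0.1")
port = os.environ.get("ARD_DB_PORT", "3306")
db_name = os.environ.get("ARD_DB_NAME", "alignment_research_dataset")

connection_uri = f"mysql+mysqldb://{user}:{password}@{host}:{port}/{db_name}"
print(connection_uri)
```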
For Pinecone updates to work, you'll need to configure the API key:

1. Register a Pinecone API key
2. Create an index (named whatever is set as `PINECONE_INDEX_NAME`) with the `dotproduct` metric and `1536` dimensions
3. Set `PINECONE_API_KEY` to the key from step 1
4. Set `PINECONE_ENVIRONMENT` to whatever is the environment of your index
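As a sketch of the index setup described above, the classic `pinecone-client` interface can create an index with the required metric and dimensions. The default index name used here is only a placeholder for whatever `PINECONE_INDEX_NAME` is set to, and newer Pinecone client versions use a different API:

```python
import os

import pinecone  # classic pinecone-client interface; newer client versions differ

# Illustrative sketch: create an index matching the settings described above.
pinecone.init(
    api_key=os.environ["PINECONE_API_KEY"],
    environment=os.environ["PINECONE_ENVIRONMENT"],
)

# Placeholder default - use whatever PINECONE_INDEX_NAME is set to.
index_name = os.environ.get("PINECONE_INDEX_NAME", "alignment-research-dataset")
if index_name not in pinecone.list_indexes():
    pinecone.create_index(index_name, dimension=1536, metric="dotproduct")
```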
To autopopulate the metadata files, you'll need Google Cloud credentials. This is a Google system, so of course it is complicated and prone to arbitrary changes, but as of writing this the process is:

1. In the Google Cloud console, go to `+ Create Credentials` and follow the `Create and Continue` flow to create credentials for your project
2. Go to `+ Create Credentials` again, select API key, and add the created API key as your `YOUTUBE_API_KEY`

Once you have working credentials, you will be able to fetch data from public sheets and gdrive. For writing to sheets and drives, or accessing private ones within the code, you will need to request permissions from the owner of the particular sheet/gdrive.
There are a couple of data sources that consist of singular articles (html, pdfs, ebooks, etc.), rather than all the contents of a given website. These are managed in Google Sheets. It's assumed that the contents of that document are clean, in that all required fields are set, and that there is a `source_url` pointing to a valid document. Rather than having to manually fill these fields, there is a magical script that automatically populates them from a messy input worksheet, which contains all kinds of info. The script requires an OpenAI API key, set in `.env` as `OPENAI_API_KEY`. The airtable we currently scrape is https://airtable.com/appbiNKDcn1sGPGOG/shro9Bx4f2i6QgtTM/tblSicSC1u6Ifddrq. #TODO: document how this is done / make it reproducible
To run tests, run the following from the root directory:
pytest .
There are various commands available to interact with the datasets:
Before running most commands, start the MySQL database in a separate terminal:
./local_db.sh
Listing all datasets:
python main.py list
Fetching a specific dataset:
Replace `[DATASET_NAME]` with the desired dataset. The optional `--rebuild` parameter removes the previous build before running, scraping everything from scratch; otherwise, only new files are scraped.
python main.py fetch [DATASET_NAME] --rebuild
Fetching all datasets:
Again, the optional `--rebuild` parameter allows you to scrape everything from scratch.
python main.py fetch-all --rebuild
Getting a summary of a merged dataset:
Replace `[MERGED_DATASET_PATH]` with your dataset's path. You'll get the dataset's total token count, word count and character count.
python main.py count-tokens [MERGED_DATASET_PATH]
Updating the metadata in the metadata spreadsheet: You can optionally pass the names and ids of the input and output sheets; by default, the values defined in `align_data/settings.py` are used.
python main.py update_metadata
python main.py update_metadata <input spreadsheet id> <input sheet name> <output spreadsheet id>
Updating the pinecone index with newly modified entries:
Replace `[DATASET_NAME]` with one or more dataset names whose entries you want to embed and add to the Pinecone index. The optional `--force_update` parameter updates all of the dataset's articles rather than only the newly fetched ones.
python main.py pinecone_update [DATASET_NAME] --force_update
Or run it on all articles, as seen below. Using `--force_update` is not recommended in this case.
python main.py pinecone_update_all
Adding a new dataset consists of:

- subclassing `AlignmentDataset` to implement any additional functionality needed, within `align_data/sources/`
- adding the new dataset to `DATASET_REGISTRY` so it can be found

The `AlignmentDataset` class is the main workhorse for processing datasets. The basic idea is that it provides a list of items to be processed, and after processing a given item, creates an article object, which is added to the MySQL database. The `AlignmentDataset` class has various methods that can be implemented to handle various cases. A few assumptions are made as to the data it will use, i.e.:

- `self.data_path` is where data will be written to and read from - by default it's the `data/` directory
- `self.raw_data_path` is where downloaded files etc. should go - by default it's the `data/raw` directory
- `self.files_path` is where data to be processed is expected to be. This is used e.g. when a collection of html files are to be processed

The `AlignmentDataset` is a dataclass, so it has a couple of settings that control it:

- `name` - a string that identifies the dataset, e.g. 'lesswrong'
- `done_key` - used to check if a given item has already been processed
- `COOLDOWN` - an optional number of seconds to wait between processing items - this is useful e.g. when fetching items from an API in order to avoid triggering rate limits

The basic processing flow is:

1. `self.setup()` - any instance-level initialization should go here, e.g. fetching zip files with data
2. `self._load_outputted_items()` - goes through articles in the database, loads the value of their `self.done_key`, and outputs a simplified version of these strings using `normalize_url`
3. `self.items_list` - returns a list of items to be processed
4. `self.fetch_entries()` - for each of the resulting items:
   - `self.get_item_key(item)` extracts the item's key
   - `self.process_entry(item)` is called to get an article, which is then yielded

There are Datasets defined for various types of data sources - first check if any of them match your use case. If so, it's just a matter of adding a new entry to the `__init__.py` module of the appropriate data source. If not, you'll have to add your own - use the preexisting ones as examples. Either way, you should end up with an instance of an `AlignmentDataset` subclass added to one of the registries. If you add a new registry, make sure to add it to `align_data.DATASET_REGISTRY`.
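To make the above concrete, here is a rough, hypothetical sketch of what a new dataset subclass might look like. The import path, the imaginary JSON feed, and the plain-dict return value of `process_entry` are illustrative assumptions rather than the project's exact API; the existing sources under `align_data/sources/` are the authoritative reference:

```python
# Hypothetical sketch only - see the existing sources in align_data/sources/
# for the real patterns; the names and signatures below are illustrative.
from dataclasses import dataclass

import requests

# Assumed import path - adjust to wherever AlignmentDataset actually lives.
from align_data.common.alignment_dataset import AlignmentDataset


@dataclass
class ExampleBlogDataset(AlignmentDataset):
    """Fetches posts from a hypothetical JSON feed of blog posts."""

    feed_url: str = "https://example.com/posts.json"  # placeholder URL
    done_key: str = "url"  # items are deduplicated by their URL
    COOLDOWN: int = 1      # wait a second between items to avoid rate limits

    @property
    def items_list(self):
        # Return the raw items to be processed - here, a list of post dicts.
        return requests.get(self.feed_url, timeout=30).json()["posts"]

    def get_item_key(self, item):
        # The key is compared against already-processed entries via done_key.
        return item["url"]

    def process_entry(self, item):
        # Turn a raw item into an article object. The real code yields the
        # project's article/data-entry object; a plain dict stands in here.
        return {
            "source": self.name,
            "title": item["title"],
            "authors": item.get("authors", []),
            "text": item["body"],
            "url": item["url"],
            "date_published": item.get("published_at"),
        }


# The instance would then be added to the appropriate registry, e.g.:
EXAMPLE_REGISTRY = [ExampleBlogDataset(name="example_blog")]
```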
The scraper code and dataset are maintained by StampyAI / AI Safety Info. Learn more or join us on the Rob Miles AI Discord server.
The code is based on https://github.com/moirage/alignment-research-dataset. You can download version 1.0 of the dataset here. For more information, here is the paper and LessWrong post. Please use the following citation when using the dataset:
Kirchner, J. H., Smith, L., Thibodeau, J., McDonnell, K., and Reynolds, L. "Understanding AI alignment research: A Systematic Analysis." arXiv preprint arXiv:2022.4338861 (2022).