StampyAI / alignment-research-dataset

Stampy's copy of Alignment Research Dataset scraper
https://huggingface.co/datasets/StampyAI/alignment-research-dataset
MIT License

AI Alignment Research Dataset

The AI Alignment Research Dataset is a collection of documents related to AI Alignment and Safety from various books, research papers, and alignment-related blog posts. This is a work in progress: components are still being cleaned so that they can be updated more regularly. The most current version is available on HuggingFace at StampyAI/alignment-research-dataset. This repository contains the code to reproduce it.

Sources

Here is the list of sources along with sample contents:

Keys

All entries contain the following keys:

Additional keys may be available depending on the source document.

Development Environment

1. Clone the repository:

git clone https://github.com/StampyAI/alignment-research-dataset
cd alignment-research-dataset

2. Set up Environment Variables:

Duplicate the provided .env.example to create your environment configuration:

cp .env.example .env

This .env file contains placeholders for several configuration options. Further details about how to configure them are in the Configuration section.

3. Install Dependencies:

pip install -r requirements.txt

Optional: For testing purposes, you can also install testing dependencies:

pip install -r requirements-test.txt

4. Database Setup:

Initialize a MySQL database. To spin up a Docker container with the database already initialised, run the following:

./local_db.sh

Configuration

Various subcomponents in this project rely on external services, so they need credentials to be set. This is done via environment variables. The file .env is the central location for these settings.
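As a rough illustration, the sketch below reports which of the variables named in this README are unset. It is not part of the project: the helper name `missing_settings` and the `DOCUMENTED_VARS` list are assumptions made for this example, and it reads the process environment rather than parsing .env itself. Not every variable is needed for every subcomponent, so treat the result as a hint.

```python
import os

# Variables mentioned in this README. Whether each one is required depends
# on which subcomponent you run; this list is an assumption for illustration.
DOCUMENTED_VARS = [
    "CODA_TOKEN",
    "ARD_DB_USER",
    "ARD_DB_PASSWORD",
    "ARD_DB_HOST",
    "ARD_DB_PORT",
    "ARD_DB_NAME",
    "PINECONE_API_KEY",
    "PINECONE_ENVIRONMENT",
    "OPENAI_API_KEY",
    "YOUTUBE_API_KEY",
]


def missing_settings(env=None, names=DOCUMENTED_VARS):
    """Return the names that are absent or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    for name in missing_settings():
        print(f"{name} is not set")
```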

Logging

The log level can be configured with the LOG_LEVEL environment variable. The default level is 'WARNING'.
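A minimal sketch of how a LOG_LEVEL variable with a 'WARNING' default could be applied using the standard logging module. The function name `configure_logging` is hypothetical, and the project's own setup code may differ, for example in how it treats unrecognised values.

```python
import logging
import os


def configure_logging(env=None):
    """Set the root log level from LOG_LEVEL, defaulting to WARNING."""
    env = os.environ if env is None else env
    level_name = env.get("LOG_LEVEL", "WARNING").upper()
    # Look up the numeric level; fall back to WARNING for unknown names
    # (this fallback is a choice made for the sketch).
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        level = logging.WARNING
    logging.basicConfig(level=level)
    return level
```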

Coda

To update the stampy portion of the dataset, you will need a Coda token. Follow these instructions:

  1. Go to coda.io
  2. Create an account and log in
  3. Go to the API SETTINGS section of your account settings, and select Generate API token. Give your API token a name, and add the following restrictions:
    1. Type of restriction: Doc or table
    2. Type of access: Read only
    3. Doc or table to grant access to: https://coda.io/d/_dfau7sl2hmG
  4. Copy this token to your .env file: CODA_TOKEN="<coda_token>"

It will then be accessible in align_data/stampy/stampy.py.

MySQL

The datasets are stored in MySQL. The connection string can be configured via the ARD_DB_USER, ARD_DB_PASSWORD, ARD_DB_HOST, ARD_DB_PORT and ARD_DB_NAME environment variables in .env. A local database can be started in Docker by running:

./local_db.sh
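To make the role of the five ARD_DB_* variables concrete, here is a sketch that assembles them into a database URL. The `mysql://` scheme, the function name `build_db_url`, and the fallback values are all assumptions for illustration; the project may use a different driver prefix or defaults.

```python
import os
from urllib.parse import quote_plus


def build_db_url(env=None):
    """Assemble a MySQL URL from the ARD_DB_* environment variables."""
    env = os.environ if env is None else env
    # The fallback values are placeholders, not the project's real defaults.
    user = env.get("ARD_DB_USER", "user")
    password = env.get("ARD_DB_PASSWORD", "")
    host = env.get("ARD_DB_HOST", "127.0.0.1")
    port = env.get("ARD_DB_PORT", "3306")
    name = env.get("ARD_DB_NAME", "database")
    # quote_plus escapes characters such as '@' or '/' in the credentials,
    # which would otherwise be read as URL delimiters.
    return f"mysql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{name}"
```

Escaping the user and password matters because a password containing `@` would otherwise split the URL in the wrong place.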

Pinecone

For Pinecone updates to work, you'll need to configure the API key:

  1. Get an API key, as described here
  2. Create a Pinecone index named "stampy-chat-ard" (or whatever is set as PINECONE_INDEX_NAME) with the dotproduct metric and 1536 dimensions
  3. Set the PINECONE_API_KEY to the key from step 1
  4. Set the PINECONE_ENVIRONMENT to whatever is the environment of your index

Google API

To autopopulate the metadata files, you'll need Google Cloud credentials. This is a Google system, so of course it is complicated and prone to arbitrary changes, but at the time of writing the process is:

  1. Go to the Google Cloud Console
  2. Create a new project or select an existing project.
  3. Enable the APIs the project needs, such as Google Sheets
  4. Navigate to the "Credentials" section, and to + Create Credentials.
  5. Select "Service Account"
  6. Fill in the required information for the service account:
    1. A descriptive name, a short service account ID, and description. Press Create and Continue
    2. Leave the optional sections empty
  7. At https://console.cloud.google.com/apis/credentials?project=, select your new Service Account, and go to the KEYS section. Select ADD KEY, "Create New Key", the JSON key type and click "Create".
  8. The JSON file containing your credentials will be downloaded. Save it as credentials.json in the top-level directory of the project.
  9. Again in the "Credentials" section, + Create Credentials, select API key, and add the created API key as your YOUTUBE_API_KEY.

Once you have working credentials, you will be able to fetch data from public sheets and gdrive. To write to sheets and drives, or to access private ones from within the code, you will need to request permission from the owner of the particular sheet/gdrive.
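Before running anything that depends on credentials.json, a quick sanity check can catch a wrong download (for example an OAuth client file instead of a service account key). The sketch below uses only the standard library and checks a few fields that service account key files contain; the helper name `check_service_account_file` is hypothetical and the check does not contact Google.

```python
import json
from pathlib import Path

# A subset of the fields found in a service account JSON key.
EXPECTED_FIELDS = ("type", "client_email", "private_key")


def check_service_account_file(path="credentials.json"):
    """Return a list of problems found in the key file (empty if none)."""
    data = json.loads(Path(path).read_text())
    problems = [f"missing field: {name}" for name in EXPECTED_FIELDS if name not in data]
    if data.get("type") != "service_account":
        problems.append("type is not 'service_account'")
    return problems


if __name__ == "__main__":
    for problem in check_service_account_file():
        print(problem)
```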

Metadata updates

There are a couple of data sources that consist of individual articles (HTML, PDFs, ebooks, etc.) rather than all the contents of a given website. These are managed in Google Sheets. It's assumed that the contents of that sheet are clean, in that all required fields are set and there is a source_url pointing to a valid document. Rather than having to fill in these fields manually, there is a magical script that automatically populates them from a messy input worksheet, which contains all kinds of info.

OpenAI API

  1. Go to the OpenAI API website. Create an account if needed, and add payment information if needed.
  2. In https://platform.openai.com/account/api-keys, create a new secret key or use an existing one.
  3. Add this secret key to the .env, as OPENAI_API_KEY.

Airtable API

The Airtable we currently scrape is https://airtable.com/appbiNKDcn1sGPGOG/shro9Bx4f2i6QgtTM/tblSicSC1u6Ifddrq. #TODO: document how this is done / reproducible

Testing

To run tests, from the root directory run:

pytest .

CLI Usage

There are various commands available to interact with the datasets:

Adding New Datasets

Adding a new dataset consists of:

  1. Subclassing AlignmentDataset to implement any additional functionality needed, within align_data/sources/
  2. Creating an instance of your class somewhere, such as an __init__.py file (you can take inspiration from other such files)
  3. Adding the instance to DATASET_REGISTRY so it can be found
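The three steps above can be sketched as follows. This is a self-contained illustration, not the project's code: the `AlignmentDataset` class here is a tiny stand-in for the real base class, and `ExampleBlog`, `process_entry`, `EXAMPLE_REGISTRY` and `get_dataset` are hypothetical names chosen for the example. Check the real base class for the methods it expects you to override.

```python
from dataclasses import dataclass


@dataclass
class AlignmentDataset:
    """Stand-in for the real base class; only what this sketch needs."""

    name: str

    @property
    def items_list(self):
        return []

    def process_entry(self, item):
        raise NotImplementedError


# Step 1: subclass the base class and add source-specific behaviour.
@dataclass
class ExampleBlog(AlignmentDataset):
    url: str = "https://example.org"

    @property
    def items_list(self):
        # A real source would discover these by scraping or via an API.
        return [f"{self.url}/post-1", f"{self.url}/post-2"]

    def process_entry(self, item):
        return {"source": self.name, "url": item}


# Step 2: create an instance, e.g. in the package's __init__.py.
EXAMPLE_REGISTRY = [ExampleBlog(name="example_blog")]

# Step 3: make the instance discoverable through the top-level registry.
DATASET_REGISTRY = [*EXAMPLE_REGISTRY]


def get_dataset(name):
    """Look up a registered dataset by name."""
    for dataset in DATASET_REGISTRY:
        if dataset.name == name:
            return dataset
    raise KeyError(name)
```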

AlignmentDataset class

This is the main workhorse for processing datasets. The basic idea is that it provides a list of items to be processed, and after processing a given item, creates an article object, which is added to the MySQL database. The AlignmentDataset class has various methods that can be implemented to handle various cases. A few assumptions are made as to the data it will use, i.e.:

The AlignmentDataset is a dataclass, so it has a couple of settings that control it:

The basic processing flow is:

  1. self.setup() - any instance-level initialization should go here, e.g. fetching zip files with data
  2. self._load_outputted_items() - goes through articles in the database, loads the value of their self.done_key, and outputs a simplified version of these strings using normalize_url
  3. self.items_list - returns a list of items to be processed.
  4. self.fetch_entries() - for each of the resulting items:
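A simplified, self-contained sketch of this flow is given below. It makes several assumptions that are not confirmed by this README: already stored items are skipped by comparing simplified done_key values, an in-memory list stands in for the MySQL table, and `simplify_url` is a hypothetical stand-in for normalize_url, whose real behaviour may differ.

```python
from dataclasses import dataclass, field


def simplify_url(url):
    """Hypothetical stand-in for normalize_url: drop scheme, 'www.' and trailing '/'."""
    url = url.strip().lower()
    for prefix in ("https://", "http://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
    if url.startswith("www."):
        url = url[4:]
    return url.rstrip("/")


@dataclass
class FlowSketch:
    done_key: str = "url"
    # Stands in for the articles already stored in the database.
    stored: list = field(default_factory=list)

    def setup(self):
        # Instance-level initialization, e.g. downloading raw data.
        self.raw_items = [
            {"url": "https://example.org/a/"},
            {"url": "https://example.org/b"},
        ]

    def _load_outputted_items(self):
        # Collect the simplified done_key of every stored article.
        return {simplify_url(article[self.done_key]) for article in self.stored}

    @property
    def items_list(self):
        return self.raw_items

    def fetch_entries(self):
        self.setup()
        done = self._load_outputted_items()
        for item in self.items_list:
            # Assumption: items whose key is already stored are skipped.
            if simplify_url(item[self.done_key]) in done:
                continue
            yield {"url": item["url"], "text": "..."}
```

With `FlowSketch(stored=[{"url": "http://www.example.org/a"}])`, only the entry for `https://example.org/b` is yielded, because both spellings of the first URL simplify to the same key.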

Adding a new instance

There are Datasets defined for various types of data sources, so first check whether any of them match your use case. If so, it's just a matter of adding a new entry to the __init__.py module of the appropriate data source. If not, you'll have to add your own, using the preexisting ones as examples. Either way, you should end up with an instance of an AlignmentDataset subclass added to one of the registries. If you add a new registry, make sure to add it to align_data.DATASET_REGISTRY.

Contributing

The scraper code and dataset are maintained by StampyAI / AI Safety Info. Learn more or join us on Rob Miles AI Discord server.

Citing the Dataset

The code is based on https://github.com/moirage/alignment-research-dataset. You can download version 1.0 of the dataset here. For more information, here is the paper and LessWrong post. Please use the following citation when using the dataset:

Kirchner, J. H., Smith, L., Thibodeau, J., McDonnell, K., and Reynolds, L. "Understanding AI alignment research: A Systematic Analysis." arXiv preprint arXiv:2022.4338861 (2022).