datacite / corpus-data-file

Code and steps used to generate the Data Citation Corpus dump file
MIT License

Data Citation Corpus

This project generates data dumps in JSON and CSV formats.

Requirements

Before running this project, please ensure that you have the following requirements installed on your machine:

Setup

To set up the project, follow these steps:

  1. Clone the repository: git clone git@github.com:datacite/corpus-data-file.git
  2. Navigate to the project directory: cd corpus-data-file
  3. Create a .env file from the example (cp .env.example .env) and add your database credentials

How to run the scripts

Make scripts executable

e.g. chmod +x ./export-script/create_assertion_formatted_table.sh

Create table with formatted data

./export-script/create_assertion_formatted_table.sh

Generate dump files

./export-script/generate_assertion_details.sh

Process behind generating dump files

The accession_number_validation.py script validates the accession numbers in our database against the regular expressions defined for each repository.
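The validation idea can be sketched as follows. This is only an illustration, not the project's code: the repository names and patterns in `REPOSITORY_PATTERNS` and the helper `is_valid_accession` are hypothetical, and the real expressions live in the script itself.

```python
import re

# Hypothetical patterns for illustration only; the project's actual
# regular expressions are not reproduced here.
REPOSITORY_PATTERNS = {
    "example-repo-a": re.compile(r"[A-Z]{2}\d{6}"),
    "example-repo-b": re.compile(r"PRJ\d+"),
}


def is_valid_accession(repository, accession):
    """Return True if the accession fully matches the repository's pattern."""
    pattern = REPOSITORY_PATTERNS.get(repository)
    if pattern is None:
        # Unknown repository: there is no pattern to validate against.
        return False
    # fullmatch requires the whole string to match, not just a prefix.
    return pattern.fullmatch(accession) is not None
```

Using `fullmatch` rather than `match` avoids accepting a valid prefix followed by trailing garbage, which is usually what is wanted when checking identifiers.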

Setup

  1. Ensure you have Python 3 installed on your system.

  2. Navigate to the script directory:

    cd accession_number_validation

  3. Install the required Python packages:

    pip install -r requirements.txt

  4. Create a .env file in the project root and add database credentials:

    touch .env

    Open the .env file and add the following lines:

    DB_NAME=<database_name>
    DB_USER=<database_username>
    DB_PASSWORD=<database_password>
    DB_HOST=<database_host>
    DB_PORT=<database_port>
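Before running the script, it can help to check that the .env file defines all five variables. The sketch below uses only the standard library and is an assumption about how one might do this; `parse_env` and `missing_keys` are hypothetical helpers, not part of the project, and the script may load its settings differently.

```python
# Keys listed in the .env template above.
REQUIRED_KEYS = ("DB_NAME", "DB_USER", "DB_PASSWORD", "DB_HOST", "DB_PORT")


def parse_env(text):
    """Parse KEY=VALUE lines into a dict, skipping blank lines and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=".
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def missing_keys(settings):
    """Return the required keys that are absent or empty, in template order."""
    return [key for key in REQUIRED_KEYS if not settings.get(key)]
```

For example, `missing_keys(parse_env(open(".env").read()))` returns an empty list when every credential is set.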

Running the Script

To run the accession_number_validation.py script, use the following command:

python accession_number_validation.py