This project generates data dumps in JSON and CSV formats.
Before running this project, please ensure that you have the following requirements installed on your machine:
PostgreSQL: If it is not already installed, download it from the official PostgreSQL website.
Python 3: If it is not already installed, download it from the official Python website.
To set up the project, follow these steps:
git clone git@github.com:datacite/corpus-data-file.git
cd corpus-data-file
Create a .env file from the example and add your database credentials:
cp .env.example .env
Then make the export script executable and run the scripts, e.g.:
chmod +x ./export-script/create_assertion_formatted_table.sh
./export-script/create_assertion_formatted_table.sh
./export-script/generate_assertion_details.sh
The bash script generate_assertion_details.sh generates the data dump files. It creates JSON dump files from the formatted table built by create_assertion_formatted_table.sh, then converts each file to CSV using the Python script convert_to_csv.py, following the spec document.
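For orientation, the JSON-to-CSV step could look roughly like the sketch below. This is not the contents of convert_to_csv.py: the function name json_records_to_csv and the sample fields are made up for illustration, and it assumes each dump is a JSON array of flat objects. The real column layout is defined by the spec document.

```python
import csv
import io
import json


def json_records_to_csv(json_text: str) -> str:
    """Convert a JSON array of flat objects into CSV text (illustrative only)."""
    records = json.loads(json_text)
    if not records:
        return ""

    # Column order is the first-seen order of keys across all records.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    # Keys missing from a record are written as empty cells.
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical sample data, not taken from the real dump.
sample = '[{"id": 1, "doi": "10.1234/abc"}, {"id": 2, "accession": "X1"}]'
csv_text = json_records_to_csv(sample)
print(csv_text)
```

Building the header from all records, rather than only the first one, keeps fields that appear later in the file from being dropped.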
The accession_number_validation.py script validates the accession numbers in our database against a set of regular expressions, one set per repository.
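The core idea can be sketched as follows. The repository names and patterns here are placeholders invented for the example, and is_valid_accession is a hypothetical helper; the actual expressions used by the script are not reproduced in this README.

```python
import re

# Placeholder patterns for illustration only; the real ones live in the script.
PATTERNS = {
    "example-repo-a": re.compile(r"[A-Z]{2}\d{6}"),
    "example-repo-b": re.compile(r"PRJ\d+"),
}


def is_valid_accession(repository: str, accession: str) -> bool:
    """Return True if the accession fully matches its repository's pattern."""
    pattern = PATTERNS.get(repository)
    if pattern is None:
        # Unknown repositories are treated as invalid in this sketch.
        return False
    # fullmatch rejects values with extra leading or trailing characters.
    return pattern.fullmatch(accession) is not None


print(is_valid_accession("example-repo-a", "AB123456"))
print(is_valid_accession("example-repo-a", "AB123456-extra"))
```

Using fullmatch rather than search means an accession number with trailing junk is reported as invalid instead of passing on a partial match.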
Ensure you have Python 3 installed on your system.
Navigate to the script directory:
cd accession_number_validation
Install the required Python packages:
pip install -r requirements.txt
Create a .env file in the project root and add database credentials:
touch .env
Open the .env file and add the following lines:
DB_NAME=<database_name>
DB_USER=<database_username>
DB_PASSWORD=<database_password>
DB_HOST=<database_host>
DB_PORT=<database_port>
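If you want to confirm the file is complete before running anything, a small check like the one below may help. It is a standalone sketch using only the standard library; parse_env and missing_keys are hypothetical helpers, and this is not necessarily how the validation script itself loads its configuration.

```python
REQUIRED_KEYS = ("DB_NAME", "DB_USER", "DB_PASSWORD", "DB_HOST", "DB_PORT")


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=".
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values: dict) -> list:
    """List required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


# Made-up example content; in practice, read the text from your .env file.
example = "DB_NAME=corpus\nDB_USER=reader\nDB_HOST=localhost\nDB_PORT=5432\n"
print(missing_keys(parse_env(example)))
```

To check a real file, pass its contents instead, for example parse_env(open(".env").read()).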
To run the accession_number_validation.py script, use the following command:
python accession_number_validation.py