Data Citation Corpus

This project generates data dumps in JSON and CSV formats.

Requirements

Before running this project, please ensure that you have the following requirements installed on your machine:

PostgreSQL: If it is not already installed, download it from the official PostgreSQL website.
Python 3: If it is not already installed, download it from the official Python website.

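To set up the project, follow these steps:

Make the export scripts executable, e.g.:

chmod +x ./export-script/create_assertion_formatted_table.sh

Create and populate the formatted table:

./export-script/create_assertion_formatted_table.sh

Generate the data dump files:

./export-script/generate_assertion_details.sh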

Create multiple SQL queries to create a table and populate it with the related data, formatted according to the spec document.
Create a bash script, create_assertion_formatted_table.sh, to automate the creation of the table.
Create a bash script, generate_assertion_details.sh, to generate the data dump files. It creates JSON dump files from the formatted table built by create_assertion_formatted_table.sh, then converts each file to CSV using the Python script convert_to_csv.py, following the spec document.
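
The JSON-to-CSV step can be pictured with a minimal sketch like the one below. It is not the project's convert_to_csv.py; the function name json_to_csv is hypothetical, and the sketch assumes each dump file holds a JSON array of flat objects, which the spec document may or may not require.

```python
import csv
import json
from pathlib import Path


def json_to_csv(json_path, csv_path):
    """Convert a JSON file holding a list of flat objects into a CSV file.

    Hypothetical helper for illustration; returns the number of rows written.
    """
    records = json.loads(Path(json_path).read_text(encoding="utf-8"))

    # Collect column names in first-seen order so the header is stable.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        # restval fills in an empty cell when a record lacks a column.
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(records)
    return len(records)
```

Nested objects or arrays would need to be flattened or serialized before writing; how the real script handles them depends on the spec document.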

Accession Number Validation

This script is used to validate the accession numbers in our database against a set of regular expressions for each repository.
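
As a rough illustration of per-repository validation, the sketch below checks an accession number against a pattern looked up by repository name. The repository names and patterns are placeholders invented for this example, and is_valid_accession is a hypothetical name; the real expressions are defined in the project.

```python
import re

# Placeholder patterns for illustration only; the project's real
# per-repository expressions may look quite different.
PATTERNS = {
    "example-repo-a": re.compile(r"[A-Z]{2}\d{6}"),
    "example-repo-b": re.compile(r"EX\d{4,}"),
}


def is_valid_accession(repository, accession):
    """Return True/False for a known repository, None for an unknown one."""
    pattern = PATTERNS.get(repository)
    if pattern is None:
        # No pattern registered, so validity cannot be decided.
        return None
    # fullmatch requires the whole string to match, not just a prefix.
    return pattern.fullmatch(accession) is not None
```

Returning None for an unknown repository keeps "not validated" distinct from "invalid", which may be useful when reporting results.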

Create a .env file in the project root and add database credentials:

touch .env

Open the .env file and add the following lines:

DB_NAME=<database_name>
DB_USER=<database_username>
DB_PASSWORD=<database_password>
DB_HOST=<database_host>
DB_PORT=<database_port>
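
For reference, a file in this format can be read with the standard library alone, as in the sketch below. This is an assumption about usage, not a description of how accession_number_validation.py loads its settings (it may use a dedicated library instead); load_db_settings is a hypothetical helper.

```python
from pathlib import Path

REQUIRED_KEYS = ("DB_NAME", "DB_USER", "DB_PASSWORD", "DB_HOST", "DB_PORT")


def load_db_settings(env_path=".env"):
    """Parse KEY=VALUE lines from a .env file into a dict of DB settings."""
    settings = {}
    for line in Path(env_path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=".
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip("'\"")

    missing = [key for key in REQUIRED_KEYS if key not in settings]
    if missing:
        raise KeyError(f"missing settings in {env_path}: {', '.join(missing)}")
    return {key: settings[key] for key in REQUIRED_KEYS}
```

Failing early on a missing key gives a clearer error than a connection failure later on.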

To run the accession_number_validation.py script, use the following command:


python accession_number_validation.py