billingross / genomics-england-challenge

Technical challenge for Genomics England bioinformatics engineer role

MIT License

0 stars 0 forks source link

Write lambda function to load refSNP data into database #4

Open billingross opened 1 month ago

billingross commented 1 month ago

Task description:

How would you ingest multiple small CSVs into a SQL database? What about as the size of the CSVs got larger?

Download small RefSNP data file to EC2 instance

wget ftp://ftp.ncbi.nih.gov/snp/latest_release/JSON/refsnp-chrY.json.bz2

Approach A

Create an EC2 instance with enough disk to download RefSNP data for every chromosome
Write bash script to download json.bz2 file(s) from ftp to EC2
Decompress
Convert json to csv
Transfer to S3
Upload from S3 into Aurora PostgreSQL

billingross commented 1 month ago

Copy RefSNP data from EC2 to S3

aws s3 cp refsnp-chrY.json.bz2 s3://{my-bucket}/refsnp-chrY.json.bz2

billingross commented 1 month ago

With boto3 the official AWS SDK, I can read an object in S3 in chunks.

billingross commented 1 month ago

Read compressed object from s3: https://stackoverflow.com/questions/70773570/read-compressed-json-file-from-s3-in-chunks-and-write-each-chunk-to-parquet

billingross commented 1 month ago

Read BGZIP2 compressed file from S3:

#!/usr/bin/env python3

import boto3
import gzip
import bz2
import json

input_bucket = "" # bucket name
object_key = "" # object name

s3 = boto3.client('s3')
s3_object = s3.get_object(Bucket=input_bucket, Key=object_key)['Body']
with bz2.open(s3_object, "r") as f:
    for row in f:
        row = json.loads(row)
        # TODO: Handle each row as it comes in...
        print(row)

billingross commented 1 month ago

Created an Aurora PostgreSQL using Amazon RDS (database-1) with default connection to my existing EC2 instance.

billingross commented 1 month ago

Approaches to importing RefSNP data into Amazon RDS Aurora Postgres database

Yeah, just do that.

billingross commented 1 month ago

Need to install the psql utility to interact with my Postgres Database. Looking at the database Configuration tab, it looks like it is running Postgres v15.4 (Engine Version) so based on this post, I'm running the following command:

sudo yum install -y postgresql15

billingross commented 1 month ago

Using the Endpoint listed under the Connectivity & security panel as the host.

billingross commented 1 month ago

Database password is stored in AWS Secrets Manager. I found this out by clicking on "View Connection Details" in the green box that popped up after I created the database. Not sure how I would have found this otherwise.

billingross commented 1 month ago

I am connected to my Postgres database.

billingross commented 1 month ago

I can use wildcard expressions to bulk download TSV files.

wget ftp://ftp.ncbi.nih.gov/snp/latest_release/JSON/refsnp-chr*.json.bz2.md5