billingross opened 1 month ago
Copy RefSNP data from EC2 to S3
```
aws s3 cp refsnp-chrY.json.bz2 s3://{my-bucket}/refsnp-chrY.json.bz2
```
With boto3, the official AWS SDK for Python, I can read an object in S3 in chunks.
Read compressed JSON from S3 in chunks: https://stackoverflow.com/questions/70773570/read-compressed-json-file-from-s3-in-chunks-and-write-each-chunk-to-parquet

Read a bzip2-compressed file from S3:
```python
#!/usr/bin/env python3
import bz2
import json

import boto3

input_bucket = ""  # bucket name
object_key = ""    # object key

# Stream the object body instead of downloading the whole file first.
s3 = boto3.client("s3")
s3_object = s3.get_object(Bucket=input_bucket, Key=object_key)["Body"]

# bz2.open accepts any file-like object, including the S3 StreamingBody,
# so rows are decompressed and parsed one at a time.
with bz2.open(s3_object, "rt") as f:
    for row in f:
        row = json.loads(row)
        # TODO: Handle each row as it comes in...
        print(row)
```
Created an Aurora PostgreSQL database (database-1) using Amazon RDS, with the default connectivity option to my existing EC2 instance.
Approaches to importing RefSNP data into the Amazon RDS Aurora PostgreSQL database
Yeah, just do that.
Need to install the `psql` utility to interact with my Postgres database. Looking at the database Configuration tab, it looks like it is running Postgres v15.4 (Engine Version), so based on this post, I'm running the following command:
```
sudo yum install -y postgresql15
```
Using the Endpoint listed under the Connectivity & security panel as the host.
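As a sketch, the connection command looks something like this, assuming the default master username postgres (the endpoint below is a placeholder):
```
psql -h database-1.cluster-xxxxxxxxxxxx.us-east-1.rds.amazonaws.com -p 5432 -U postgres -d postgres
```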
Database password is stored in AWS Secrets Manager. I found this out by clicking on "View Connection Details" in the green box that popped up after I created the database. Not sure how I would have found this otherwise.
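The password can also be fetched programmatically with boto3. A minimal sketch, assuming an RDS-managed master user secret; the secret name below is a placeholder to look up in the Secrets Manager console:
```python
import json

import boto3

SECRET_ID = "rds!cluster-EXAMPLE"  # placeholder; copy the real name from the console

client = boto3.client("secretsmanager")
response = client.get_secret_value(SecretId=SECRET_ID)

# RDS-managed secrets store a JSON string with "username" and "password" keys.
credentials = json.loads(response["SecretString"])
print(credentials["username"])
```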
I am connected to my Postgres database.
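As a first pass at the import itself, a minimal sketch using psycopg2; the endpoint, credentials, table, and columns are all assumptions, not an existing schema:
```python
import bz2
import json

import psycopg2  # assumes psycopg2-binary is installed on the EC2 instance

conn = psycopg2.connect(
    host="database-1.cluster-xxxxxxxxxxxx.us-east-1.rds.amazonaws.com",  # placeholder endpoint
    dbname="postgres",
    user="postgres",
    password="...",  # retrieved from Secrets Manager
)
with conn, conn.cursor() as cur:
    # Hypothetical table: store each record's ID plus the raw document.
    cur.execute("CREATE TABLE IF NOT EXISTS refsnp (refsnp_id text, doc jsonb)")
    with bz2.open("refsnp-chrY.json.bz2", "rt") as f:
        for line in f:
            record = json.loads(line)
            cur.execute(
                "INSERT INTO refsnp (refsnp_id, doc) VALUES (%s, %s)",
                (record.get("refsnp_id"), json.dumps(record)),
            )
```
Row-at-a-time INSERTs will be slow for the larger chromosomes; COPY or batched inserts would be the next step, but this verifies the pipeline end to end.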
I can use wildcard expressions to bulk download files from the NCBI FTP site (here, the .md5 checksums for the JSON dumps).
```
wget ftp://ftp.ncbi.nih.gov/snp/latest_release/JSON/refsnp-chr*.json.bz2.md5
```
Task description: Download small RefSNP data file to EC2 instance

Approach A:
- Download `json.bz2` file(s) from FTP to EC2
- Convert JSON to CSV (see the sketch below)
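A minimal sketch of the conversion step, streaming the compressed file and writing a couple of fields per record; the field names are assumptions about the RefSNP JSON schema and would need checking against a real record:
```python
#!/usr/bin/env python3
import bz2
import csv
import json

# Stream the bzip2-compressed JSON (one record per line) and emit CSV.
with bz2.open("refsnp-chrY.json.bz2", "rt") as f_in, \
        open("refsnp-chrY.csv", "w", newline="") as f_out:
    writer = csv.writer(f_out)
    writer.writerow(["refsnp_id", "create_date"])  # assumed field names
    for line in f_in:
        record = json.loads(line)
        writer.writerow([record.get("refsnp_id"), record.get("create_date")])
```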